Precision annotation of digital samples in NCBI’s gene expression omnibus

Hadley, Dexter; Pan, James; El-Sayed, Osama; Aljabban, Jihad; Aljabban, Imad; Azad, Tej D.; Hadied, Mohamad O.; Raza, Shuaib; Rayikanti, Benjamin Abhishek; Chen, Bin; Paik, Hyojung; Aran, Dvir; Spatz, Jordan; Himmelstein, Daniel; Panahiazar, Maryam; Bhattacharya, Sanchita; Sirota, Marina; Musen, Mark A.; Butte, Atul J.

doi:10.1038/sdata.2017.125


Article
Open access
Published: 19 September 2017


Dexter Hadley¹,
James Pan²,
Osama El-Sayed³,
Jihad Aljabban⁴,
Imad Aljabban⁴,
Tej D. Azad²,
Mohamad O. Hadied⁵,
Shuaib Raza⁶,
Benjamin Abhishek Rayikanti⁷,
Bin Chen¹,
Hyojung Paik¹^nAff10,
Dvir Aran ORCID: orcid.org/0000-0001-6334-5039¹,
Jordan Spatz¹,
Daniel Himmelstein ORCID: orcid.org/0000-0002-3012-7446⁸,
Maryam Panahiazar¹,
Sanchita Bhattacharya ORCID: orcid.org/0000-0002-3056-0733¹,
Marina Sirota¹,
Mark A. Musen⁹ &
…
Atul J. Butte ORCID: orcid.org/0000-0002-7433-2740¹

Scientific Data volume 4, Article number: 170125 (2017) Cite this article


Abstract

The Gene Expression Omnibus (GEO) contains more than two million digital samples from functional genomics experiments amassed over almost two decades. However, individual sample meta-data remains poorly described by unstructured free text attributes, preventing its large-scale reanalysis. We introduce the Search Tag Analyze Resource for GEO as a web application (http://STARGEO.org) to curate better annotations of sample phenotypes uniformly across different studies, and to use these sample annotations to define robust genomic signatures of disease pathology by meta-analysis. In this paper, we target a small group of biomedical graduate students to show rapid crowd-curation of precise sample annotations across all phenotypes, and we demonstrate the biological validity of these crowd-curated annotations for breast cancer. STARGEO.org makes GEO data findable, accessible, interoperable and reusable (i.e., FAIR) to ultimately facilitate knowledge discovery. Our work demonstrates the utility of crowd-curation and interpretation of open ‘big data’ under FAIR principles as a first step towards realizing an ideal paradigm of precision medicine.

Introduction

The paradigm of precision medicine^1–6 is based largely on first understanding the genomic features of disease and then designing biomarkers and drugs that identify and rescue these genomic defects respectively. Thus far, precision medicine has gained the most traction in cancer⁷ where for both non-small cell lung cancer and breast cancer, for instance, the standard-of-care now includes sequencing of genes such as EGFR or quantitating panels of RNA such as those included in Oncotype DX, respectively, to drive therapeutic decisions for new subtypes of patients⁷. Moreover, clinical trials are ongoing to develop a precision medicine approach to other diseases such as those that affect the cardiovascular^8–11 and neuropsychiatric^12,13 systems among others. In fact, the National Research Council recently affirmed that to realize the practice of precision medicine requires building a molecular taxonomy or nosology from functional gene targets defined across many different diseases¹⁴. However, the dearth of machine readable public genomics data appropriately curated over a great number of diseases has largely precluded such efforts.

Meanwhile, the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) is perhaps the largest of a number of open functional genomics data repositories^15–17, and GEO data is rich and complete over a great many diseases and phenotypes. There are currently gene expression measurements openly available on over 2 million samples drawn from experiments amassed since the year 2000^18–20. Funding agencies, such as the National Institutes of Health (NIH), mandate the public sharing of data from functional genomics experiments, and GEO doubles in size every two years on average. To date, over 21,000 PubMed publications have been derived from over 1,000,000 digital samples (see http://STARGEO.org/stats) measured by microarrays, the single largest type of genomic data within GEO¹⁹. While this data can openly be used to define a precision medicine ideal, GEO itself and almost all other attempts at crowd-curation of sample level annotations have largely failed to embrace the guiding principles to make data generated more findable, accessible, interoperable and reusable (i.e., FAIR principles)²¹ to ultimately enhance the ability of machines and individuals to leverage GEO data for downstream scientific inquiry.

For instance, while GEO itself elicits a basic level of sample curation for contrasts of phenotypes in GEO DataSets, this curation is largely used to visualize gene expression in a given study (or Series), and most importantly DataSet annotations disregard FAIR principles and are not standardized across studies. Similarly, while the Gene Expression Atlas by ArrayExpress¹⁷ employs a combination of small scale biomedical manual curation by experts with sophisticated text-mining tools to annotate samples with a structured bioontology across studies²², their approach is neither open nor embraces FAIR principles and thus cannot scale to GEO’s millions of individual samples. As of October 2015, ArrayExpress had annotated 2,330 datasets studying samples in 6,345 differential comparisons across 25 different organisms (http://www.ebi.ac.uk/gxa/release-notes.html). Moreover, the few other crowd curation attempts^23,24, including some with an interactive meta-analysis portal (http://metasignature.stanford.edu/), have either failed in their scale to annotate GEO and/or failed to embrace FAIR principles that encourage sustained and ever more useful crowd curation.

With this immediate and increasing need to better mine large open data stores to foster new knowledge discovery, the NIH launched the Big Data to Knowledge (BD2K) initiative to maximize the use of biomedical big data for individual investigators and the overall research community. Towards this end, we introduce the Search Tag Analyze Resource for GEO (STARGEO.org) as an NIH/BD2K-funded online platform to share open crowd-curation of digital samples. Currently, no large-scale central repository of annotations exists that the biomedical community can leverage to characterize the molecular genomic pathology of disease. STARGEO.org fills that gap by providing a convenient web-based annotation interface to facilitate precise curation of digital samples, as well as an analysis portal to easily generate robust genomics signatures from meta-analysis of the genomics data and crowd curated annotations.

Towards that end, we recruited a small group of up to 10 biomedical students to develop STARGEO.org into a structured functional genomics database of digital samples curated for relevant biological features. Specifically, we targeted graduate biomedical students as a scientific crowd enriched with disease knowledge whose members are most incentivized to learn about disease genomics in preparation for their careers. Indeed, studies have shown that levels of intrinsic motivation far outweigh extrinsic motivation in inducing crowd participation and maintaining precision of task performance²⁵. Therefore, we hypothesize that leveraging a small crowd of biomedical graduate students for the curation of biological features with STARGEO.org will result in a precisely annotated dataset of samples that may be used for large-scale translational discovery.

In this work, we demonstrate a rapid crowd-curation of sample annotations across all phenotypes, we report a high precision of annotation among curators and characterize common annotation mistakes, and we demonstrate high and significant biological validity of crowd-curated annotations on open data for characterizing the genomic pathology of breast cancer.

Results

STARGEO.org genomic discovery process

The general STARGEO.org workflow for using GEO data to define genomic signatures is shown in Fig. 1. Curators first search for free text attributes to apply Tags across multiple studies before they can analyze genomic signatures of disease. For each functional genomics experiment deposited in GEO, a Series provides a focal point and description of the whole study by linking together a group of related Samples. STARGEO.org continually downloads raw data from GEO for all the unstructured free text Sample and Series attributes (defined in the original data deposition by the study authors) for genome-wide human expression profiling by microarray experiments. We deposit the free text attributes into a database that is indexed to facilitate full text searches of both Sample and Series attributes. This search functionality is built into STARGEO.org, allowing curators to efficiently find specific samples of interest described by specific keywords and modifiers, thereby immediately facilitating Findability and Accessibility of raw GEO sample attributes under FAIR principles.

**Figure 1: STARGEO.org genomics discovery process.**

Furthermore, we keep the STARGEO.org data in sync with GEO data, which is continually being updated. Once appropriate studies to curate are found, STARGEO.org curators make annotations on Tags that represent knowledge about digital samples. Specifically, we define Tags and annotations as key:value bindings where Tags are the keys that hold annotation values. We allow users to map Tags to formal ontologies sourced from the National Center for Biomedical Ontology’s BioPortal²⁶ to immediately make their crowd-curation data Interoperable and machine readable. Also, we provide a snapshot interface for users to quickly assemble and ultimately freeze snapshots of annotations and digitally publish their snapshots to Zenodo (https://zenodo.org) to promote Reusability.

In addition, we automatically map all probe sets or Platforms deposited in GEO to the National Center for Biotechnology Information’s Entrez gene IDs²⁷ to allow users to perform robust meta-analyses across Series to define differentially expressed genes. The results we describe here are based on raw GEO data downloaded for 465,770 digital Samples from 11,903 Series (experiments) across 1,682 different Platforms (chipsets) as of December 19, 2013, and we report on 490,110 total sample annotations made on 5,798 series across 278 independent Tags made on that data through December 31, 2015.

The STARGEO.org curation process

We implemented the STARGEO.org annotation process (Fig. 2) to allow for manual curation through a simple Tagging interface based on interactive regular expressions (RegExs). Tags define a standardized nomenclature across experiments to represent biological phenotypes such as age, gender, survival, or case or control status of a disease. Specifically, we define Tags as curator-assigned key:value bindings for digital samples where the names of tags are reusable keys that are bound to sample annotation values (for example Age:50, Gender:Female, Cancer:True, etc.). When data is deposited in GEO, the submitter uses specific words or phrases in the raw data attributes to describe contrasts in sample phenotypes. RegExs have long been a standardized syntax in computer science to efficiently match and extract text²⁸, and they allow curators to select subsets of Samples in a given Series for mass annotation.

**Figure 2: STARGEO.org curation process.**

With STARGEO.org, curators design RegExs at the Series level to match and thus discriminate linked Samples in order to assign appropriate Tags. Therefore, a Series with thousands of Samples is Tagged with the same effort as a Series with only ten samples once an appropriate RegEx is used to discriminate Sample level annotations. The web application features real-time highlighting of annotations being applied to samples to make clear the result of any RegEx being applied to Tag samples (Fig. 2). Most of our curators have been able to design RegExs to match Tags that hold Boolean annotations (such as case/control status). However, more RegEx-savvy users can use parentheses to directly ‘capture’ matching categorical (such as cancer subtype) or quantitative annotations (such as age). In our analysis of the precision of making RegExs below, we find capturing RegEx annotations to be more error prone than simply matching (Boolean) RegEx annotations, and suggest that users should explicitly enumerate Boolean matches for any given set of categorical annotations, and that users only capture quantitative phenotypes.
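The difference between matching and capturing can be illustrated with a minimal Python sketch. The sample identifiers, attribute strings and function names below are hypothetical and are not taken from GEO or from the STARGEO.org code base; the sketch only assumes that each Sample carries one free text string and that a Tag is produced by applying a single RegEx to every Sample in a Series.

```python
import re

# Hypothetical free text attributes for four Samples in one Series
# (illustrative values only, not real GEO records).
samples = {
    "GSM1": "tissue: breast tumor; age: 50; gender: female",
    "GSM2": "tissue: normal breast; age: 43; gender: female",
    "GSM3": "tissue: breast tumor; age: 61; gender: female",
    "GSM4": "tissue: normal breast; gender: female",
}


def tag_boolean(samples, pattern):
    """Boolean Tag: True for every Sample whose text matches the RegEx."""
    regex = re.compile(pattern, re.IGNORECASE)
    return {gsm: bool(regex.search(text)) for gsm, text in samples.items()}


def tag_capture(samples, pattern, cast=str):
    """Captured Tag: value of the first parenthesized group, or None."""
    regex = re.compile(pattern, re.IGNORECASE)
    annotations = {}
    for gsm, text in samples.items():
        match = regex.search(text)
        annotations[gsm] = cast(match.group(1)) if match else None
    return annotations


# Matching RegEx for a Boolean case/control Tag.
breast_cancer = tag_boolean(samples, r"breast tumor")

# Capturing RegEx for a quantitative Tag; GSM4 has no age and stays None.
age = tag_capture(samples, r"age:\s*(\d+)", cast=int)
```

A single RegEx annotates the whole Series regardless of its size, and the capturing variant leaves a Sample unannotated when the attribute is missing, which is one way captured annotations can silently diverge between curators.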

Crowd-curation of STARGEO.org annotations

To instantiate the database with high quality annotations, we recruited ten biomedical graduate students from across the country to curate samples for disease and other biological phenotypes, and we designed a reimbursement scheme to reward their precision in making annotations. We used word of mouth and social media to reach out to potential curators. Our sole criterion was that curators had at least some graduate level training in the biomedical sciences. We used Twitter to strategically recruit curators that would be interested in learning about disease and defining genomic disease signatures. Specifically, we sent direct messages to Twitter users with any keywords like ‘biomedicine’, ‘translational medicine’ or ‘research’ in their profile descriptions as well as key words like ‘student’ and ‘MD’ and/or ‘PhD’ to capture their educational exposure. In all, we recruited four different biomedical graduate students from the local Bay area (Stanford and UCSF), and we recruited an additional six biomedical graduate students across the United States with our Twitter outreach.

With this small crowd of curators from 12/1/2014 through 12/31/2015, we made 490,110 total sample annotations using 278 Tags across 149,380 distinct Samples drawn from 1,639 distinct Series. This represents 32% of the 465,770 digital Samples we downloaded from GEO that we have annotated with at least one Tag, or 14% of the 11,903 Series we downloaded from GEO (Fig. 3). We found our Series level approach to annotating Samples scaled very quickly; in about six weeks we were able to amass over 360,000 individual sample annotations among ten biomedical graduate students. To achieve this rate of coverage, we reimbursed curators to exhaust a total budget of $10,000 during the initial six-week period. We found that this initial reimbursement drove the initial rate of coverage and validation, and once we exhausted our budget, the rate of validation plateaued. Nonetheless, without any reimbursement, some students continued to annotate new samples to define differentially expressed genes and learn about the molecular pathology of disease for their own purposes. Interestingly, the figure also shows a spike in annotations over the summer months independent of any reimbursement as the students had the time and interest to contribute.

**Figure 3: Cumulative distribution of sample annotations.**

Strategically, curators were allowed the freedom to define new Tags in order to represent any phenotype of interest. We employed a text based system to define new Tags to facilitate complete flexibility to describe any biological or experimental feature. In the initial six-week period of reimbursement, virtually all Tags represented disease states except for demographics such as Age, Gender, etc. We manually controlled the vocabulary of Tags by collapsing obvious duplicates (such as ‘Breast Cancer’ and ‘BRCA’) where appropriate. For disease related phenotypes, such as case or control status, we mapped curator-supplied tags to the Disease Ontology^29,30 post-hoc to further standardize the semantic consistency of Tags across studies and to facilitate cohort selection of contrasts for meta-analyses.

Precision of STARGEO.org annotations

To test the precision in making the 490,110 sample annotations we acquired, we implemented a validation interface for blinded cross-annotation among the curators, i.e., different curators made independent annotations to check the annotation concordance as a measure of precision of already Tagged Samples. We reimbursed pairs of curators 5 cents for every concordant sample annotation to drive precision, and curators were only reimbursed for 100% concordant Sample annotations per Series. To minimize the potential for abuse of our reimbursement scheme and to ensure the highest reliability of our measured cross-annotation precision, we sought to facilitate true independence of the cross-annotations among different curators. Specifically, we hid all GEO identifier fields to completely blind the cross-annotation interface such that curators could not easily duplicate concordant Sample annotations for a given Series. Similarly, we hid RegExs submitted by users to again discourage automated cross-annotation without manual review.

With this validation and reimbursement scheme, we made 154,770 distinct annotations across 141 unique Tags (Supplementary File 1) that were blindly validated by an independent curator with an overall concordance rate of 91.1%. These annotations were made on 92,335 distinct Samples drawn from 1,193 distinct Series. Of the 141 distinct Tags that were used, 70% (98/141) were mapped post-hoc to bioontology.org covering 84% (130,139/154,770) of distinct cross-annotations made (Table 1). As multiple Tags can annotate a given Series, we made 2,084 original annotations at the Series level that were blindly cross-annotated by an independent curator (Supplementary File 2). Cohen’s Kappa coefficient of agreement is a more statistically robust measure of precision than concordance^31–33, and we estimated Cohen’s Kappa coefficient for 1,827 pairs of Series containing Samples blindly cross-annotated for the same Tag (Fig. 4a). While Samples from the remaining 257 pairs of comparisons at the Series level were highly concordant, Kappa remained undefined because annotations were uniform within each Series, with no variability. We found the mean Kappa estimate was 0.86, and 81% (1,487/1,827) of the comparisons had perfect Kappa coefficients of 1.0.
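The two precision measures can be sketched in a few lines of Python. This is a minimal illustration using the standard definition of Cohen’s Kappa, not the STARGEO.org implementation; the function names are hypothetical, and the inputs are assumed to be two equal-length lists holding the original and the validating curator’s annotations for the same Samples of one Series and Tag.

```python
from collections import Counter


def concordance(original, validation):
    """Fraction of Samples on which the two curators agree."""
    if len(original) != len(validation) or not original:
        raise ValueError("annotation lists must be non-empty and equal in length")
    return sum(a == b for a, b in zip(original, validation)) / len(original)


def cohens_kappa(original, validation):
    """Cohen's Kappa for two curators; None when Kappa is undefined."""
    n = len(original)
    observed = concordance(original, validation)
    freq_a, freq_b = Counter(original), Counter(validation)
    # Agreement expected by chance from each curator's label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both curators used one identical label for every Sample,
        # so there is no variability and Kappa cannot be computed.
        return None
    return (observed - expected) / (1.0 - expected)


agree = cohens_kappa([True, True, False, False], [True, True, False, False])
inverted = cohens_kappa([True, True, False, False], [False, False, True, True])
uniform = cohens_kappa([True, True, True], [True, True, True])
```

Here `agree` is 1.0, `inverted` is −1.0 (a balanced Series in which one curator tagged cases and the other tagged controls), and `uniform` is `None`, the undefined case that arises when a Series carries a single annotation value.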

Table 1 Cross-annotations mapped to ontologies.


**Figure 4: Discordance among cross-annotations.**

Besides these pairs of annotations sharing perfect agreement, the next most common pattern of agreement centered on Kappa=0, which represents random agreement in 156 comparisons (−0.25<=Kappa<=0.25). We found this random pattern of agreement between pairs of expert annotators involved mistakes in defining RegExs such as with capturing Age, the most frequent phenotype annotated initially and subsequently validated. However, other examples of random agreement involved poorly designed Tags that asked for ambiguous annotations. For instance, one example was the MB_Histology Tag for the GSE21140 Series, which is supposed to represent a histological annotation for medulloblastoma. The original RegEx captured categorical annotations of medulloblastoma histology (RegEx=‘(Classic|Desmoplastic|Large cell anaplastic|MBEN)’). However, the validation RegEx matched on whether the patient had primary medulloblastoma (RegEx=‘Primary medulloblastoma’). When grouped by Tags across multiple Series and Samples, the most discordant tags (Fig. 4b) all derived from either curator mistakes in defining a RegEx to capture a quantitative value (Onset_age and pH) or poorly designed or ambiguous Tags (MB_Histology, MB_Gender).

Additionally, there was a distinct subset of 10 pairs of sets of Sample annotations with perfect disagreement where Kappa=−1. Almost inevitably, these were mistakes made in matching the RegEx for case or control status. For instance, the largest Series with a Kappa=−1 on cross annotation of 144 Samples was for the RCC_control Tag for the GSE53757 Series, which represents control samples for renal cell carcinoma. The original annotation matched samples with normal kidney (RegEx=‘normal kidney’) while the validation annotation matched renal cell carcinoma patients (RegEx=‘clear cell renal cell carcinoma’).

Validation of STARGEO.org annotations

To validate the biological accuracy of STARGEO.org annotations, we used The Cancer Genome Atlas (TCGA)³⁴ as a gold standard for a well annotated set of samples of functional genomics data, and we compared the rank correlation of the summary statistics for the tumor-normal differential expression of STARGEO.org versus TCGA samples. In particular, breast cancer is the best represented disease among TCGA samples, and we performed differential gene analysis on RNA-Seq data from 1,119 cases of breast cancer tumors relative to 113 normal breast tissue samples as controls. We generated a comparable STARGEO.org measure of differential gene expression for breast cancer with meta-analysis (http://STARGEO.org/analysis/249/) using our crowd-curated annotations. In all, we used 1,234 tumors (cases) versus 535 normal (control) samples of breast tissue over 27 different GEO studies from STARGEO.org. Overall, we found a significant (P<=0.01) Spearman rank correlation of 0.77 (Fig. 5a) across all 19,725 gene effects estimated for STARGEO.org and TCGA data, and we found 3,168 genes that are significant at a false discovery rate of 0.1 in both TCGA and STARGEO.org after correcting for multiple tests. Moreover, among the top 200 genes (1%), we found an overlap of 92 most down-regulated versus 98 most up-regulated (Fig. 5b) shared by both STARGEO.org and TCGA analyses. This result is highly significant, as an overlap of only about two genes (200 × 200 / 19,725 ≈ 2) is expected by chance.

**Figure 5: STARGEO.org validation in breast cancer from TCGA.**

Discussion

Robust gene signatures discovered through public disease-related datasets have had tremendous translational impact for biomarker and drug discovery³⁵ across transplant rejection³⁶, lung cancer³⁷, pancreatic cancer³⁸, chronic renal disease³⁹, preeclampsia^40,41, and sepsis⁴² among others. However, defining robust gene signatures from public data involves a laborious process requiring substantial technical expertise to download, curate, and analyze digital samples across different datasets. While physicians and scientists are the disease experts most incentivized to annotate and subsequently interpret GEO data, the significant bioinformatics burden to do so precludes their efforts. STARGEO.org immediately solves this problem for individual researchers by providing robust meta-analyzed genomics signatures to users based on their curated annotations of digital samples through the convenient web application. Moreover, STARGEO.org provides a natural mechanism to check those curations for precision and consistency by embracing FAIR principles for crowd curation. This stands in stark contrast to other attempts to annotate GEO, including GEO itself, that disregard FAIR principles, thereby handicapping the sustainability of such efforts and the development of any robust digital curation community.

In this work, we introduce STARGEO.org as a novel web-based application to gain better descriptions of GEO sample phenotypes uniformly across different studies and to define robust differentially expressed gene signatures of disease by meta-analysis of gene expression. Most importantly, STARGEO.org specifically makes every free text attribute we source from GEO as well as all curation and analysis data we generate immediately FAIR. Moreover, by targeting and reimbursing a specialized crowd of biomedical graduate students, we are able to leverage STARGEO.org to curate biological features with high precision. We found that without any bioinformatics training or experience, the students we recruited were able to dynamically conduct sophisticated meta-analyses to define robust signatures of disease and ultimately discover the molecular pathology of different diseases. As a proof of principle, we demonstrate the biological accuracy of these crowd-curated annotations by significantly recapitulating differentially expressed genes that define breast cancer relative to a TCGA gold standard for a well annotated functional genomics dataset.

We acknowledge that we cannot estimate the performance of Tags to accurately capture crowd-curation of sample annotations for lack of an appropriate gold standard of annotations from the original data depositor. In fact, the gold standard of open data curation is manual review by human curators as we perform here twice with high precision. The high inter-rater reliability we observed among curators suggests that Tags can reproducibly capture the features of biological samples that the original data depositor intended to share. In the absence of an appropriate gold standard, however, it is reasonable to assess curation performance by consensus theory or majority opinion because aggregation of independent responses across curators is more accurate than any individual curator’s response^43–45, and this relationship is robust and independent of any explicit bias among curators⁴⁶. Therefore, while for lack of a gold standard it remains unclear how sensitive or specific the crowd-curation annotations are, we assume that the annotations are accurate given the high inter-rater reliability metrics we demonstrate here, despite any individual curator’s unknown bias.

Finally, STARGEO.org is designed to be for crowd curation of open data what GitHub has been for open source code development: i.e., a community of curators that can openly build large sets of annotations together. Specifically, STARGEO.org is designed to support existing best practices to make research data more findable, accessible, interoperable and reusable (i.e., FAIR principles) to ultimately facilitate knowledge discovery. We embrace FAIR principles for both the crowd-curated sample annotations we generate and the raw sample attribute free-text data that we download from GEO. By making raw GEO data Findable and Accessible, we immediately provide a valuable tool beyond the standard search interface that GEO provides. By building in ontology-mapping functionality from bioontology.org to map our Tags, we immediately make our crowd-curation data Interoperable. We provide a snapshot interface for users to quickly assemble and ultimately freeze snapshots of annotations and digitally publish their snapshots to Zenodo.org (https://zenodo.org) to promote Reusability. Therefore, by adopting FAIR principles, we may transform STARGEO.org into a translational community resource that can be used to capture open digital curation to characterize the functional genomics of disease on a large scale towards discovery of novel drugs and biomarkers in this age of precision medicine.

Methods

Using the Amazon Web Services cloud infrastructure, we downloaded over 1.7 TB of public data for all processed expression data and associated attributes for series, samples, and platforms catalogued in GEO (ftp://ftp.ncbi.nih.gov/pub/geo/DATA/). We developed a scalable database schema to represent their attributes as Tags, defined as curator-assigned key:value bindings for digital samples where key sample tags are bound to sample annotation values (for example Age:50, Gender:Female, Cancer:True, etc.), on an open-source PostgreSQL (https://www.postgresql.org) relational database management system backend. With this schema, we implemented a web application in the Python (https://www.python.org) programming language using the Django (https://www.djangoproject.com) web development framework that allowed us to crowd-curate a semantic network of Tags and appropriate sample annotations representing biological diseases and other phenotypes. We also implemented the functionality for users to quickly assemble and ultimately freeze snapshots of annotations on STARGEO.org and digitally publish their annotation datasets to Zenodo.org (https://zenodo.org) for formal citation.

For the data behind the web application described here, we filtered GEO for ‘expression profiling by microarray’ in humans to find 465,770 digital Samples from 11,903 Series (experiments) across 1,682 different Platforms (chipsets) as of December 19, 2013, and we report on curations made on this raw GEO data through 12/31/2015. We full text indexed all 14,874,580 sample and 283,883 series attributes to facilitate rapid searches at the sample attribute level, a task currently impossible on GEO. We leveraged regular expressions (RegExs) in Python to design an annotation interface for curators to use to quickly annotate samples with Tags to represent biological interpretation. We integrated a blinded validation scheme that allowed for cross-annotation of Tags on which we derived measurements of precision. We used simple concordance estimates as well as Cohen’s Kappa statistic³³ to measure precision on annotations on blind cross-annotation by independent curators. Additionally, we mapped all microarray probe identifiers to Entrez gene²⁷ identifiers using the mygene.info⁴⁷ community annotation service. Finally, we designed a simple analytical interface where more advanced curators could design, compute and visualize standard genomic meta-analysis⁴⁸ of random and fixed effects across tagged and annotated digital samples.

We used STARGEO.org to define a genomic signature for breast cancer from crowd-curated annotations and compared it with a genomic signature for breast cancer using TCGA data. We used STARGEO.org mappings, based on the mygene.info⁴⁷ gene annotation service, to map all probe identifiers to Entrez gene identifiers. For STARGEO.org, we used samples with crowd-curated annotations that were made across 1,234 cases versus 535 control samples from 27 different GEO experiments. For every gene measured in each study, we estimated the mean difference of contrasts for expression as well as the standard deviation of that mean difference. We used a standard meta-analysis with (1) fixed and (2) random effects models to combine these estimates across studies to generate meta P-values and meta effects across studies. Specifically, we used inverse variance weighting for pooling of the data across studies, and calculated weights for estimates of random effects with continuous outcome data via the DerSimonian-Laird estimate⁴⁹. We used Python to implement these analyses in STARGEO.org. All raw GEO data, curations, and analyses that we generate are available at the http://STARGEO.org web application portal with documentation for programmatic download via a representational state transfer (ReST) application programming interface (API) through http://STARGEO.org/docs.
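For a single gene, the pooling step described above can be sketched as follows. This is a textbook inverse variance meta-analysis with the DerSimonian-Laird between-study variance, written with the Python standard library only; it is an illustrative sketch and not the STARGEO.org source code. The function name, the input layout (one effect estimate and one variance per study) and the use of a two-sided normal approximation for the P-value are assumptions made for the example.

```python
import math


def meta_analysis(effects, variances):
    """Pool per-study effects under fixed and random effects models.

    effects:   per-study mean differences for one gene
    variances: squared standard deviations of those mean differences
    """
    k = len(effects)
    if k == 0 or k != len(variances):
        raise ValueError("need one variance per effect estimate")

    # Fixed effects: weight each study by the inverse of its variance.
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw

    # DerSimonian-Laird estimate of the between-study variance tau^2.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Random effects: add tau^2 to every study variance before weighting.
    w_star = [1.0 / (v + tau2) for v in variances]
    sw_star = sum(w_star)
    random_effect = sum(wi * yi for wi, yi in zip(w_star, effects)) / sw_star

    def summarize(estimate, total_weight):
        se = math.sqrt(1.0 / total_weight)
        # Two-sided P-value from a normal approximation (an assumption here).
        p = math.erfc(abs(estimate / se) / math.sqrt(2.0))
        return {"effect": estimate, "se": se, "p": p}

    return {
        "fixed": summarize(fixed, sw),
        "random": summarize(random_effect, sw_star),
        "tau2": tau2,
    }


# Two hypothetical studies that disagree: the random effects model widens
# the standard error, while both models give the same pooled effect.
result = meta_analysis([0.0, 2.0], [1.0, 1.0])
```

When the studies agree closely, the estimated between-study variance is zero and the two models coincide; when they disagree, the random effects model reports a larger standard error and therefore a more conservative meta P-value.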

For TCGA, we downloaded RNA-Seq data already preprocessed to transcript counts across genes and deposited in GEO with clinical annotations from thousands of samples from TCGA and matched controls (GSE62944). We selected 1,119 breast cancer cases versus 113 controls and performed two standard types of analyses to define differentially expressed genes: (1) a statistical T-test based on fragments per kilobase per million sequenced reads (FPKM) estimates⁵⁰, and (2) differential gene expression analysis based on the negative binomial distribution (DESeq2) method⁵¹. We used Spearman rank correlation across all four comparisons of differentially expressed genes between STARGEO.org (random versus fixed effects) meta-analyses and TCGA (FPKM versus DESeq2) analyses. Although all the comparisons were both highly and significantly correlated by Spearman rank correlation, we found the highest correlation between the random effects meta-analysis for STARGEO.org and the FPKM analysis for TCGA, and these are the results we report. To correct significance for multiple tests, we applied the Benjamini–Hochberg procedure⁵² and selected genes with false discovery rate (FDR)<0.1 (10%). For both STARGEO.org and TCGA analyses, we scaled the fold change of each gene’s effect by the significance (−log10(P-value) × fold change), and used this score to rank genes by their differential expression and estimate the overlap among the top 200 (1%) of genes⁵³ shared between the two datasets. All calculations are provided as Supplementary Data (Supplementary File 3).
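The multiple-testing correction and the ranking score can be sketched in Python as shown here. The Benjamini–Hochberg step follows the standard procedure; the function names, the dictionary layout keyed by gene, and the example values are hypothetical, and the sketch assumes signed fold changes and P-values strictly greater than zero.

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def rank_score(p_value, fold_change):
    """Fold change scaled by significance: -log10(P) x fold change."""
    return -math.log10(p_value) * fold_change


def top_overlap(scores_a, scores_b, k, most_up=True):
    """Genes shared by the top-k lists of two {gene: score} mappings."""
    def top(scores):
        ranked = sorted(scores, key=scores.get, reverse=most_up)
        return set(ranked[:k])
    return top(scores_a) & top(scores_b)


# Hypothetical scores for four genes in two analyses.
stargeo = {"G1": 5.0, "G2": 3.0, "G3": -4.0, "G4": 0.1}
tcga = {"G1": 4.0, "G2": -1.0, "G3": -6.0, "G4": 2.0}

fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
shared_up = top_overlap(stargeo, tcga, k=2, most_up=True)
shared_down = top_overlap(stargeo, tcga, k=2, most_up=False)
```

Because the score keeps the sign of the fold change, the same ranking yields the most up-regulated genes at one end and the most down-regulated genes at the other, so both overlaps come from one sorted list per analysis.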

Additional information

How to cite this article: Hadley, D. et al. Precision annotation of digital samples in NCBI’s gene expression omnibus. Sci. Data 4:170125 doi: 10.1038/sdata.2017.125 (2017).

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Chen, R. & Snyder, M. Promise of personalized omics to precision medicine. Wiley Interdisciplinary Reviews: Systems Biology and Medicine 5, 73–82 (2013).
Highnam, G. & Mittelman, D. Personal genomes and precision medicine. Genome Biol 13, 324 (2012).
Khoury, M. J., Gwinn, M. L., Glasgow, R. E. & Kramer, B. S. A population approach to precision medicine. American Journal of Preventive Medicine 42, 639–645 (2012).
Mirnezami, R., Nicholson, J. & Darzi, A. Preparing for Precision Medicine. New England Journal of Medicine 366, 489–491 (2012).
Robinson, P. N. Deep phenotyping for precision medicine. Human Mutation 33, 777–780 (2012).
Roden, D. M. & Tyndale, R. F. Genomic medicine, precision medicine, personalized medicine: what’s in a name? Clin. Pharmacol. Ther. 94, 169–172 (2013).
Butte, A. J. & Ohno-Machado, L. Making it personal: translational bioinformatics. J. Am. Med. Inform. Assoc 20, 595–596 (2013).
Farnier, M. PCSK9: From discovery to therapeutic applications. Arch. Cardiovasc. Dis. 107, 58–66 (2014).
Cohen, J. et al. Low LDL cholesterol in individuals of African descent resulting from frequent nonsense mutations in PCSK9. Nat. Genet. 37, 161–165 (2005).
Abifadel, M. et al. Mutations in PCSK9 cause autosomal dominant hypercholesterolemia. Nat. Genet. 34, 154–156 (2003).
Stein, E. A. et al. Effect of a monoclonal antibody to PCSK9 on LDL cholesterol. N. Engl. J. Med. 366, 1108–1118 (2012).
Hadley, D. et al. The impact of the metabotropic glutamate receptor and other gene family interaction networks on autism. Nat. Commun 5, 4074 (2014).
Elia, J. et al. Genome-wide copy number variation study associates metabotropic glutamate receptor gene networks with attention deficit hyperactivity disorder. Nat. Genet. 44, 78–84 (2012).
National Research Council (US) Committee on A Framework for Developing a New Taxonomy of Disease. Toward Precision Medicine (National Academies Press, 2011).
Mailman, M. D. et al. The NCBI dbGaP database of genotypes and phenotypes. Nat. Genet. 39, 1181–1186 (2007).
Bhattacharya, S. et al. ImmPort: disseminating data to the public for the future of immunology. Immunol. Res. 58, 234–239 (2014).
Rustici, G. et al. ArrayExpress update--trends in database growth and links to data analysis tools. Nucleic Acids Res. 41, D987–D990 (2013).
Barrett, T. et al. NCBI GEO: archive for functional genomics data sets--10 years on. Nucleic Acids Res. 39, D1005–D1010 (2011).
Barrett, T. et al. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 41, D991–D995 (2013).
Barrett, T. et al. NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res. 37, D885–D890 (2009).
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).
Malone, J. et al. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 26, 1112–1118 (2010).
Amar, D., Hait, T., Izraeli, S. & Shamir, R. Integrated analysis of numerous heterogeneous gene expression profiles for detecting robust disease-specific biomarkers and proposing drug targets. Nucleic Acids Res. 43, 7779–7789 (2015).
Wang, Z. et al. Extraction and analysis of signatures from the Gene Expression Omnibus by the crowd. Nat. Commun 7, 12846 (2016).
Zheng, H., Li, D. & Hou, W. Task Design, Motivation, and Participation in Crowdsourcing Contests. Int. J. Electron. Commer. 15, 57–88 (2011).
Noy, N. F. et al. BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Res. 37, W170–W173 (2009).
Maglott, D., Ostell, J., Pruitt, K. D. & Tatusova, T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 39, D52–D57 (2011).
Kleene, S. C. in Automata Studies (eds. Shannon, C. E. & McCarthy, J.) 3–41 (Princeton University Press, 1956).
Schriml, L. M. et al. Disease ontology: A backbone for disease semantic integration. Nucleic Acids Res. 40, D940–D946 (2012).
Kibbe, W. A. et al. Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Res. 43, D1071–D1078 (2015).
Sim, J. & Wright, C. C. The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements. Phys. Ther. 85, 257–268 (2005).
Viera, A. J. & Garrett, J. M. Understanding interobserver agreement: The kappa statistic. Fam. Med. 37, 360–363 (2005).
Landis, J. R. & Koch, G. G. The measurement of observer agreement for categorical data. Biometrics 33, 159–174 (1977).
Weinstein, J. N. et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).
Dudley, J. T., Schadt, E., Sirota, M., Butte, A. J. & Ashley, E. Drug discovery in a multidimensional world: systems, patterns, and networks. J. Cardiovasc. Transl. Res. 3, 438–447 (2010).
Chen, R. et al. Differentially expressed RNA from public microarray data identifies serum protein biomarkers for cross-organ transplant rejection and other conditions. PLoS Comput. Biol. 6, e1000940 (2010).
Vicent, S. et al. Cross-species functional analysis of cancer-associated fibroblasts identifies a critical role for CLCF1 and IL-6 in non-small cell lung cancer in vivo. Cancer Res. 72, 5744–5756 (2012).
Sharaf, R. N. et al. Computational prediction and experimental validation associating FABP-1 and pancreatic adenocarcinoma with diabetes. BMC Gastroenterol. 11, 5 (2011).
Butte, A. J., Sigdel, T. K., Wadia, P. P., Miklos, D. B. & Sarwal, M. M. Protein microarrays discover angiotensinogen and PRKRIP1 as novel targets for autoantibodies in chronic renal disease. Mol. Cell. Proteomics 10, M110.000497 (2011).
Wen, Q. et al. Peptidomic Identification of Serum Peptides Diagnosing Preeclampsia. PLoS ONE 8, e65571 (2013).
Liu, L. Y. et al. Integrating multiple ‘omics’ analyses identifies serological protein biomarkers for preeclampsia. BMC Med. 11, 236 (2013).
Sweeney, T. E., Shidham, A., Wong, H. R. & Khatri, P. A comprehensive time-course-based multicohort analysis of sepsis and sterile inflammation reveals a robust diagnostic gene set. Sci. Transl. Med. 7, 287ra71 (2015).
Hessel, S. J., Herman, P. G. & Swensson, R. G. Improving performance by multiple interpretations of chest radiographs: effectiveness and cost. Radiology 127, 589–594 (1978).
Kraemer, H. C. How many raters? toward the most reliable diagnostic consensus. Stat. Med. 11, 317–331 (1992).
Metz, C. E. & Shen, J. H. Gains in accuracy from replicated readings of diagnostic images: prediction and assessment in terms of ROC analysis. Med. Decis. Making 12, 60–75 (1992).
Weller, S. C. & Mann, N. C. Assessing rater performance without a ‘gold standard’ using consensus theory. Med. Decis. Making 17, 71–79 (1997).
Wu, C., MacLeod, I. & Su, A. I. BioGPS and MyGene.info: Organizing online, gene-centric information. Nucleic Acids Res. 41, 561–565 (2013).
Choi, J. K., Yu, U., Kim, S. & Yoo, O. J. Combining multiple microarray studies and modeling interstudy variation. Bioinformatics 19, i84–i90 (2003).
DerSimonian, R. & Laird, N. Meta-analysis in clinical trials. Control. Clin. Trials 7, 177–188 (1986).
Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 57, 289–300 (1995).
Xiao, Y. et al. A novel significance score for gene selection and ranking. Bioinformatics 30, 801–807 (2014).

Acknowledgements

Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health (NIH) under Award Number R01GM079719, the National Institute of Allergy and Infectious Diseases of the NIH under Bioinformatics Support Contract HHSN272201200028C, the National Cancer Institute of the NIH under Award Number UH2CA203792, and the National Library of Medicine of the NIH under Award Number 1U01LM012675. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. Author J.P. was supported by the Stanford University School of Medicine MedScholars program. Author J.S. was supported by the Northrop Grumman Corporation, Technology Services.

Author information

Hyojung Paik
Present address: Biomedical HPC Technology Research Center, Korea Institute of Science and Technology Information, 245 Daehak-ro, Yuseong-gu, Daejeon 34141, South Korea.

Authors and Affiliations

Institute for Computational Health Sciences, University of California, San Francisco, 94158, California, USA
Dexter Hadley, Bin Chen, Hyojung Paik, Dvir Aran, Jordan Spatz, Maryam Panahiazar, Sanchita Bhattacharya, Marina Sirota & Atul J. Butte
Department of Neurosurgery, Stanford University School of Medicine, Stanford, 94305, California, USA
James Pan & Tej D. Azad
University of Illinois College of Medicine, Chicago, 60612, Illinois, USA
Osama El-Sayed
Harvard Medical School Department of Immunology, Harvard University, Boston, 02115, Massachusetts, USA
Jihad Aljabban & Imad Aljabban
Wayne State University School of Medicine, Detroit, 48201, Michigan, USA
Mohamad O. Hadied
Yale School of Medicine, Yale University, New Haven, 06519, Connecticut, USA
Shuaib Raza
University of Vermont Medical Center, University of Vermont, Burlington, 05401, Vermont, USA
Benjamin Abhishek Rayikanti
Program in Biological & Medical Informatics, University of California, San Francisco, 94158, CA, USA
Daniel Himmelstein
Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, 94305, California, USA
Mark A. Musen

Contributions

Dexter Hadley: project conceptualization, data collection, data analysis, data interpretation, manuscript drafting, manuscript revision, final manuscript approval; James Pan: data collection, data analysis; Osama El-Sayed: data collection, data analysis; Jihad Aljabban: data collection, data analysis; Imad Aljabban: data collection, data analysis; Tej Azad: data collection, data analysis; Mohamad Hadied: data collection, data analysis; Shuaib Raza: data collection, data analysis; Benjamin Abhishek Rayikanti: data collection, data analysis; Bin Chen: data collection, data analysis; Hyojung Paik: data collection, data analysis; Dvir Aran: data collection, data analysis; Jordan Spatz: data collection, data analysis; Daniel Himmelstein: data collection, data analysis; Maryam Panahiazar: data collection, data analysis, manuscript revision; Sanchita Bhattacharya: data collection, data analysis; Marina Sirota: project conceptualization; Mark Musen: project conceptualization; Atul J. Butte: project conceptualization.

Corresponding author

Correspondence to Dexter Hadley.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary File 1 (XLSX 5715 kb)

Supplementary File 2 (XLSX 144 kb)

Supplementary File 3 (XLSX 4240 kb)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

About this article

Cite this article

Hadley, D., Pan, J., El-Sayed, O. et al. Precision annotation of digital samples in NCBI’s gene expression omnibus. Sci Data 4, 170125 (2017). https://doi.org/10.1038/sdata.2017.125

Received: 05 January 2017
Accepted: 28 July 2017
Published: 19 September 2017
DOI: https://doi.org/10.1038/sdata.2017.125
