Compendiums of cancer transcriptomes for machine learning applications

Lim, Su Bin; Tan, Swee Jin; Lim, Wan-Teck; Lim, Chwee Teck

doi:10.1038/s41597-019-0207-2

Data Descriptor
Open access
Published: 08 October 2019

Scientific Data volume 6, Article number: 194 (2019)

Abstract

A vast number of transcriptome profiles exist in the form of microarray data. The challenge is that they were generated using diverse platforms and preprocessing tools, so cross-dataset analyses require considerable time and informatics expertise. A single, integrated data source would facilitate data reuse for the discovery, analysis, and validation of biomarker-based clinical strategies. Here, we present merged microarray-acquired datasets (MMDs) across 11 major cancer types, curating 8,386 patient-derived tumor and tumor-free samples from 95 GEO datasets. Using machine learning algorithms, we show that diagnostic models trained on MMDs can be directly applied to RNA-seq-acquired TCGA data with high classification accuracy. Machine learning-optimized MMDs further help to reveal the immune landscape across various carcinomas, which is critically needed in disease management and clinical interventions. This unified data source may serve as an excellent training or test set to apply, develop, and refine machine learning algorithms that can be tapped to better define the genomic landscape of human cancers.

Measurement(s)	transcriptome
Technology Type(s)	digital curation
Factor Type(s)	cancer type • health status
Sample Characteristic - Organism	Homo sapiens

Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.9901763

Background & summary

The Cancer Genome Atlas (TCGA) increasingly serves as a ‘training’ reference for machine learning algorithms, offering comprehensive, well-curated genomic data on over 11,000 tumors across 33 major cancer types. In recent years, this rich resource combined with machine learning has facilitated the development of cancer classifiers¹, markers predictive of drug sensitivity², histopathology image-based prognostic predictors³, and novel indices associated with oncogenic dedifferentiation⁴. Vast datasets also exist at the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) in the form of microarray data. Applying machine learning to exploit them, however, is not straightforward; they are often generated using diverse platforms and normalization tools, and are annotated with non-standardized texts and definitions. All of these features add computational complexity to already high-dimensional data, necessitating multiple, intricate analytics tools for data integration and analysis.

To increase the reuse of such legacy data, we generated single, merged microarray-acquired datasets (MMDs) for 11 major cancer types using a uniform R pipeline (Fig. 1). This approach was used in our earlier work to generate merged transcriptome data for a specific cancer type, non-small cell lung cancer (NSCLC), comprising both non-tumor (NT) and tumor tissue (TT) samples⁵. The resulting MMD was used to develop a predictive multi-gene classifier, termed the tumor matrisome index (TMi), for prognosis and prediction of response to adjuvant chemotherapy among NSCLC patients⁶.

Here, we extend the framework to include various carcinomas of epithelial origin. Consistent with prior works⁷⁻¹¹, comparably correlated patterns of genome-wide differential expression (DE) were observed between microarray (MMD) and RNA-seq (TCGA) data. Next, we demonstrate the potential application of MMDs as training data to develop clinical predictive models that can be applied across platforms. By applying CIBERSORT¹², we further show how MMDs can be used to deconvolve the tumor immune microenvironment by parsing specific subpopulations of infiltrating immune cells, in comparison with TCGA datasets of matching cancer types.

Through pan-cancer analysis of MMDs, we recently identified clinically significant matrisomal changes associated with immune response and targetable immune checkpoints for a subset of cancers across different malignancies¹³. The generated cancer type-specific MMDs, the associated clinical metadata and R codes are available at ArrayExpress and figshare (see Data Records and Code Availability). Our open resource of curated large-scale transcriptomic data may provide the basis for the analytical and computational techniques to derive unbiased and new information, enabling predictive modeling for precision oncology.

Methods

MMD generation

A careful GEO search (http://www.ncbi.nlm.nih.gov/geo) was done to ensure the selection of MIAME compliant datasets having the following attributes in the original GEO submission: (1) raw data in CEL files, (2) tissue origin annotation (i.e., NT or TT), and (3) Affymetrix platform annotation. Here, only datasets generated using the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array) were specifically selected to ensure uniform curation of the same probe-sets (i.e., 54,675 probes). Altogether, 95 independent GEO datasets comprising a total of 8,386 samples spanning over 11 cancer types were subjected to pre-processing, normalization, batch-effect correction, data integration and analyses (Table S1). The number of NT and TT samples in each GEO dataset is summarized in Table S2.

Raw expression data from each dataset were first imported into R Bioconductor¹⁴ (RStudio version 1.1.447) using the affy package (version 1.48.0)¹⁵. The ReadAffy function was called with default parameters to read all CEL files, except for the function argument “cdfname”, which was set to “hgu133plus2”. The rma function was subsequently used to background-correct and normalize the expression data from all annotated probe-sets. This preprocessing step was applied to all 95 datasets for uniform processing and feature annotation prior to merging by cancer type. Batch effects were identified and removed using ComBat via the inSilicoMerging package (version 1.14.0)¹⁶. Probes were collapsed to genes by retaining, for each gene, the probe with the maximum mean expression value across samples in each MMD, and were annotated using the hgu133plus2SYMBOL object in the hgu133plus2.db package (version 3.2.2)¹⁷ for subsequent DE analysis.
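The probe-to-gene collapsing step can be illustrated with a minimal sketch. It is written in plain Python rather than the R pipeline used in this work, and it assumes that "maximum mean expression" means keeping, per gene symbol, the single probe whose mean across samples is highest. The probe IDs, gene symbols and values are hypothetical toy data.

```python
from statistics import mean

def collapse_probes_to_genes(expr, probe_to_gene):
    """Keep, for each gene, the probe with the highest mean expression.

    expr: dict mapping probe ID -> list of expression values (one per sample)
    probe_to_gene: dict mapping probe ID -> gene symbol (None if unannotated)
    """
    best = {}  # gene symbol -> (mean expression, probe ID)
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probes without a gene symbol are dropped
        m = mean(values)
        if gene not in best or m > best[gene][0]:
            best[gene] = (m, probe)
    return {gene: expr[probe] for gene, (_, probe) in best.items()}

# Hypothetical toy data: two probes map to GENE1, one probe is unannotated.
expr = {
    "probe_a": [5.0, 6.0, 7.0],
    "probe_b": [8.0, 9.0, 10.0],
    "probe_c": [3.0, 3.5, 4.0],
    "probe_d": [2.0, 2.0, 2.0],
}
probe_to_gene = {
    "probe_a": "GENE1",
    "probe_b": "GENE1",
    "probe_c": "GENE2",
    "probe_d": None,
}
gene_expr = collapse_probes_to_genes(expr, probe_to_gene)
print(gene_expr)
```

Here GENE1 is represented by probe_b, whose mean (9.0) exceeds that of probe_a (6.0).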

TCGA datasets

The Cancer Genome Atlas (TCGA) data were retrieved and processed via the TCGA-Assembler package (version 2.0)¹⁸ (Table S1). Normalized RPKM values were extracted using its ProcessRNASeqData function. Only genes with at least 1 count per million (cpm) or RPKM value in at least 20% of the total number of samples in each cohort were kept via the edgeR package (version 3.12.1)¹⁹. The number of genes filtered out in each TCGA dataset is summarized in Table S3. Selected genes were normalized by Trimmed Mean of M-values (TMM), and were subjected to DE analyses using the voom and lmFit functions in the limma package (version 3.26.9)²⁰. Of note, the ovarian (OV) and melanoma (SKCM) TCGA cohorts were excluded from DE and RRHO analyses due to the lack of NT samples (Table S1). Clinical data including disease status (NT vs. TT) were downloaded via the DownloadBiospecimenClinicalData function in the TCGA-Assembler package (version 2.0)¹⁸.
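The low-expression filter can be sketched as follows. This is a plain-Python illustration of the rule stated above (at least 1 cpm in at least 20% of samples), not the edgeR call used in this work; the function names are hypothetical, and CPM is computed here from raw library sizes only.

```python
def counts_per_million(counts):
    """counts: dict gene -> list of raw counts (one per sample)."""
    n_samples = len(next(iter(counts.values())))
    # Library size of a sample = total counts over all genes in that sample.
    lib_sizes = [sum(v[j] for v in counts.values()) for j in range(n_samples)]
    return {
        gene: [v[j] * 1e6 / lib_sizes[j] for j in range(n_samples)]
        for gene, v in counts.items()
    }

def filter_low_expression(cpm, min_cpm=1.0, min_fraction=0.2):
    """Keep genes reaching min_cpm in at least min_fraction of samples."""
    kept = {}
    for gene, values in cpm.items():
        n_pass = sum(1 for x in values if x >= min_cpm)
        if n_pass >= min_fraction * len(values):
            kept[gene] = values
    return kept

# Hypothetical toy counts for 10 samples; every library sums to 1,000,000.
counts = {
    "A": [600000] * 10,
    "B": [399998, 399999] + [400000] * 8,
    "C": [1, 1] + [0] * 8,  # reaches 1 cpm in 2 of 10 samples: kept
    "D": [1] + [0] * 9,     # reaches 1 cpm in 1 of 10 samples: dropped
}
kept = filter_low_expression(counts_per_million(counts))
print(sorted(kept))
```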

PCA, DE and RRHO analysis

Principal component analysis (PCA) was performed using the prcomp function in the built-in R stats package (version 3.2.2). The first two PCs were visualized using the ggbiplot package (version 0.55)²¹. The lmFit and eBayes functions in the limma package (version 3.26.9)²⁰ were used to perform DE analysis. All genes annotated in each MMD and TCGA dataset were ranked by log fold change (logFC) computed based on their DE between NT and TT samples. These ranked lists were further reconstructed to include only genes that were common to both MMD- and TCGA-derived lists²² (Table S3). These files were loaded into a simplified, web-based version of the rank-rank hypergeometric overlap (RRHO) tool (http://systems.crump.ucla.edu/rankrank/rankranksimple.php). In all cases, the step size was set to 300 to generate Benjamini-Yekutieli-corrected hypergeometric matrices and RRHO heatmaps.
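Preparing the two ranked lists for RRHO can be sketched in a few lines. The example assumes that logFC values have already been computed per gene; the gene names, values and the function name are hypothetical, and ties are broken alphabetically only to make the toy output deterministic.

```python
def ranked_common_genes(logfc_a, logfc_b):
    """Rank genes by descending logFC, restricted to genes in both inputs."""
    common = set(logfc_a) & set(logfc_b)
    rank_a = sorted(common, key=lambda g: (-logfc_a[g], g))
    rank_b = sorted(common, key=lambda g: (-logfc_b[g], g))
    return rank_a, rank_b

# Hypothetical logFC values (TT relative to NT) from two platforms.
mmd_logfc = {"GENE1": 2.5, "GENE2": -1.2, "GENE3": 0.4, "GENE4": 3.1}
tcga_logfc = {"GENE1": 1.9, "GENE2": -2.0, "GENE3": 2.2, "GENE5": -0.3}

mmd_ranked, tcga_ranked = ranked_common_genes(mmd_logfc, tcga_logfc)
print(mmd_ranked)
print(tcga_ranked)
```

GENE4 and GENE5 are dropped because each appears in only one list, so both outputs rank the same set of genes.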

Multi-gene classifiers

Expression data of TMi and other gene signatures of commercially available or previously validated multi-gene tests (MGTs) were extracted from all TT samples across MMD and TCGA datasets, and were loaded into Morpheus (http://software.broadinstitute.org/morpheus/) for sample stratification. The list of MGT genes and the associated references are summarized in Table S4. K-means clustering was performed with “one minus pearson correlation” metric and 1,000 iterations.
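For readers unfamiliar with the clustering metric, the distance named "one minus pearson correlation" can be written out directly. This is a minimal sketch of the distance alone, not of the Morpheus clustering procedure; the expression vectors are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def one_minus_pearson(x, y):
    """0 for perfectly correlated profiles, 2 for perfectly anti-correlated."""
    return 1.0 - pearson(x, y)

# Hypothetical signature-gene profiles of three samples.
s1 = [1.0, 2.0, 3.0, 4.0]
s2 = [2.0, 4.0, 6.0, 8.0]  # same pattern as s1 at a different scale
s3 = [4.0, 3.0, 2.0, 1.0]  # reversed pattern

d_same = one_minus_pearson(s1, s2)
d_opposite = one_minus_pearson(s1, s3)
print(d_same, d_opposite)
```

Because the distance depends only on the shape of the profile, samples with the same relative pattern group together regardless of overall intensity.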

CIBERSORT

Each comprising over 1,500 samples, the breast, colon, and lung MMDs exceeded the load capacity (500 MB) of the CIBERSORT web tool (http://cibersort.stanford.edu/)¹². For these MMDs, 1,000 samples were therefore randomly selected to generate the input “mixture” file. All samples in the remaining MMDs were included in the CIBERSORT analysis. Each run was performed with the default LM22 (22 immune cell types) gene signature using 100 permutations. The resulting immune cell profiles were used to compute the mean fractions of 22 immune cell types and the quantitative change between the two groups (NT vs. TT), denoted as delta (TT – NT, %), per dataset.
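The subsampling and the delta computation can be sketched as below. This is an illustration under stated assumptions: the random seed, function names, sample IDs and fractions are hypothetical, and delta is taken as the difference of group means expressed in percentage points.

```python
import random
from statistics import mean

def subsample(sample_ids, max_n, seed=0):
    """Randomly pick at most max_n samples (the seed is an assumption)."""
    ids = list(sample_ids)
    if len(ids) <= max_n:
        return ids
    return random.Random(seed).sample(ids, max_n)

def delta_fractions(fractions, groups):
    """Mean TT fraction minus mean NT fraction per cell type, in %.

    fractions: dict sample -> {cell type: estimated fraction}
    groups: dict sample -> "NT" or "TT"
    """
    cell_types = next(iter(fractions.values())).keys()
    delta = {}
    for ct in cell_types:
        nt = [f[ct] for s, f in fractions.items() if groups[s] == "NT"]
        tt = [f[ct] for s, f in fractions.items() if groups[s] == "TT"]
        delta[ct] = 100.0 * (mean(tt) - mean(nt))
    return delta

# Hypothetical deconvolution output for four samples and two cell types.
fractions = {
    "s1": {"B cells": 0.10, "T cells": 0.30},
    "s2": {"B cells": 0.20, "T cells": 0.30},
    "s3": {"B cells": 0.30, "T cells": 0.20},
    "s4": {"B cells": 0.40, "T cells": 0.10},
}
groups = {"s1": "NT", "s2": "NT", "s3": "TT", "s4": "TT"}

picked = subsample(fractions.keys(), max_n=3)
delta = delta_fractions(fractions, groups)
print(picked, delta)
```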

ROC analysis

A summary of the four MGTs applied to MMDs, including gene signatures, the associated references, and the computation method for each prognostic index, is provided in Table S5. The diagnostic accuracy of MGTs in classifying TT from NT samples was evaluated through receiver operating characteristic (ROC) analysis. The area under the ROC curve (AUC), sensitivity, and specificity at the optimal cutoff for each prognostic index were computed using the pROC package (version 1.10.0)²³.
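The ROC quantities can be illustrated with a small, dependency-free sketch. It computes the AUC through its rank interpretation (the probability that a tumor sample scores higher than a non-tumor sample) and, as an assumption not stated in the text, picks the "optimal" cutoff by maximizing Youden's J (sensitivity + specificity - 1). Scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_cutoff(scores, labels):
    """Cutoff maximizing Youden's J; samples with score >= cutoff are called TT."""
    best = None
    for c in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if l == 1 and s >= c)
        fn = sum(1 for s, l in zip(scores, labels) if l == 1 and s < c)
        tn = sum(1 for s, l in zip(scores, labels) if l == 0 and s < c)
        fp = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= c)
        sens, spec = tp / (tp + fn), tn / (tn + fp)
        j = sens + spec - 1.0
        if best is None or j > best[0]:
            best = (j, c, sens, spec)
    return best[1], best[2], best[3]

# Hypothetical prognostic index values; label 1 = TT, 0 = NT.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

auc = roc_auc(scores, labels)
cutoff, sensitivity, specificity = best_cutoff(scores, labels)
print(auc, cutoff, sensitivity, specificity)
```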

Data Records

Our 11 MMDs are available at ArrayExpress for lung²⁴, pancreas²⁵, prostate²⁶, kidney²⁷, stomach²⁸, colon²⁹, ovary³⁰, breast³¹, liver³², bladder³³, and melanoma cancer³⁴.

Technical Validation

Principal component analysis (PCA)

PCA was performed to assess the performance of ComBat in correcting batch effects, as previously described⁶,³⁵. The first two PCs that capture the most variance are shown for both untransformed and ComBat-transformed datasets (Fig. 2). Batch-effect corrected MMDs exhibit an apparent overlay of PCs colored by the study (i.e., original dataset), and are separated by the disease status (i.e., NT vs. TT), demonstrating successful adjustment of batch effects arising from independent datasets of different sources. The PCA plots of MMD data exclusively comprising TT samples further distinguished the two risk groups (TMi_high and TMi_low) stratified by a pan-cancer multi-gene TMi classifier (Fig. S1; see Methods).

Differential expression (DE) analysis

Prior to in-depth genome-wide DE analysis, expression levels of cancer-related genes and three reference genes (i.e., GAPDH, UBB, and ACTB) were compared between the two groups (NT vs. TT) using MMDs. The selected housekeeping genes are stably expressed across tissues to maintain cellular function, and are commonly used for normalization in transcriptomics studies. While expression levels of cancer-associated genes were significantly different between NT and TT samples, those of all reference genes were nearly identical in the two groups across all cancer types, validating the robustness of ComBat in adjusting technical batch effects while maintaining biological variation across samples (Fig. S2).

All MMDs were next subjected to genome-wide, limma-based DE analysis to rank all the genes by logFC based on DE between NT and TT samples (see Methods). These ranked lists were used to generate volcano plots visually depicting differentially expressed genes that met our statistical threshold (i.e., absolute value of logFC > 1 and adjusted P-value < 0.001) in TT relative to NT samples (Fig. S3 and Table S5). To validate these results in an independent cohort of patients, we processed TCGA data of matching cancer types (see Methods), and applied the same methods to construct the list of differentially expressed genes.
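The statistical threshold used for the volcano plots can be expressed as a simple filter. The sketch assumes that logFC and adjusted P-values are already available per gene; the gene names and values are hypothetical.

```python
def significant_genes(results, lfc_cutoff=1.0, p_cutoff=0.001):
    """Split genes passing |logFC| > lfc_cutoff and adj. P < p_cutoff.

    results: dict gene -> (logFC, adjusted P-value), TT relative to NT
    Returns (up-regulated genes, down-regulated genes), each sorted by name.
    """
    up, down = [], []
    for gene, (logfc, adj_p) in results.items():
        if adj_p < p_cutoff and abs(logfc) > lfc_cutoff:
            (up if logfc > 0 else down).append(gene)
    return sorted(up), sorted(down)

# Hypothetical DE results.
results = {
    "GENE1": (2.3, 1e-8),    # up in TT
    "GENE2": (-1.7, 5e-5),   # down in TT
    "GENE3": (0.6, 1e-10),   # significant but small effect: excluded
    "GENE4": (3.0, 0.02),    # large effect but not significant: excluded
}
up_genes, down_genes = significant_genes(results)
print(up_genes, down_genes)
```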

Rank-rank hypergeometric overlap (RRHO) analysis

The RRHO algorithm³⁶ was used to assess the overlap intensity between MMD- and TCGA-derived lists of genes ranked by DE between NT and TT samples per cancer type (Fig. 3). As compared to conventional single arbitrary cut-off-based approaches, RRHO heatmaps have been widely used to visually compare genome-wide DE patterns across different species and profiling platforms, without having to correct for batch effects for the two distinct data files³⁶,³⁷. A significant overlap was observed for lung, prostate, kidney, colon, breast, and liver cancer, for which RRHO map max ranged from 1083 for kidney cancer to 1592 for colorectal cancer (Fig. 3, top row). The weak correlation observed across pancreas, stomach, and bladder cancers between MMD and TCGA datasets is likely due to a relatively small number of tumor-free tissues available in the respective TCGA datasets (Table S1).
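The statistic underlying a single cell of an RRHO map can be sketched as a hypergeometric test on the overlap between the top i genes of one list and the top j genes of the other. This is a simplified illustration, not the web tool used in this work: it evaluates only one pair of thresholds, tests only for enrichment, and omits multiple-testing correction. The gene names and function names are hypothetical.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for n draws without replacement from N items, K of them marked."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1)) / total

def overlap_pvalue(rank_a, rank_b, i, j):
    """Overlap between the top i of rank_a and top j of rank_b, with its P-value."""
    k = len(set(rank_a[:i]) & set(rank_b[:j]))
    return k, hypergeom_sf(k, len(rank_a), i, j)

# Hypothetical ranked lists over the same 10 genes.
rank_a = ["g0", "g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9"]
rank_b = ["g1", "g0", "g2", "g4", "g3", "g5", "g7", "g6", "g9", "g8"]

k, p = overlap_pvalue(rank_a, rank_b, i=3, j=3)
print(k, p)
```

In the full method, this test is repeated over a grid of thresholds advanced by the step size, and the log-transformed P-values form the heatmap.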

To test whether this would indeed be the case, we utilized the TMi annotation (TMi_high or TMi_low) previously derived from MMD data exclusively comprising TT samples (Fig. S1), and further classified TMi group for all TCGA TT samples using the same approaches (Table S3; see Methods). Except for bladder cancer, RRHO map max increased significantly from 135 to 1014 for pancreatic cancer and 437 to 1203 for gastric cancer (Fig. 3, middle row). Similarly, highly concordant RRHO results were derived from TT subgroups stratified by other commercially available or previously validated cancer type-specific multi-gene classifiers (Fig. 3, bottom row; see Methods). These QC steps altogether demonstrate the robustness of our uniform workflow for cross-cancer analysis (Fig. S4).

Machine learning applications for predictive medicine

Cancer classifier

Publicly accessible data repositories, such as GTEx³⁸, TCGA³⁹, HPA⁴⁰, and ArrayExpress⁴¹, host genome-wide expression profiles assayed with various profiling technologies. Having sufficient read depth¹⁰, higher resolution¹¹, higher dynamic range⁴², and lower technical variation⁴³, RNA-seq is increasingly the platform of choice in translational-biomarker studies. Paralleling this trend, cross-platform normalization tools continue to be developed, facilitating comparison of data from different platforms. PREBS⁴⁴, VOOM⁴⁵, and TDM⁴² are exemplary techniques that are specifically designed to transform RNA-seq data to make it compatible with microarray data. Other conventional methods also exist for dealing with such ‘dataset shifts’⁴⁶, such as quantile normalization, log₂ transformation, and nonparanormal transformation⁴².
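Two of the conventional transformations named above can be illustrated on toy data. The sketch shows a log₂ transformation with a pseudocount of 1 (an assumption) followed by a reference-based quantile mapping, in which each target value is replaced by the reference value of the same rank. It is not an implementation of TDM, PREBS or voom, and it assumes equal-length vectors with no special handling of ties.

```python
from math import log2

def log2_transform(values, pseudocount=1.0):
    """log2(x + pseudocount); the pseudocount avoids log2(0)."""
    return [log2(v + pseudocount) for v in values]

def quantile_map(target, reference):
    """Give each target value the reference value holding the same rank."""
    if len(target) != len(reference):
        raise ValueError("this sketch assumes equal-length vectors")
    ref_sorted = sorted(reference)
    order = sorted(range(len(target)), key=lambda idx: target[idx])
    mapped = [0.0] * len(target)
    for rank, idx in enumerate(order):
        mapped[idx] = ref_sorted[rank]
    return mapped

# Hypothetical values for four genes in one sample per platform.
rnaseq_values = [0, 3, 1023, 15]           # count-like scale
microarray_ref = [6.0, 4.0, 12.0, 8.0]     # log-intensity-like scale

logged = log2_transform(rnaseq_values)
mapped = quantile_map(logged, microarray_ref)
print(logged, mapped)
```

After mapping, the RNA-seq values keep their original ordering but take on the distribution of the microarray reference.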

Using supervised machine learning, we developed new cancer classifiers trained on MMDs, and evaluated their classification performance on the respective RNA-seq-acquired TCGA datasets (Fig. 4a). Among the existing transformation methods, TDM transformation best fitted the reference MMD data distribution (Fig. 4b). Using the glmnet package (version 2.0.13)⁴⁷, we performed LASSO multinomial logistic regression⁴⁸ with 100-fold cross-validation (CV) to build the best predictive model for distinguishing TT from NT samples. The predictive model built from each MMD was then tested directly on the TDM-transformed TCGA dataset. Excluding the breast MMD, the MMD-trained models achieved an average AUC of 0.96 (ranging from 0.913 to 0.997) in classifying TCGA cancers (Fig. 4c). Other commercially available MGTs, including the Myriad myPlan™ Lung Cancer, Pervenio™, Oncotype DX and MammaPrint, further achieved AUCs ranging from 0.714 to 0.862 (Table S6, Fig. S5; see Methods).
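The idea behind an L1-penalized (LASSO) logistic classifier can be sketched without external libraries. The example below is not glmnet and does not perform cross-validation: it fits a binary model by proximal gradient descent with a fixed, hand-picked penalty, purely to show how the L1 penalty shrinks the coefficient of an uninformative feature to exactly zero. The penalty, learning rate, iteration count and data are all hypothetical.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def fit_lasso_logistic(X, y, lam, lr=0.1, n_iter=2000):
    """Binary logistic regression with an L1 penalty on the weights.

    The intercept b is left unpenalized. lam is fixed here, whereas in
    practice it would be chosen by cross-validation.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        resid = [sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - yi
                 for row, yi in zip(X, y)]
        b -= lr * sum(resid) / n
        for j in range(p):
            grad = sum(r * row[j] for r, row in zip(resid, X)) / n
            w[j] = soft_threshold(w[j] - lr * grad, lr * lam)
    return w, b

# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
X = [[-2.0, 0.1], [-1.5, -0.1], [-1.0, 0.1],
     [1.0, -0.1], [1.5, 0.1], [2.0, -0.1]]
y = [0, 0, 0, 1, 1, 1]  # 0 = NT, 1 = TT

w, b = fit_lasso_logistic(X, y, lam=0.1)
preds = [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) >= 0.5 else 0
         for row in X]
print(w, b, preds)
```

The sparsity this produces is what makes LASSO attractive for gene-expression data, where most of the many candidate features are expected to be uninformative.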

Pan-cancer immunogenomic analyses

TCGA data are increasingly being used to study the prognostic influence of the composition of tumor-infiltrating lymphocytes (TILs)⁴⁹,⁵⁰, neoantigens⁵¹,⁵² and immune cytolytic activity⁵³, all of which are putative markers predictive of clinical response to immune checkpoint inhibitor (ICI) treatments. The recent advancements in computational techniques have further facilitated high-resolution, large-scale immunogenomic analyses of the tumor-immune interface⁵⁴. Of the developed analytical pipelines, CIBERSORT serves as an exemplary in silico deconvolution method to estimate the relative proportion of 22 immune cell populations from heterogeneous bulk tissues. By applying CIBERSORT to MMDs, we next tested if the generated compendiums could further provide the basis for the developed computational infrastructure to reveal clinically significant immune landscape across multiple cancer types (see Methods).

The extent of difference in immune cell composition between the two groups (NT vs. TT) varied depending on cancer type (Fig. S6), where the estimated fractions were generally comparable (<5% difference). Specific immune cell types particularly enriched in either NT or TT group were identified, including plasma cells in lung cancer, T cells in liver cancer, and B cells in kidney, stomach, colon, breast, and bladder cancers (Fig. 5). Their enrichment was further observed in respective TCGA datasets, demonstrating the potential use of MMDs to reveal the degree and distribution of TIL density, which might be a clinically relevant prognostic and predictive indicator across various carcinomas⁵⁵,⁵⁶.

Code Availability

The R codes used to preprocess, merge, and correct for batch-effects for generation of all 11 cancer type-specific MMDs can be found on figshare (https://doi.org/10.6084/m9.figshare.7878086)²². The exemplary R codes and metadata used to develop clinical predictive models using lung MMD⁵⁷ are described in our earlier works⁵,⁶,⁵⁸.

References

1. Yuan, Y. et al. DeepGene: an advanced cancer type classifier based on deep learning and somatic point mutations. BMC Bioinformatics 17, 476 (2016).
2. Lee, S. I. et al. A machine learning approach to integrate big data for precision medicine in acute myeloid leukemia. Nat Commun 9, 42 (2018).
3. Yu, K. H. et al. Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features. Nat Commun 7, 12474 (2016).
4. Malta, T. M. et al. Machine Learning Identifies Stemness Features Associated with Oncogenic Dedifferentiation. Cell 173, 338–354 e315 (2018).
5. Lim, S. B., Tan, S. J., Lim, W.-T. & Lim, C. T. A merged lung cancer transcriptome dataset for clinical predictive modeling. Sci Data 5, 180136 (2018).
6. Lim, S. B., Tan, S. J., Lim, W. T. & Lim, C. T. An extracellular matrix-related prognostic and predictive indicator for early-stage non-small cell lung cancer. Nat Commun 8, 1734 (2017).
7. Wang, C. et al. The concordance between RNA-seq and microarray data depends on chemical treatment and transcript abundance. Nat Biotechnol 32, 926–932 (2014).
8. Zhao, S., Fung-Leung, W. P., Bittner, A., Ngo, K. & Liu, X. Comparison of RNA-Seq and microarray in transcriptome profiling of activated T cells. PLoS One 9, e78644 (2014).
9. Mooney, M. et al. Comparative RNA-Seq and microarray analysis of gene expression changes in B-cell lymphomas of Canis familiaris. PLoS One 8, e61088 (2013).
10. SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol 32, 903–914 (2014).
11. Nookaew, I. et al. A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids Res 40, 10084–10097 (2012).
12. Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat Methods 12, 453–457 (2015).
13. Lim, S. B. et al. Pan-cancer analysis connects tumor matrisome to immune response. npj Precision Oncology 3, 15 (2019).
14. Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5, R80 (2004).
15. Gautier, L., Cope, L., Bolstad, B. M. & Irizarry, R. A. affy–analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20, 307–315 (2004).
16. Taminau, J. et al. Unlocking the potential of publicly available microarray data using inSilicoDb and inSilicoMerging R/Bioconductor packages. BMC Bioinformatics 13, 335 (2012).
17. Carlson, M. hgu133plus2.db: Affymetrix Human Genome U133 Plus 2.0 Array annotation data (chip hgu133plus2). R package version 3.2.3 (2016).
18. Zhu, Y., Qiu, P. & Ji, Y. TCGA-assembler: open-source software for retrieving and processing TCGA data. Nat Methods 11, 599–600 (2014).
19. Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
20. Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43, e47 (2015).
21. Vu, V. Q. ggbiplot: A ggplot2 based biplot. R package version 0.55 (2011).
22. Lim, S. B. Compendiums of cancer transcriptome for machine learning applications. figshare. https://doi.org/10.6084/m9.figshare.7878086 (2019).
23. Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12, 77 (2011).
24. Lim, S. B. A microarray meta-dataset of lung cancer. ArrayExpress, https://identifiers.org/arrayexpress:E-MTAB-6699 (2019).
25. Lim, S. B. A microarray meta-dataset of pancreatic cancer. ArrayExpress, https://identifiers.org/arrayexpress:E-MTAB-6690 (2019).
26. Lim, S. B. A microarray meta-dataset of prostate cancer. ArrayExpress, https://identifiers.org/arrayexpress:E-MTAB-6694 (2019).
27. Lim, S. B. A microarray meta-dataset of renal cancer. ArrayExpress, https://identifiers.org/arrayexpress:E-MTAB-6692 (2019).
28. Lim, S. B. A microarray meta-dataset of gastric cancer. ArrayExpress, https://identifiers.org/arrayexpress:E-MTAB-6693 (2019).
29. Lim, S. B. A microarray meta-dataset of colorectal cancer. ArrayExpress, https://identifiers.org/arrayexpress:E-MTAB-6698 (2019).
30. Lim, S. B. A microarray meta-dataset of ovarian cancer. ArrayExpress, https://identifiers.org/arrayexpress:E-MTAB-6691 (2019).
31. Lim, S. B. A microarray meta-dataset of breast cancer. ArrayExpress, https://identifiers.org/arrayexpress:E-MTAB-6703 (2019).
32. Lim, S. B. A microarray meta-dataset of liver cancer. ArrayExpress, https://identifiers.org/arrayexpress:E-MTAB-6695 (2019).
33. Lim, S. B. A microarray meta-dataset of bladder cancer. ArrayExpress, https://identifiers.org/arrayexpress:E-MTAB-6696 (2019).
34. Lim, S. B. A microarray meta-dataset of melanoma cancer. ArrayExpress, https://identifiers.org/arrayexpress:E-MTAB-6697 (2019).
35. Lim, S. B. et al. Addressing cellular heterogeneity in tumor and circulation for refined prognostication. Proc. Natl Acad. Sci. USA 116, 17957–17962 (2019).
36. Plaisier, S. B., Taschereau, R., Wong, J. A. & Graeber, T. G. Rank-rank hypergeometric overlap: identification of statistically significant overlap between gene-expression signatures. Nucleic Acids Res 38, e169 (2010).
37. Cahill, K. M., Huo, Z., Tseng, G. C., Logan, R. W. & Seney, M. L. Improved identification of concordant and discordant gene expression signatures using an updated rank-rank hypergeometric overlap approach. Sci Rep 8, 9588 (2018).
38. GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat Genet 45, 580–585 (2013).
39. Cancer Genome Atlas Research Network et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 45, 1113–1120 (2013).
40. Ting, D. T. et al. Single-cell RNA sequencing identifies extracellular matrix gene expression by pancreatic circulating tumor cells. Cell Rep 8, 1905–1918 (2014).
41. Rustici, G. et al. ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Res 41, D987–990 (2013).
42. Thompson, J. A., Tan, J. & Greene, C. S. Cross-platform normalization of microarray and RNA-seq data for machine learning applications. PeerJ 4, e1621 (2016).
43. Wilhelm, B. T. & Landry, J. R. RNA-Seq-quantitative measurement of expression through massively parallel RNA-sequencing. Methods 48, 249–257 (2009).
44. Uziela, K. & Honkela, A. Probe Region Expression Estimation for RNA-Seq Data for Improved Microarray Comparability. PLoS One 10, e0126545 (2015).
45. Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol 15, R29 (2014).
46. Bolstad, B. M., Irizarry, R. A., Astrand, M. & Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193 (2003).
47. Friedman, J., Hastie, T. & Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw 33, 1–22 (2010).
48. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58, 267–288 (1996).
49. Gentles, A. J. et al. The prognostic landscape of genes and infiltrating immune cells across human cancers. Nat Med 21, 938–945 (2015).
50. Iglesia, M. D. et al. Genomic Analysis of Immune Cell Infiltrates Across 11 Tumor Types. J Natl Cancer Inst 108, djw144 (2016).
51. Brown, S. D. et al. Neo-antigens predicted by tumor genome meta-analysis correlate with increased patient survival. Genome Res 24, 743–750 (2014).
52. Charoentong, P. et al. Pan-cancer Immunogenomic Analyses Reveal Genotype-Immunophenotype Relationships and Predictors of Response to Checkpoint Blockade. Cell Rep 18, 248–262 (2017).
53. Rooney, M. S., Shukla, S. A., Wu, C. J., Getz, G. & Hacohen, N. Molecular and genetic properties of tumors associated with local immune cytolytic activity. Cell 160, 48–61 (2015).
54. Hackl, H., Charoentong, P., Finotello, F. & Trajanoski, Z. Computational genomics tools for dissecting tumour-immune cell interactions. Nat Rev Genet 17, 441–458 (2016).
55. Gnjatic, S. et al. Identifying baseline immune-related biomarkers to predict clinical outcome of immunotherapy. J Immunother Cancer 5, 44 (2017).
56. Gibney, G. T., Weiner, L. M. & Atkins, M. B. Predictive biomarkers for checkpoint inhibitor-based immunotherapy. Lancet Oncol 17, e542–e551 (2016).
57. Lim, S. B. A microarray meta-dataset of non-small cell lung cancer. ArrayExpress, https://identifiers.org/arrayexpress:E-MTAB-6043 (2018).
58. Lim, S. B. An extracellular matrix-related prognostic and predictive indicator for early-stage non-small cell lung cancer. figshare. https://doi.org/10.6084/m9.figshare.5350321 (2018).

Acknowledgements

This work was conceived and carried out at the MechanoBioEngineering laboratory at the Department of Biomedical Engineering, National University of Singapore (NUS). We acknowledge support provided by the National Research Foundation, Prime Minister’s Office, Singapore under its Research Centre for Excellence, Mechanobiology Institute at NUS. W.-T.L. is supported by the National Medical Research Council (NMRC/CSA/040/2012 and NMRC/CSA-INV/0025/2017). S.B.L. is supported by NUS Graduate School for Integrative Sciences and Engineering (NGS), Mogam Science Scholarship Foundation, and Daewoong Foundation.

Author information

Authors and Affiliations

NUS Graduate School for Integrative Sciences & Engineering, National University of Singapore, Singapore, Singapore
Su Bin Lim & Chwee Teck Lim
Department of Biomedical Engineering, National University of Singapore, Singapore, Singapore
Su Bin Lim & Chwee Teck Lim
Regional Scientific Affairs, Sysmex Asia Pacific, Singapore, Singapore
Swee Jin Tan
Division of Medical Oncology, National Cancer Centre Singapore, Singapore, Singapore
Wan-Teck Lim
Office of Academic and Clinical Development, Duke-NUS Medical School, Singapore, Singapore
Wan-Teck Lim
IMCB NCC MPI Singapore Oncogenome Laboratory, Institute of Molecular and Cell Biology (IMCB), A*STAR, Singapore, Singapore
Wan-Teck Lim
Mechanobiology Institute, National University of Singapore, Singapore, Singapore
Chwee Teck Lim
Institute for Health Innovation and Technology (iHealthtech), National University of Singapore, Singapore, Singapore
Chwee Teck Lim

Contributions

S.B.L., S.J.T., W.-T.L. and C.T.L. conceptualized and designed the study. S.B.L. developed the R pipeline to generate MMDs. S.B.L., S.J.T., W.-T.L. and C.T.L. analyzed and interpreted the data. S.B.L., S.J.T., W.-T.L. and C.T.L. reviewed and contributed to the manuscript.

Corresponding author

Correspondence to Chwee Teck Lim.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files associated with this article.

About this article

Cite this article

Lim, S.B., Tan, S.J., Lim, WT. et al. Compendiums of cancer transcriptomes for machine learning applications. Sci Data 6, 194 (2019). https://doi.org/10.1038/s41597-019-0207-2

Received: 15 February 2019
Accepted: 25 July 2019
Published: 08 October 2019
DOI: https://doi.org/10.1038/s41597-019-0207-2
