With the increased use of next-generation sequencing generating large amounts of genomic data, gene expression signatures are becoming critically important tools for the interpretation of these data, and are poised to have a substantial effect on diagnosis, management, and prognosis for a number of diseases. It is becoming crucial to establish whether the expression patterns and statistical properties of sets of genes, or gene signatures, are conserved across independent datasets. Conversely, it is necessary to compare established signatures on the same dataset to better understand how they capture different clinical or biological characteristics. Here we describe how to use sigQC, a tool that enables a streamlined, systematic approach for the evaluation of previously obtained gene signatures across multiple gene expression datasets. We implemented sigQC in an R package, making it accessible to users who have knowledge of file input/output and matrix manipulation in R and a moderate grasp of core statistical principles. SigQC has been adopted in basic biology and translational studies, including, but not limited to, the evaluation of multiple gene signatures for potential clinical use as cancer biomarkers. This protocol uses a previously obtained signature for breast cancer metastasis as an example to illustrate the critical quality control steps involved in evaluating its expression, variability, and structure in breast tumor RNA-sequencing data, a different dataset from that in which the signature was originally derived. We demonstrate how the outputs created from sigQC can be used for the evaluation of gene signatures on large-scale gene expression datasets.
All data that have been used in this publication have been made available through Zenodo at https://doi.org/10.5281/zenodo.1319848.
All code that constitutes the sigQC R package is available for use under a GPL v3 license and can be downloaded from the CRAN repository at https://CRAN.R-project.org/package=sigQC.
All scripts used to create the figures in this paper can be downloaded through Zenodo at https://doi.org/10.5281/zenodo.1319848.
This work was funded by Cancer Research UK grant 23969 to F.M.B. (F.M.B., A.B., W.-C.C., and A.D.), the Oxford Cancer Centre (A.L.H. and A.D.), the Medical Research Council Stratified Medicine Consortium MR/M016587/1 (T.M. and E.D.), and European Research Council Consolidator Grant 772970 to F.M.B. We are also grateful for a Clarendon Scholarship to A.D.
The authors declare no competing interests.
Key references using this protocol
Dhawan, A. et al. Nat. Commun. 9, 5228 (2018): https://doi.org/10.1038/s41467-018-07657-1
Haider, S. et al. Genome Biol. 17, 140 (2016): https://doi.org/10.1186/s13059-016-0999-8
Buffa, F. et al. Br. J. Cancer 102, 428–435 (2010): https://doi.org/10.1038/sj.bjc.6605450
Masiero, M. et al. Cancer Cell 24, 229–241 (2013): https://doi.org/10.1016/j.ccr.2013.06.004
Key data used in this protocol
van ’t Veer, L. J. et al. Nature 415, 530–536 (2002): https://doi.org/10.1038/415530a
Integrated supplementary information
Expression of signature genes across the TCGA breast cancer RNA-seq dataset for the metastasis gene signature (top) and a random set of genes (bottom), shown as (a) a barplot for the proportion of samples expressing a gene above the median, (b) a density plot showing the same information as the barplots in (a), and (c) a plot of the proportion of samples showing NA expression for each of the genes of the signature.
Comparison of median and z-transformed median of signature gene expression across the RNA-seq breast cancer dataset for the metastasis gene signature (left) and the random set of genes (right).
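The quantities named in the two captions above can be illustrated with a short sketch. This is not the sigQC implementation or its API (sigQC is an R package). It is a minimal standard-library Python illustration under stated assumptions: the expression data are held in a dictionary mapping gene names to per-sample values, missing values are `None` or NaN, "above the median" is taken to mean above the median of all observed values in the dataset, and the signature score for a sample is the median over the signature genes. The function names `expression_qc` and `median_scores` and the toy data are hypothetical.

```python
import math
from statistics import mean, median, pstdev


def is_missing(x):
    """Treat None and NaN as missing (NA) expression values."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def expression_qc(expr, signature):
    """Per-gene proportion of samples above the dataset median, and proportion NA.

    expr: dict mapping gene name -> list of per-sample expression values.
    Assumption: the threshold is the median of all observed values in expr.
    """
    observed = [v for vals in expr.values() for v in vals if not is_missing(v)]
    threshold = median(observed)
    report = {}
    for gene in signature:
        vals = expr[gene]
        n_na = sum(is_missing(v) for v in vals)
        n_above = sum((not is_missing(v)) and v > threshold for v in vals)
        report[gene] = {
            "prop_above_median": n_above / len(vals),
            "prop_na": n_na / len(vals),
        }
    return report


def median_scores(expr, signature, z_transform=False):
    """Per-sample signature score: median over signature genes.

    With z_transform=True, each gene is first centred and scaled across
    samples (population standard deviation), ignoring missing values.
    """
    rows = []
    for gene in signature:
        vals = expr[gene]
        if z_transform:
            obs = [v for v in vals if not is_missing(v)]
            mu, sd = mean(obs), pstdev(obs)
            # A constant gene has sd == 0; map its values to 0.0 rather than divide.
            vals = [
                None if is_missing(v) else ((v - mu) / sd if sd > 0 else 0.0)
                for v in vals
            ]
        rows.append(vals)
    scores = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows if not is_missing(r[j])]
        scores.append(median(col) if col else None)
    return scores


# Toy data: three genes measured in four samples, one missing value.
expr = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [5.0, None, 7.0, 8.0],
    "G3": [0.0, 0.0, 1.0, 1.0],
}
signature = ["G1", "G2", "G3"]

qc = expression_qc(expr, signature)
raw_scores = median_scores(expr, signature)
z_scores = median_scores(expr, signature, z_transform=True)
```

Comparing `raw_scores` with `z_scores` for the same samples mirrors the comparison in the second caption: if highly expressed genes dominate the untransformed median, the two scores will disagree, which suggests the score is sensitive to the scale of individual genes.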
Cite this article
Dhawan, A., Barberis, A., Cheng, WC. et al. Guidelines for using sigQC for systematic evaluation of gene signatures. Nat Protoc 14, 1377–1400 (2019). https://doi.org/10.1038/s41596-019-0136-8