
Guidelines for using sigQC for systematic evaluation of gene signatures


With the increased use of next-generation sequencing generating large amounts of genomic data, gene expression signatures are becoming critically important tools for the interpretation of these data, and are poised to have a substantial effect on diagnosis, management, and prognosis for a number of diseases. It is becoming crucial to establish whether the expression patterns and statistical properties of sets of genes, or gene signatures, are conserved across independent datasets. Conversely, it is necessary to compare established signatures on the same dataset to better understand how they capture different clinical or biological characteristics. Here we describe how to use sigQC, a tool that enables a streamlined, systematic approach for the evaluation of previously obtained gene signatures across multiple gene expression datasets. We implemented sigQC in an R package, making it accessible to users who have knowledge of file input/output and matrix manipulation in R and a moderate grasp of core statistical principles. SigQC has been adopted in basic biology and translational studies, including, but not limited to, the evaluation of multiple gene signatures for potential clinical use as cancer biomarkers. This protocol uses a previously obtained signature for breast cancer metastasis as an example to illustrate the critical quality control steps involved in evaluating its expression, variability, and structure in breast tumor RNA-sequencing data, a different dataset from that in which the signature was originally derived. We demonstrate how the outputs created from sigQC can be used for the evaluation of gene signatures on large-scale gene expression datasets.


Fig. 1: Overview of the visualizations produced by sigQC.
Fig. 2: Evaluation of signature gene expression and its variability.
Fig. 3: Comparison of scoring metrics summarizing signature gene expression.
Fig. 4: Analysis of correlation between signature genes (intra-signature correlation).
Fig. 5: Searching for signature structure.
Fig. 6: Summary radar plot for quality control metrics.
Fig. 7: Analysis of statistical significance of quality control metrics.

Data availability

All data that have been used in this publication have been made available through Zenodo at

Code availability

All code that constitutes the sigQC R package is available for use under a GPL v3 license and can be downloaded from the CRAN repository at

All scripts used to create the figures in this paper can be downloaded through Zenodo at




This work was funded by Cancer Research UK grant 23969 to F.M.B. (F.M.B., A.B., W.-C.C., and A.D.), the Oxford Cancer Centre (A.L.H. and A.D.), the Medical Research Council Stratified Medicine Consortium MR/M016587/1 (T.M. and E.D.), and European Research Council Consolidator Grant 772970 to F.M.B. We are also grateful for a Clarendon Scholarship to A.D.

Author information




F.M.B. conceived the idea and designed the study. A.D., A.B., W.-C.C., J.G.S., and F.M.B. contributed to statistics and data visualization. A.D. performed analyses. A.D., A.B., and W.-C.C. wrote and debugged code. A.B. and F.M.B. supervised the implementation. All authors contributed to application cases and interpretation of data. A.D. and F.M.B. wrote the manuscript, with contributions from all other authors.

Corresponding author

Correspondence to Francesca M. Buffa.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

Key references using this protocol

Dhawan, A. et al. Nat. Commun. 9, 5228 (2018):

Haider, S. et al. Genome Biol. 17, 140 (2016):

Buffa, F. et al. Br. J. Cancer 102, 428–435 (2010):

Masiero, M. et al. Cancer Cell 24, 229–241 (2013):

Key data used in this protocol

van ’t Veer, L. J. et al. Nature 415, 530–536 (2002):

Integrated supplementary information

Supplementary Figure 1 Measures of expression of signature genes across TCGA breast cancer dataset.

Expression of signature genes across the TCGA breast cancer RNA-seq dataset for the metastasis gene signature (top) and a random set of genes (bottom), shown as (a) a barplot for the proportion of samples expressing a gene above the median, (b) a density plot showing the same information as the barplots in (a), and (c) a plot of the proportion of samples showing NA expression for each of the genes of the signature.
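The two per-gene quantities in panels (a) and (c) can be illustrated with a small sketch. This is not the sigQC implementation (sigQC is an R package); it is a minimal Python illustration that assumes the threshold is the median of all non-missing values in the dataset and that missing values are encoded as `None`. The function name `expression_summary` and the toy gene names are hypothetical.

```python
from statistics import median


def expression_summary(expr, signature):
    """Summarize expression of signature genes.

    expr: dict mapping gene name -> list of per-sample values (None = missing).
    signature: list of gene names to summarize.
    """
    # Assumption: the threshold is the median over every non-missing value
    # in the whole dataset, not a per-gene median.
    all_values = [v for vals in expr.values() for v in vals if v is not None]
    threshold = median(all_values)

    summary = {}
    for gene in signature:
        vals = expr[gene]
        n_samples = len(vals)
        # Proportion of samples expressing the gene above the threshold.
        above = sum(1 for v in vals if v is not None and v > threshold)
        # Proportion of samples with missing (NA) expression for the gene.
        missing = sum(1 for v in vals if v is None)
        summary[gene] = {
            "prop_above": above / n_samples,
            "prop_na": missing / n_samples,
        }
    return threshold, summary


# Toy data: three hypothetical genes measured in four samples.
toy = {
    "GENE_A": [5.0, 6.0, 7.0, 8.0],
    "GENE_B": [1.0, None, 2.0, 9.0],
    "GENE_C": [0.5, 1.5, 2.5, 3.5],
}
threshold, summary = expression_summary(toy, ["GENE_A", "GENE_B"])
```

A gene with a low `prop_above` or a high `prop_na` would appear as a short bar in panel (a) or a tall bar in panel (c), which is the kind of gene a user may want to inspect before trusting a signature score on a new dataset.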

Supplementary Figure 2 Assessment of standardization of dataset values on gene signature score.

Comparison of median and z-transformed median of signature gene expression across the RNA-seq breast cancer dataset for the metastasis gene signature (left) and the random set of genes (right).
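The comparison in this figure can be sketched as follows. This is an illustrative Python example, not sigQC code: it assumes that the z-transformation is applied to each gene across samples (using the sample standard deviation) and that the signature score for a sample is the median over signature genes. The function names and toy data are hypothetical.

```python
from statistics import mean, median, stdev


def zscore(values):
    """Z-transform one gene's values across samples (sample SD, an assumption)."""
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        # A gene with no variability carries no information after scaling.
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]


def median_scores(expr, signature):
    """Per-sample median of expression over the signature genes."""
    n_samples = len(expr[signature[0]])
    return [median(expr[g][i] for g in signature) for i in range(n_samples)]


def z_median_scores(expr, signature):
    """Per-sample median after z-transforming each signature gene."""
    z_expr = {g: zscore(expr[g]) for g in signature}
    return median_scores(z_expr, signature)


# Toy data: three hypothetical genes on very different scales, three samples.
toy = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [10.0, 20.0, 30.0],
    "GENE_C": [100.0, 200.0, 300.0],
}
genes = list(toy)
raw = median_scores(toy, genes)
z = z_median_scores(toy, genes)
```

In the toy data, the raw median score simply tracks the mid-scale gene, whereas the z-transformed score weights all three genes equally. Close agreement between the two scores across samples would suggest that the signature score is not driven by differences in scale between genes.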

Supplementary information

Supplementary Text and Figures

Supplementary Figs. 1 and 2, Supplementary Table 1, and Supplementary Manuals 1 and 2

Reporting Summary


About this article


Cite this article

Dhawan, A., Barberis, A., Cheng, WC. et al. Guidelines for using sigQC for systematic evaluation of gene signatures. Nat Protoc 14, 1377–1400 (2019).
