OCTAD: an open workspace for virtually screening therapeutics targeting precise cancer patient groups using gene expression features


As the field of precision medicine progresses, treatments for patients with cancer are starting to be tailored to their molecular as well as their clinical features. The emerging cancer subtypes defined by these molecular features require dedicated resources to assist the discovery of drug candidates for preclinical evaluation. Voluminous gene expression profiles of patients with cancer have accumulated in public databases, enabling the creation of cancer-specific expression signatures. Meanwhile, large-scale gene expression profiles of cellular responses to chemical compounds have also recently become available. By matching a cancer-specific expression signature to compound-induced gene expression profiles from large drug libraries, researchers can prioritize small molecules that show high potency in reversing the expression of signature genes for further experimental testing of their efficacy. This approach has proven to be an efficient and cost-effective way to identify efficacious drug candidates. However, its success requires multiscale procedures, which poses considerable challenges for many labs. To address this, we developed Open Cancer TherApeutic Discovery (OCTAD; http://octad.org): an open workspace for virtually screening compounds targeting precise groups of patients with cancer using gene expression features. Its database includes 19,127 patient tissue samples covering more than 50 cancer types and expression profiles for 12,442 distinct compounds. The program performs deep-learning-based reference tissue selection, disease gene expression signature creation, drug reversal potency scoring and in silico validation. OCTAD is available as a web portal and as a standalone R package so that both experimental and computational scientists can easily navigate the tool.
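The signature-matching idea described above can be illustrated with a minimal sketch. This is not OCTAD's scoring method or its R package API: it uses a plain Spearman rank correlation as a stand-in for reversal potency, and every gene name, compound name and value below is hypothetical. A strongly negative score means the compound shifts expression in the opposite direction to the disease signature.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def reversal_score(disease_signature, drug_profile):
    """Correlate disease and drug-induced changes over their shared genes."""
    genes = sorted(set(disease_signature) & set(drug_profile))
    if len(genes) < 3:
        raise ValueError("need at least three shared genes")
    return spearman([disease_signature[g] for g in genes],
                    [drug_profile[g] for g in genes])


def rank_compounds(disease_signature, drug_profiles):
    """Sort compounds from most reversing (most negative score) upward."""
    scores = {name: reversal_score(disease_signature, profile)
              for name, profile in drug_profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical log2 fold changes (tumour vs. reference tissue).
disease_signature = {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": -1.0, "GENE_D": -2.0}
# Hypothetical compound-induced expression changes.
drug_profiles = {
    "compound_X": {"GENE_A": -1.5, "GENE_B": -0.5, "GENE_C": 0.5, "GENE_D": 1.8},
    "compound_Y": {"GENE_A": 1.0, "GENE_B": 0.4, "GENE_C": -0.2, "GENE_D": -0.9},
}

for name, score in rank_compounds(disease_signature, drug_profiles):
    print(name, round(score, 3))
```

In this toy input, compound_X reverses every signature gene and therefore ranks first, whereas compound_Y mimics the disease signature and ranks last. A real screen would involve thousands of genes and compound profiles and the scoring and validation steps described in the protocol.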


Fig. 1: Systems description.
Fig. 2: OCTAD cancer maps.
Fig. 3: OCTAD compounds.
Fig. 4
Fig. 5: Screen compounds targeting HCC.
Fig. 6: Correlation between sRGES and efficacy data under different parameter values in HCC.
Fig. 7: Evaluation of the results from major steps in HCC prediction.

Data availability

The data related to this protocol can be found at http://octad.org/download or https://www.synapse.org/#!Synapse:syn22101254. You can also refer to the preprint version of our protocol: https://www.biorxiv.org/content/10.1101/821546v1. This pipeline was verified in our previous research papers.

Software availability

The software is available from http://octad.org/download or https://www.synapse.org/#!Synapse:syn22101254.




The research is supported by R01GM134307, R21 TR001743 and K01 ES028047 and the MSU Global Impact Initiative. Amazon AWS research credits were received to support portal development and hosting. The portal was developed with help from Optra Health and MSU IT. The content is solely the responsibility of the authors and does not necessarily represent the official views of sponsors.

Author information




B.Z. led the project, developed the desktop version and wrote the manuscript. B.S.G. and P.N. co-led the project, coordinated web portal development and wrote the manuscript. E.C. developed the R package, led the revision and wrote the manuscript. J.X. performed the LINCS compound analysis, developed compound enrichment analysis and tested the desktop package. K.L. implemented the Toil pipeline and processed RNA-Seq samples. A.W. helped develop the code, prepared tutorials and created case studies. C.C. helped with troubleshooting. B.C. developed the initial code, wrote the manuscript and supervised the project.

Corresponding author

Correspondence to Bin Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Protocols thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

Key references using this protocol

Chen, B. et al. Nat. Commun. 8, 16022 (2017): https://doi.org/10.1038/ncomms16022

Chen, B. et al. Gastroenterology 152, 2022–2036 (2017): https://doi.org/10.1053/j.gastro.2017.02.039

Zeng, W. Z. D., Glicksberg, B. S., Li, Y. & Chen, B. BMC Med. Genomics 12, 21 (2019): https://doi.org/10.1186/s12920-018-0463-6

Liu, K. et al. Nat. Commun. 10, 2138 (2019): https://doi.org/10.1038/s41467-019-10148-6

Extended data

Extended Data Fig. 1 Screenshots of the web portal.

(a) Disease sample selection, (b) control sample selection, (c) drug prediction job submission, (e) job management, (f) predicted drug list and (g) result files.

Supplementary information

Supplementary Information

Supplementary Text and Supplementary Figs. 1–5.

Reporting Summary


Cite this article

Zeng, B., Glicksberg, B.S., Newbury, P. et al. OCTAD: an open workspace for virtually screening therapeutics targeting precise cancer patient groups using gene expression features. Nat Protoc 16, 728–753 (2021). https://doi.org/10.1038/s41596-020-00430-z
