Abstract
Cancer of unknown primary (CUP) is a type of cancer that cannot be traced back to its primary site and accounts for 3–5% of all cancers. Established targeted therapies are lacking for CUP, leading to generally poor outcomes. We developed OncoNPC (Oncology NGS-based primary cancer-type classifier), a machine-learning classifier trained on targeted next-generation sequencing (NGS) data from 36,445 tumors across 22 cancer types from three institutions. OncoNPC achieved a weighted F1 score of 0.942 for high-confidence predictions (prediction confidence \(\ge 0.9\)) on held-out tumor samples; these predictions made up 65.2% of all held-out samples. When applied to 971 CUP tumors collected at the Dana-Farber Cancer Institute, OncoNPC predicted primary cancer types with high confidence in 41.2% of the tumors. OncoNPC also identified CUP subgroups with significantly higher polygenic germline risk for the predicted cancer types and with significantly different survival outcomes. Notably, patients with CUP whose first palliative-intent treatments were concordant with their OncoNPC-predicted cancers had significantly better outcomes (hazard ratio (HR) = 0.348; 95% confidence interval (CI) = 0.210–0.570; P = \(2.32\times {10}^{-5}\)). Furthermore, OncoNPC enabled a 2.2-fold increase in the number of patients with CUP who could have received genomically guided therapies. OncoNPC thus provides evidence of distinct CUP subgroups and offers the potential for clinical decision support for managing patients with CUP.
Data availability
The multicenter NGS tumor panel sequencing data are available upon request at the AACR Project GENIE website: https://www.aacr.org/professionals/research/aacr-project-genie/. The fully trained OncoNPC model, processed somatic variants data from Profile DFCI and deidentified clinical data used in the treatment concordance analysis are available at https://github.com/itmoon7/onconpc.
Code availability
We used the R (v4.0.2) and Python (v3.9.13) programming languages for OncoNPC feature processing (R deconstructSigs v1.8.0), OncoNPC model development and interpretation (Python xgboost v1.2.0, shap v0.41.0) and survival analysis (R survival v3.2.7, stats v4.0.2, Python lifelines v0.27.4, scipy v1.7.1). Please see https://github.com/itmoon7/onconpc for the preprocessing script, the fully trained OncoNPC model, a notebook demonstration on how to use OncoNPC and other reference materials.
Acknowledgements
The participation of patients and the efforts of an institutional data collection system made this study possible, and we are grateful for their contributions. We would also like to express our appreciation to the DFCI Oncology Data Retrieval System (OncDRS) and AACR Project GENIE team for their role in aggregating, managing and delivering the data used in this project.
I.M. and A.G. were supported by R01 CA227237, R01 CA244569 and grants from The Louis B. Mayer Foundation, The Doris Duke Charitable Foundation, The Phi Beta Psi Sorority and The Emerson Collective. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Contributions
I.M. and A.G. conceived and designed the study. I.M. curated the data, developed and evaluated the model and performed analyses. J.L. and L.S. performed clinical chart reviews. I.M. wrote the first draft of the manuscript. I.M., J.L. and G.S. revised the manuscript. All the authors took part in interpreting the findings and reviewing the manuscript.
Ethics declarations
Competing interests
The authors declare no conflicts of interest.
Peer review
Peer review information
Nature Medicine thanks Lincoln Stein, Linda Mileshkin and E. Cuppen for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 OncoNPC classification performance: confusion matrix, and precision and recall.
Confusion matrices on the held-out test set (n = 7,289) for (a) 22 detailed cancer types and (b) 13 cancer groups (see Table 1). (c),(d) OncoNPC performance in precision and recall on the test set across (c) cancer types and (d) cancer groups at 4 different prediction confidences using \({p}_{\max }\) as a threshold. Each dot size is scaled by the proportion of tumor samples retained. In (d), we only considered cancer groups that have more than one cancer type. Overall F1 scores were weighted according to the number of confirmed cases across cancer types and cancer groups, respectively.
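The thresholded evaluation described in this caption can be illustrated with a minimal sketch. It assumes that a prediction is retained when its \({p}_{\max }\) is at or above the threshold, and that the overall F1 is the support-weighted mean of per-class F1 scores over the retained samples. The function name and the toy labels are hypothetical and are not taken from the OncoNPC repository.

```python
from collections import Counter

def weighted_f1_at_confidence(y_true, y_pred, p_max, threshold):
    """Return (weighted F1, fraction of samples retained) for p_max >= threshold.

    Assumption: per-class F1 scores are weighted by the number of confirmed
    (true) cases of each class among the retained samples.
    """
    kept = [(t, p) for t, p, c in zip(y_true, y_pred, p_max) if c >= threshold]
    if not kept:
        return float("nan"), 0.0
    support = Counter(t for t, _ in kept)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in kept if t == label and p == label)
        fp = sum(1 for t, p in kept if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(kept), len(kept) / len(y_true)

# Toy data (hypothetical): five tumors with predicted labels and confidences.
y_true = ["NSCLC", "NSCLC", "BRCA", "BRCA", "PAAD"]
y_pred = ["NSCLC", "BRCA", "BRCA", "BRCA", "PAAD"]
p_max = [0.95, 0.55, 0.92, 0.97, 0.60]

for thr in (0.0, 0.5, 0.7, 0.9):
    f1, retained = weighted_f1_at_confidence(y_true, y_pred, p_max, thr)
    print(f"threshold={thr:.1f}  weighted F1={f1:.3f}  retained={retained:.0%}")
```

Raising the threshold trades coverage for accuracy, which is why the caption reports the proportion of samples retained alongside precision and recall.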
Extended Data Fig. 2 OncoNPC prediction performance and prediction confidence levels (that is, pmax) across different cohorts and centers.
(a) Center-specific OncoNPC performance (in F1) on the test CKP (cancer with known primary) tumor samples (n = 7,289). The figure is a breakdown of Fig. 2c based on cancer center (DFCI: ⊙, MSK: ⊡, VICC: ◇). The performance was evaluated at 4 different prediction confidences (that is, minimum \({p}_{\max }\) thresholds). Each dot size is scaled by the proportion of tumor samples retained. See Supplementary Table 3 for the center-specific number of test CKP tumor samples broken down by cancer types and prediction confidence thresholds. (b), (c) Box plots of prediction confidence (\({p}_{\max }\)) across (b) DFCI CUP tumors, MSK CUP tumors, all DFCI CKP tumors (including those with cancer types not modeled in OncoNPC), DFCI held-out CKP tumors, and DFCI excluded CKP tumors (specifically those with cancer types not modeled in OncoNPC), and (c) DFCI held-out CKP tumors, MSK held-out CKP tumors, and VICC held-out CKP tumors. Note that 'DFCI excluded CKP tumors' refers to the cohort of rare CKP tumors whose cancer types were not considered during the development of OncoNPC. None of the cohorts analyzed in (b) and (c) was seen by OncoNPC during model training.
Extended Data Fig. 3 Robustness of OncoNPC performance with respect to input genomics features.
The figure shows the breakdown of OncoNPC performance in F1 score by 22 cancer types across increasing prediction confidence. The cancer types on the y-axis are sorted in a decreasing order of the number of tumor samples. In order to investigate the impact of input genomics features on OncoNPC’s robustness, we performed a feature ablation study, where we chose the most important genes based on their aggregated SHAP values and gradually reduced them from all 846 features associated with those genes, as well as age and sex, to only the top 10% (that is, top 29 features). In each feature configuration, we re-trained the model with the same set of hyperparameters and evaluated its performance on the held-out CKP tumor samples (n = 7,289), which were utilized throughout this work. Supplementary Data 4 provides a list of input features that correspond to the selected genes in each configuration.
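The gene-selection step of such an ablation can be sketched as follows. This is only an illustration under stated assumptions: feature importances (for example, mean absolute SHAP values) are summed per gene, the top-ranked genes are kept together with all of their features, and age and sex are always retained. The function name, feature names and scores are hypothetical; the authors' exact aggregation is described in their Methods and Supplementary Data 4.

```python
from collections import defaultdict

def select_top_gene_features(feature_importance, feature_to_gene, n_top_genes,
                             always_keep=("age", "sex")):
    """Keep every feature belonging to the n_top_genes highest-scoring genes.

    Assumption: a gene's score is the sum of the importances of its features.
    Features without a gene (such as age and sex) are kept via always_keep.
    """
    gene_score = defaultdict(float)
    for feature, score in feature_importance.items():
        gene = feature_to_gene.get(feature)
        if gene is not None:
            gene_score[gene] += score
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(gene_score, key=lambda g: (-gene_score[g], g))
    top_genes = set(ranked[:n_top_genes])
    kept = [f for f in feature_importance if feature_to_gene.get(f) in top_genes]
    kept += [f for f in always_keep if f in feature_importance and f not in kept]
    return kept

# Hypothetical importances and feature-to-gene mapping.
importance = {"KRAS_mut": 0.5, "KRAS_amp": 0.2, "EGFR_mut": 0.6,
              "TP53_mut": 0.1, "age": 0.05, "sex": 0.02}
gene_of = {"KRAS_mut": "KRAS", "KRAS_amp": "KRAS",
           "EGFR_mut": "EGFR", "TP53_mut": "TP53"}

print(select_top_gene_features(importance, gene_of, n_top_genes=1))
```

Each reduced feature list would then be used to retrain the model with fixed hyperparameters and to re-evaluate it on the same held-out samples.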
Extended Data Fig. 4 Explanation of OncoNPC prediction for a patient with CUP.
The patient is a 76-year-old male with a tumor biopsy from the liver. The pie chart on the left shows the 10 most important features across three different feature categories (that is, CNA events, somatic mutations and mutation signatures), and the scatter plot on the right shows their SHAP values and feature values. The size of each dot is scaled by the corresponding absolute SHAP value. From the chart review, we found that the patient reported a 60 pack-year smoking history, as well as having lived near a tar and chemical factory as a child. Despite the CUP diagnosis, OncoNPC confidently classified the primary site as NSCLC with a posterior probability of 0.98. SBS4, a tobacco smoking-associated mutation signature, was significantly enriched in the patient’s tumor sample and had, by far, the greatest impact on the prediction, followed by the SBS24 mutation signature, which is associated with known exposures to aflatoxin, and a KRAS mutation.
Extended Data Fig. 5 Germline polygenic risk score (PRS) enrichment of CKP tumor samples and CUP tumor samples, broken down by 8 different cancer types.
(a) Colorectal adenocarcinoma (COADREAD), (b) diffuse glioma (DIFG), (c) invasive breast carcinoma (BRCA), (d) melanoma (MEL), (e) non-small cell lung cancer (NSCLC), (f) ovarian epithelial tumor (OVT), (g) prostate adenocarcinoma (PRAD) and (h) renal cell carcinoma (RCC). The magnitude of the enrichment is quantified by \(\hat{\varDelta }_{\mathrm{PRS}}\): the mean difference between the PRS of the concordant (that is, OncoNPC-matching) cancer type and the mean of the PRSs of the discordant cancer types (see Methods). \(\hat{\varDelta }_{\mathrm{PRS}}\) is shown for CKPs in blue (for reference) and CUPs in green.
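As a reading aid, the enrichment statistic defined in this caption can be written out as a short sketch. It assumes each tumor sample carries one PRS per cancer type on a comparable (for example, standardized) scale, and that the estimate is a simple average over samples; the authors' Methods may add adjustments not shown here. The function name and toy scores are hypothetical.

```python
from statistics import mean

def delta_prs(prs_by_sample, predicted_types):
    """Mean over samples of PRS(concordant type) - mean(PRS of discordant types).

    prs_by_sample: list of dicts mapping cancer type -> PRS for one sample.
    predicted_types: the OncoNPC-predicted (or confirmed) type for each sample.
    """
    diffs = []
    for scores, predicted in zip(prs_by_sample, predicted_types):
        discordant = [v for k, v in scores.items() if k != predicted]
        diffs.append(scores[predicted] - mean(discordant))
    return mean(diffs)

# Toy example (hypothetical values) with three cancer-type PRSs per sample.
samples = [
    {"BRCA": 1.0, "MEL": 0.0, "RCC": -1.0},
    {"BRCA": 0.2, "MEL": 0.8, "RCC": 0.0},
]
predicted = ["BRCA", "MEL"]
print(round(delta_prs(samples, predicted), 3))
```

A positive value indicates that, on average, samples carry higher germline risk for their predicted cancer type than for the other cancer types.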
Extended Data Fig. 6 Exclusion criteria for downstream clinical analyses.
The boxes on the left show the number of the remaining patients in the cohort and relevant analyses, while the boxes on the right illustrate the exclusion criteria and the number of patients who were consequently removed.
Extended Data Fig. 7 Estimated survival curves for the concordant and discordant treatment groups among patients with CUP, broken down by OncoNPC predicted cancer types.
(a) BRCA, (b) gastrointestinal (GI) group (CHOL, COADREAD, EGC and PAAD), (c) lung (NSCLC and PLMESO) and (d) other OncoNPC cancer types (BLCA, DIFG, GINET, HNSCC, MEL, OVT, PANET, PRAD, RCC and UCEC). In each figure, the concordant treatment group and discordant treatment group are shown in blue and red, respectively. To estimate each survival curve, we used the inverse probability of treatment weighted (IPTW) Kaplan-Meier estimator while adjusting for patient covariates and left truncation until time of sequencing (see Methods). Statistical significance of the survival difference between the two groups was estimated by a weighted log-rank test.
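The estimator named in this caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes propensity scores for receiving concordant treatment have already been estimated from patient covariates, uses unstabilized inverse-probability weights, and handles left truncation by admitting a patient to the risk set only after the entry time (here, time of sequencing). The weighted log-rank test is not shown, and all names and numbers are hypothetical.

```python
def iptw_weights(treated, propensity):
    """Unstabilized IPTW: 1/ps for treated patients, 1/(1 - ps) for untreated."""
    return [1.0 / ps if t else 1.0 / (1.0 - ps)
            for t, ps in zip(treated, propensity)]

def weighted_km(entry, exit_time, event, weight):
    """Weighted Kaplan-Meier curve with delayed entry (left truncation).

    A patient is at risk at time t if entry < t <= exit_time.
    Returns a list of (event time, estimated survival probability).
    """
    rows = list(zip(entry, exit_time, event, weight))
    event_times = sorted({b for _, b, e, _ in rows if e})
    survival, curve = 1.0, []
    for t in event_times:
        at_risk = sum(w for a, b, _, w in rows if a < t <= b)
        deaths = sum(w for a, b, e, w in rows if e and b == t and a < t)
        if at_risk > 0:
            survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Toy cohort (hypothetical): times in months, event = 1 for death, 0 for censoring.
entry = [0, 0, 2, 0]
exit_time = [3, 5, 6, 4]
event = [1, 0, 1, 1]
weight = [1.0, 1.0, 2.0, 1.0]
print(weighted_km(entry, exit_time, event, weight))
```

In practice one curve would be fitted per treatment group, with each group's patients weighted by the output of `iptw_weights`, so that the two weighted groups are comparable on the measured covariates.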
Extended Data Fig. 8 Estimated survival curves for the concordant and discordant treatment groups among patients with CUP who received their initial treatments after the results of the OncoPanel sequencing were available to clinicians.
As in Extended Data Fig. 7, we used the inverse probability of treatment weighted (IPTW) Kaplan-Meier estimator for each survival curve while adjusting for patient covariates and left truncation until time of sequencing (see Methods). Statistical significance of the survival difference between the two groups was estimated by a weighted log-rank test. Refer to Supplementary Table 2 for demographic information on the cohort.
Extended Data Fig. 9 OncoNPC-guided actionable variants in patients with CUP.
(a), The number of CUP tumors with actionable targets, based on OncoKB (Methods), across actionable somatic variants (mutations, amplifications and fusions). Each bar corresponds to the total number of CUP tumors associated with each actionable target. The bars are color-coded by predicted cancer types. Note that each tumor may contain more than one actionable somatic variant. (b), Proportions of CUP tumor samples with actionable somatic variants (\({N}_{{action}}\)) to the total number of patients (\({N}_{{total}}\)) across OncoNPC predicted cancer types. Proportions for four different therapeutic levels based on OncoKB are shown in each bar: level 1—FDA-approved drugs, level 2—standard of care drugs, level 3—drugs supported by clinical evidence and level 4—drugs supported by biological evidence.
Supplementary information
Supplementary Information
Supplementary Notes 1–13, Supplementary Figs. 1–10 and Supplementary Tables 1–3.
Supplementary Data 1
OncoNPC input feature genes targeted across different panel versions.
Supplementary Data 2
A full set of features utilized in OncoNPC.
Supplementary Data 3
Aggregated SHAP values for OncoNPC features.
Supplementary Data 4
Features utilized across different settings of the ablation study.
Supplementary Data 5
Patient information in the treatment concordance analysis cohort.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Moon, I., LoPiccolo, J., Baca, S.C. et al. Machine learning for genetics-based classification and treatment response prediction in cancer of unknown primary. Nat Med 29, 2057–2067 (2023). https://doi.org/10.1038/s41591-023-02482-6
DOI: https://doi.org/10.1038/s41591-023-02482-6