Main

Cancer of unknown primary, also known as occult primary tumors, is a heterogeneous group of tumors whose primary site cannot be found when the cancer has metastasized.1 Per 100 000 individuals, the incidence varies from 5 to 7 cases in Europe, 7 to 12 cases in the USA, and 18 to 19 cases in Australia.2 The latest data show that cancer of unknown primary accounts for ~3–5% of all newly diagnosed cancers,2 and it is the fourth leading cause of cancer-related death worldwide.3, 4 The prognosis of patients with carcinoma of unknown primary site who receive empiric chemotherapy is generally poor. The median survival period is 3–9 months, even when newer combination treatment regimens are administered.5 Hence, cancer of unknown primary remains an important clinical problem that generates frustration among surgeons, oncologists, and pathologists, in addition to the uncertainty and stress it imposes on patients. Identification of the primary site can ease the patient’s anxiety and improve long-term survival with the help of more specific therapy.2, 6

In current clinical practice, patients with carcinoma of unknown primary should inform doctors of their medical history and receive a detailed physical examination, laboratory testing, digital imaging, and endoscopic examination. Positron emission tomography–computed tomography, the most efficient imaging test to depict the tumor tissue of origin, can only detect 24–53% of primary lesions of cancer of unknown origin.7 Histological examination, particularly immunohistochemistry, is the cornerstone for identifying the tumor of origin. However, even with the best experts and the most advanced technology, the primary site can be identified in only 20–30% of patients with cancer of unknown primary,8 and the results can be subjective.

This clinical need has resulted in a quest for better and more accurate identification of the primary site of tumors. To address this need, several studies have demonstrated that the expression levels of tens to hundreds of genes can be used as a ‘molecular fingerprint’ to classify a multitude of tumor types. Varadhachary et al9 and Talantov et al10 presented a reverse transcription polymerase chain reaction-based method that measures the expression of 10 signature genes among six tumor types. Ma et al11 developed a similar method based on 92 genes to classify 32 tumor types. Tothill et al12 reported a 79-gene panel to discriminate among 13 tumor types. Instead of measuring conventional gene expression, Rosenfeld et al13 analyzed microRNA expression to classify tumor samples.

With the rapid evolution of microarray technology over the last decade, there have been tremendous efforts invested in the field of cancer research using standardized genome-wide microarrays. Considering the large amount of high-quality, publicly available gene expression data sets, the integrative analysis of genomic data, in which data from multiple studies are combined to increase the sample size and avoid laboratory-specific bias, has the potential to yield new biological insights that are not possible from a single study.14

In the present study, we established a comprehensive gene expression database containing the genome-wide expression profiles of more than 16 000 tumor samples representing 22 common human cancer types. By using an innovative analytical method, we aimed to develop a gene expression signature to aid in the identification of tumor origin.

Materials and methods

Sample Collection and Data Curation

The gene expression data sets of 16 674 tumor samples with histologically confirmed origins were collected from public data repositories (eg, ArrayExpress, Gene Expression Omnibus, and The Cancer Genome Atlas Data Portal) and curated to form a comprehensive pan-cancer transcriptome database.

Array-based gene expression profiling of 7048 tumor samples was mainly conducted on three different platforms of Affymetrix oligonucleotide microarray: GeneChip Human Genome U133A Array, U133A 2.0 Array, and U133 Plus 2.0 Array. Data from raw CEL files were pre-processed using the single-channel array normalization method with default parameters. Although different opinions exist concerning data pre-processing, the single-channel array normalization method was considered the most suitable for personalized-medicine workflows. Rather than processing microarray samples as groups, which can introduce biases and present logistical challenges, the single-channel array normalization method normalizes each sample individually by modeling and removing probe- and array-specific background noise using only internal array data.15 We further used the alternative CDF files from BrainArray Resource (http://brainarray.mbni.med.umich.edu/) to summarize the probe-level intensities directly to Entrez gene IDs. Probes mapping to multiple genes and other problems associated with old generations of Affymetrix probe designs were thereby excluded.16

Sequencing-based gene expression profiles of 9626 tumor samples were generated on the Illumina HiSeq 2000 RNA sequencing platform and kindly provided by The Cancer Genome Atlas pan-cancer analysis working group through the Synapse website (https://www.synapse.org/).17 Each gene expression profile consists of transcriptomic data for 20 501 unique genes. The clinical information for selected samples was retrieved from the ‘Clinical Biotab’ section of the data matrix based on the Biospecimen Core Resource IDs of the patients.

Gene Signature Identification

Gene expression data analysis was performed using R software and packages from the Bioconductor project.18, 19, 20 To identify a gene expression signature, we used the support vector machine recursive feature elimination algorithm for feature selection and classification modeling.21 For multi-class classification, a one-versus-all approach was used, whereby a separate binary classifier is derived for each tumor type. The results are reported as a series of probability scores, one for each of the 22 tumor types. The probability score serves as an indicator of the certainty of a classification made by the gene expression signature. Each score ranges from 0 (low certainty) to 100 (high certainty), and the scores sum to 100 across the 22 primary tumor types. A probability score threshold of 50 was established to indicate the confidence of a single classification. When the highest probability score fell below 50, the sample was considered an ‘unclassifiable case’. When the highest probability score was above 50, the corresponding tumor type was considered the tumor of origin. An example of gene expression signature classification is shown in Supplementary Figure 1.
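The decision rule described above can be illustrated with a minimal sketch. It assumes that the raw one-versus-all outputs are non-negative numbers that are rescaled to sum to 100. The function name `classify_sample`, the rescaling step, and the example scores are hypothetical and are not taken from the study's implementation. The text does not say how a score of exactly 50 is handled, so the sketch treats it as unclassifiable.

```python
from typing import Dict, Optional, Tuple

THRESHOLD = 50.0  # cutoff stated in the text


def classify_sample(
    raw_scores: Dict[str, float]
) -> Tuple[Optional[str], Dict[str, float]]:
    """Rescale scores to sum to 100 and apply the 50-point threshold.

    Returns (predicted_type, probability_scores). predicted_type is None
    when the sample is treated as an unclassifiable case.
    """
    total = sum(raw_scores.values())
    if total <= 0:
        raise ValueError("scores must be non-negative with a positive sum")
    # Assumption: simple proportional rescaling so the scores sum to 100.
    scores = {t: 100.0 * s / total for t, s in raw_scores.items()}
    best_type = max(scores, key=scores.get)
    if scores[best_type] > THRESHOLD:
        return best_type, scores
    return None, scores


# Hypothetical three-class example (the study uses 22 classes).
confident, confident_scores = classify_sample(
    {"breast": 8.2, "lung": 1.1, "ovary": 0.7}
)
ambiguous, ambiguous_scores = classify_sample(
    {"breast": 4.0, "lung": 3.5, "ovary": 2.5}
)
print(confident, ambiguous)
```

In the first call one class dominates and is returned as the predicted origin; in the second no class exceeds the threshold, so the sample is flagged as unclassifiable rather than forced into the top-scoring class.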

Signature Performance Assessment

For each specimen, the predicted primary site of the tumor was compared with the reference diagnosis. A true-positive result was indicated when the predicted tumor type matched the reference diagnosis. When the predicted tumor type and reference diagnosis did not match, the specimen was considered a false positive. For each tissue on the panel, sensitivity was defined as the ratio of true-positive results to the total positive samples analyzed, while specificity was defined as 1 − (false positives)/(total tested − total positive). The diagnostic odds ratio was calculated as a combination of the sensitivity and specificity as described by Glas et al.22
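As a rough illustration of these per-class measures, the sketch below computes sensitivity, specificity, and the diagnostic odds ratio for one tumor type from a list of (reference, prediction) pairs. Specificity is computed as true negatives over all samples that do not belong to the class, and the diagnostic odds ratio as (TP × TN)/(FP × FN). The function name, the handling of zero cells, and the toy data are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def per_class_metrics(
    pairs: List[Tuple[str, str]], tumor_type: str
) -> Dict[str, float]:
    """Sensitivity, specificity and diagnostic odds ratio for one class.

    `pairs` holds (reference_diagnosis, predicted_type) for each specimen.
    Assumes the class has at least one positive and one negative specimen.
    """
    tp = sum(1 for ref, pred in pairs if ref == tumor_type and pred == tumor_type)
    fn = sum(1 for ref, pred in pairs if ref == tumor_type and pred != tumor_type)
    fp = sum(1 for ref, pred in pairs if ref != tumor_type and pred == tumor_type)
    tn = len(pairs) - tp - fn - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # The ratio is undefined when fp or fn is zero; this sketch reports
    # infinity in that case (an assumption, not the study's convention).
    dor = float("inf") if fp * fn == 0 else (tp * tn) / (fp * fn)
    return {"sensitivity": sensitivity, "specificity": specificity, "dor": dor}


# Hypothetical two-class toy data: 10 lung and 10 breast specimens.
toy_pairs = (
    [("lung", "lung")] * 8
    + [("lung", "breast")] * 2
    + [("breast", "breast")] * 9
    + [("breast", "lung")] * 1
)
lung_metrics = per_class_metrics(toy_pairs, "lung")
print(lung_metrics)
```

A diagnostic odds ratio above 1 means that a positive call for the class is more likely in specimens that truly belong to it than in those that do not.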

Results

Establishment of Pan-Cancer Transcriptome Database

To create a cancer transcriptome database for tumor primary site identification, the following issues were primarily considered. First, the database should span as many tumor sites as possible. Second, within each tumor type, all possible histological subtypes should be covered. In addition, to approximate how the candidate gene expression signature would perform in identifying the tumor origin in carcinoma of unknown primary, metastatic cancers, poorly differentiated tumors, and undifferentiated tumors should also be included. Thus, a systematic search of major biological data repositories (eg, ArrayExpress, Gene Expression Omnibus, and The Cancer Genome Atlas project) was performed to collect the gene expression profiling data sets of different tumor types.

Overall, we accumulated the gene expression profiles of 16 674 tumor samples to form a comprehensive pan-cancer transcriptome database. The tumors originated from 22 major tissue types, including adrenal gland, brain, breast, cervix, colorectal, endometrium, gastroesophagus, head and neck, kidney, liver, lung, lymphoma, melanoma, mesothelioma, neuroendocrine, ovary, pancreas, prostate, sarcoma, testis, thyroid, and urinary. The database also contains patient demographic data and clinical information. To identify a reliable gene expression signature, we adopted a training-validation approach in this study. First, the gene expression profiles of 5800 primary tumors with histologically confirmed origins were retrieved from the database and curated to form a large training set. Next, two independent validation sets were formed: one composed of sequencing-based gene expression profiles of 9626 tumor specimens with histologically confirmed origins (test set 1) and the other composed of gene expression profiles of 1248 tumor specimens that were poorly differentiated, undifferentiated, or from metastatic tumors (test set 2). Figure 1 depicts the three phases of our study design and Table 1 summarizes the clinical characteristics of the samples in the study.

Figure 1

Flow diagram of gene expression signature identification and performance evaluation.

Table 1 Summary of sample information

Gene Selection and Functional Annotation

The training set consisted of 5800 samples covering more than 95% of solid tumors by incidence, with 55–542 specimens per tumor class that encompass a range of intratumor heterogeneity. After data normalization and annotation steps, a matrix of 12 000 unique genes in 5800 samples (≈70 million data points) was prepared for downstream bioinformatics analyses. Extracting a subset of informative genes from such high-dimensional genomic data is a critical step for gene expression signature identification. Although many algorithms have been developed, the support vector machine recursive feature elimination approach is considered one of the best gene selection algorithms. For each tumor type, we used the support vector machine recursive feature elimination approach to: (1) evaluate and rank the contribution of each gene toward the optimal separation of a specific cancer type from the other tumor types; (2) select the top 10-ranked genes as the most differentially expressed genes for this tumor type; and (3) repeat this process for each tumor type to obtain 22 lists of top 10 genes. After removing redundant features, 154 unique genes were obtained. The full list of the 154 candidate genes with respect to each tumor type is provided in Table 2.
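The final merging step can be sketched as follows. The sketch does not implement the support vector machine ranking itself; it only shows how per-type top-ranked lists can be combined into one set of unique genes, assuming that "redundant features" means genes selected for more than one tumor type. The function name, the placeholder gene and tumor-type labels, and the use of a top-2 cutoff in the toy example are all hypothetical.

```python
from typing import Dict, List


def merge_top_genes(ranked: Dict[str, List[str]], top_n: int = 10) -> List[str]:
    """Take the top_n genes per tumor type and return their union.

    Genes that appear in several lists are kept once; first-seen order
    is preserved so the result is deterministic.
    """
    seen = set()
    signature: List[str] = []
    for genes in ranked.values():
        for gene in genes[:top_n]:
            if gene not in seen:
                seen.add(gene)
                signature.append(gene)
    return signature


# Placeholder rankings (best gene first) for three hypothetical tumor types.
toy_rankings = {
    "type_1": ["GENE_A", "GENE_B", "GENE_C"],
    "type_2": ["GENE_B", "GENE_D", "GENE_A"],
    "type_3": ["GENE_E", "GENE_A", "GENE_F"],
}
toy_signature = merge_top_genes(toy_rankings, top_n=2)
print(len(toy_signature), toy_signature)
```

In the toy example, 3 lists of 2 genes give 6 candidates, of which 4 are unique. The same logic applied to 22 lists of 10 genes yields at most 220 candidates, consistent with the 154 unique genes reported once overlaps are removed.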

Table 2 List of selected 154 candidate genes and related tumor types

We further investigated whether these candidate genes revealed biological features known to be relevant to different cancers. Kyoto Encyclopedia of Genes and Genomes pathway enrichment analysis was performed using the GeneCodis bioinformatics tool (http://genecodis.dacya.ucm.es/).23 As shown in Table 3, a diverse group of gene families is represented in the 154-gene list. The most significantly enriched gene categories are those involved in specific biological processes, including tyrosine metabolism, fat digestion and absorption, cytokine–cytokine receptor interaction, extracellular matrix–receptor interaction, and gastric acid secretion. Even more interestingly, genes described in oncogenic pathways such as those of bladder cancer, melanoma, and prostate cancer were also significantly overrepresented, reflecting their differential involvement in a range of tumor classes.

Table 3 The top Kyoto Encyclopedia of Genes and Genomes pathways enriched in the 154-gene list

Leave-One-Out Internal Cross-Validation

As an initial step, we assessed the performance of the classifier using leave-one-out cross-validation within the training set. Leave-one-out cross-validation simulates the performance of a classification algorithm on unseen samples. With leave-one-out cross-validation, the algorithm is repeatedly retrained, leaving out one sample in each round and testing each sample on a classifier that was trained without this sample. The 154-gene expression signature showed an overall accuracy of 96.5% (5597 of 5800; 95% CI 96.0 to 97.0%) with notable variation between different cancer types. Sensitivities ranged from 89.7% (endometrium) to 100% (neuroendocrine). This internal validation of the training set provides a preliminary estimate of classification performance.
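The leave-one-out procedure can be sketched in a few lines. To keep the example self-contained, a nearest-centroid rule stands in for the support vector machine classifier used in the study; the function names and the two-gene toy data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (expression vector, tumor type)


def train_centroids(samples: List[Sample]) -> Dict[str, List[float]]:
    """Stand-in classifier: mean expression vector per tumor type."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vec, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [a + b for a, b in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [v / counts[lab] for v in s] for lab, s in sums.items()}


def predict(centroids: Dict[str, List[float]], vec: Sequence[float]) -> str:
    """Assign the label of the nearest centroid (squared Euclidean distance)."""
    def dist(label: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(centroids[label], vec))
    return min(centroids, key=dist)


def leave_one_out_accuracy(samples: List[Sample]) -> float:
    """Retrain without sample i, test on sample i, and average the hits."""
    correct = 0
    for i, (vec, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]  # hold out sample i
        model = train_centroids(training)
        if predict(model, vec) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical toy data: two well-separated classes, two "genes" each.
toy_samples: List[Sample] = [
    ([1.0, 0.1], "A"), ([0.9, 0.2], "A"), ([1.1, 0.0], "A"),
    ([0.1, 1.0], "B"), ([0.2, 0.9], "B"), ([0.0, 1.1], "B"),
]
loo_accuracy = leave_one_out_accuracy(toy_samples)
print(loo_accuracy)
```

Because every prediction is made by a model that never saw the held-out sample, the averaged result estimates performance on unseen data, at the cost of retraining once per sample.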

Independent Validation in Primary Tumors Profiled with Next-Generation Sequencing

The final classification model of the 154-gene expression signature was established using the entire training set and then applied to an independent validation set comprising 9626 primary tumor samples profiled with next-generation sequencing (test set 1). Representation from the 22 sites ranged from 48 (lymphoma) to 1218 (breast) samples. The 154-gene expression signature assigned probability scores above 50 to 9100 (94.5%) of the 9626 samples, which were treated as ‘valid classifications’. Among these 9100 valid cases, the 154-gene expression signature showed 97.1% overall agreement with the reference diagnosis (8839 of 9100; 95% CI 96.8 to 97.5%). Figure 2 shows a matrix of the relationship of the test results compared with the reference diagnoses. Sensitivities for the 22 main cancer types ranged from 84.2% (gastroesophagus) to 100% (prostate). Specificities ranged from 99.4% (gastroesophagus) to 100% (mesothelioma, neuroendocrine, and thyroid). The detailed sensitivities and specificities are listed in Table 4. A total of 526 cases (5.5%) were considered ‘unclassifiable’ by the 154-gene expression signature, with probability scores below 50. Cervix, urinary, sarcoma, head and neck, gastroesophagus, and endometrium were the most common biopsy sites among those unclassifiable cases. Diagnostic odds ratios for all 22 tumor types were significantly >1, indicating that each class reported by the 154-gene expression signature provides significant discrimination and performance.
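The headline figures for test set 1 can be recomputed from the reported counts. The text does not state which method was used for the confidence intervals, so the sketch below uses the normal-approximation (Wald) interval purely as an assumption; the function name is hypothetical. For these counts it should give an interval close to the reported one.

```python
import math
from typing import Tuple


def proportion_with_ci(
    successes: int, total: int, z: float = 1.96
) -> Tuple[float, float, float]:
    """Return (proportion, lower, upper) using the Wald interval.

    Assumption: the study's interval method is not stated; the Wald
    interval is used here only for illustration.
    """
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Counts reported for test set 1.
classified_rate = 9100 / 9626          # samples with a score above 50
agreement, low, high = proportion_with_ci(8839, 9100)

print(f"classified: {classified_rate:.1%}")
print(f"agreement: {agreement:.1%} (95% CI {low:.1%} to {high:.1%})")
```

Note that the agreement is computed over the 9100 classified samples only; the 526 unclassifiable cases are excluded from the denominator, which should be kept in mind when comparing against methods that force a call for every sample.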

Figure 2

Confusion matrix by tumor type of the test set 1. Reference diagnoses are shown across the top row, and 154-gene expression signature predictions are shown along the left-hand column. The matrix shows the direct relationship between each adjudicated reference diagnosis versus the molecular classifier prediction, including reproducible patterns of classification and misclassification.

Table 4 Performance characteristics of the 154-gene expression signature in two test sets

Independent Validation in Metastatic and Undifferentiated Tumors

The 154-gene expression signature was further validated in test set 2, comprising 1248 tumor specimens. For test set 2, we particularly enriched for metastatic tumor specimens with known primary sites and primary tumors with poor differentiation because these probably reflect the clinical circumstance of carcinoma of unknown primary. Representation from the 22 sites ranged from 12 (thyroid) to 216 (sarcoma) samples. The 154-gene expression signature assigned probability scores above 50 to 1077 (86.3%) of the 1248 samples, which were treated as ‘valid classifications’. Among these 1077 valid cases, the 154-gene expression signature showed 92% overall agreement with the reference diagnosis (991 of 1077; 95% CI 90.2 to 93.6%). Figure 3 shows a matrix of the relationship of the test results compared with the reference diagnoses. Sensitivities for the 22 main tumor types ranged from 38.9% (pancreas) to 100% (adrenal, brain, head and neck, liver, neuroendocrine, and testis). Specificities ranged from 98.0% (lung) to 100% (adrenal, brain, cervix, mesothelioma, neuroendocrine, pancreas, and prostate). The detailed sensitivities and specificities are listed in Table 4. One hundred seventy-one (13.7%) cases were considered ‘unclassifiable’ by the 154-gene expression signature, with probability scores below 50. Prostate, kidney, pancreas, urinary, adrenal, and melanoma were the most common biopsy sites among those unclassifiable cases. Diagnostic odds ratios for all 22 tumor types were significantly >1.

Figure 3

Confusion matrix by tumor type of the test set 2. Reference diagnoses are shown across the top row, and 154-gene expression signature predictions are shown along the left-hand column. The matrix shows the direct relationship between each adjudicated reference diagnosis versus the molecular classifier prediction, including reproducible patterns of classification and misclassification.

Discussion

Owing to great advancements in high-throughput microarray technologies and the comprehensive efforts of systematic cancer genomics projects, we were able to utilize large genomic data sets for our study. We report here the creation of a pan-cancer gene expression database from more than 16 000 human tumor samples and demonstrate that multi-class tumor classification is feasible by comparing an unknown sample to this reference database. The 154-gene expression signature demonstrated an overall accuracy of 96.5% for 22 tumor types by cross-validation of the training set, and 97.1% in an independent test set of 9626 primary tumors profiled with next-generation sequencing. Furthermore, we tested the signature on a spectrum of diagnostically challenging tumors. An overall accuracy of 92% was achieved on the 1248 tumor specimens that were poorly differentiated, undifferentiated, or from metastatic tumors.

Several investigations have reported multigene algorithms and results that demonstrate the promise of gene expression-based signatures in tumor origin identification. Unlike many studies in which samples were often dominated by well-differentiated primary cancers, our approach directly exploited undifferentiated metastatic tumor samples for the validation of our 154-gene expression signature. In a clinical scenario, the uncertainty of tumors’ origin usually arises within the context of metastatic and/or poorly differentiated to undifferentiated malignancies, and some of the previously published gene expression-based signatures have shown decreased performance with less-differentiated tumors. In this study, we show that the 154-gene expression signature could reliably identify the tumor origin in 92% of the 1077 tumor samples tested. This accuracy is comparable to other gene expression-based signatures with reported accuracies in the range of 79–91%.24, 25, 26 The performance of this test also compares favorably with current clinical practice standards such as immunohistochemistry, which has shown 75% accuracy in metastatic samples using a predetermined panel of 10 antibodies.27

It is noteworthy that the expression patterns of several genes among the 154-gene panel have been observed previously by other methods to be relatively tissue specific for certain types of carcinomas—eg, KLK3 has been identified as the gene encoding prostate-specific antigen, which has long been known as an important tumor marker used in the diagnosis and monitoring of prostate cancer. Originally, it was thought that prostate-specific antigen was only produced by the cells of the prostate gland. Recently, it has been shown that elevated levels of prostate-specific antigen are also observed in some breast and gynecologic cancers.28, 29 In addition, overexpression of the EGFR gene occurs across a wide range of different cancers, including brain, colorectal, lung, esophageal, cervical cancers, and sarcoma.30, 31, 32, 33, 34, 35 CDH1 and VEGFA have been reported among the highly significant markers in colorectal, gastric, and liver cancers.36, 37, 38, 39, 40, 41

The 154-gene expression signature shows clear promise in identifying the tumor’s origin, but it is not perfect. For diagnostically challenging tumors, systematic errors were noted in the classes of endometrial and pancreatic tumors (58 and 61% misclassified, respectively). Among the seven misclassified endometrial cancers, five were predicted to be ovarian cancer. Given the current controversies over the ontogeny of female genital tract cancers,42, 43, 44, 45 molecular profiling with the 154-gene expression signature may reflect this biologic intersection and provide additional insight into the origin of these tumors. Among the 11 misclassified pancreatic cancers, six were predicted to have originated from the gastroesophagus, and four from the liver. It is known that pancreatic cancer has a complex and heterogeneous genetic basis and is often misidentified as esophageal cancer.46 Indeed, pancreatic cancer is the most difficult type of carcinoma of unknown primary to identify using our method as well as all published methods.8, 24, 47, 48, 49, 50

Additional research is needed to successfully translate the 154-gene signature from gene expression microarray to real-time reverse transcription polymerase chain reaction assays, thus allowing broader access and utilization in the clinical setting. In routine practice, most diagnostic materials are formalin-fixed and paraffin-embedded; thus, it will be highly interesting to assess the usefulness of the 154-gene signature in formalin-fixed and paraffin-embedded samples. Future translational research should focus on the development and validation of the real-time polymerase chain reaction-based gene expression test using formalin-fixed and paraffin-embedded samples.

In conclusion, this study describes the development and validation of a gene expression-based signature to assist in the identification of the origin of tumor tissue. We foresee its application in cases of poorly differentiated or undifferentiated metastatic tumors and in cases where histology alone fails to suggest a specific primary site of origin. Further studies evaluating the impact of gene expression-based test results on therapy choice and treatment outcome for patients with carcinoma of unknown primary are warranted.