Introduction

In recent years, progress in genome sequencing technologies and bioinformatics has provided enormous gains in understanding of the molecular aberrations associated with the development of various cancers. The emergence of publicly available cancer genomic datasets, including The Cancer Genome Atlas (TCGA), facilitates the comprehensive understanding of the molecular pathogenesis of cancer and is allowing for the development of new strategies to improve cancer diagnosis, therapy, and prevention. By analysing these publicly available genomic data, many novel disease-associated genes have been uncovered.1,2

TCGA was formed in 2005 when the U.S. National Cancer Institute and the National Human Genome Research Institute joined forces to launch a project to comprehensively map the genomic changes in various cancers. To date, more than 11,000 individuals with 33 cancer types have been included in the cohort.3,4 These data have thus far contributed to >2000 studies of various cancers in PubMed.

The cohort composition for each cancer type is an important consideration since the results generated from these cases may be used to make inferences about the respective cancer type among the general population. Prior studies have shown that race, which is often used as a proxy for ancestry and social exposures, is related to the pathogenesis of cancer and different genetic backgrounds in common tumour types may influence clinical outcome and response to therapy.5,6,7 Evidence has shown that somatic mutation frequency differs by race in various cancer types,8,9,10 implying that factors associated with race can impact the somatic mutation landscape. Other evidence also highlights the implications of sex and age dissimilarities in genetic susceptibility to cancer.11,12,13 For these reasons and because TCGA data was assembled mainly from an eligible convenience sample of cancer patients with strict sample selection criteria,14 it is important to understand similarities and differences in the characteristics of individuals who have contributed samples to TCGA relative to those of the general population of individuals diagnosed with cancer. A previous study of TCGA cases found that race/ethnicity disparities exist relative to the U.S. general population for ten cancer types examined comprising 5729 cases.15 Another study that analysed nine different cancer types in TCGA indicated a dissimilar age distribution in comparison with corresponding cases in the Surveillance, Epidemiology, and End Results (SEER) database.16 However, differences in demographic and clinical characteristics beyond race/ethnicity and age between members of the TCGA and the general U.S. population of cancer cases have not been systematically characterised.

In this study, we extend the results from previous studies by comparing demographic and clinical characteristics (age at diagnosis, sex, race, stage at diagnosis, and survival months) between TCGA cases with 33 cancer types and cases in two population-based databases: (1) the SEER 18 database, which currently covers ~28% of the U.S. population,17 and (2) the U.S. combined registries of the North American Association of Central Cancer Registries (NAACCR), which cover cancer registrations in all 50 states and the District of Columbia.18

Methods

Population

Three separate data sources were used in this study: TCGA,19 the SEER 18 database,17 and the NAACCR public use dataset.20 Data from individuals diagnosed with 33 cancer types were extracted from TCGA. No duplicate cases were found across cancer types, as determined by matching TCGA case IDs. Individuals with corresponding cancer types in SEER were identified using the third edition of the International Classification of Diseases for Oncology (ICD-O-3) by primary site and histology/behavior (Supplementary Table 1). To compare TCGA cases to a contemporary population of individuals diagnosed with cancer, only cases diagnosed with a primary malignancy from 2010 to 2013 in SEER were included. Since SEER intentionally oversamples U.S. minority populations,21 we used data from NAACCR to compare race distributions. This public use dataset, published in the annual Cancer in North America (CiNA) volumes, covers cancer registrations in all 50 states and the District of Columbia, approaching 100% coverage of the U.S. population in the most recent time period.22 The most current five years (2009–2013) of data for U.S. and Canadian individuals diagnosed with cancer were available in this dataset. In this study, only U.S. cancer cases with available race data were included. The corresponding cancer types in NAACCR were defined using the cancer sites denoted in Supplementary Table 1.

Variables

XML files from TCGA containing data on demographics, cancer variables, and follow-up status were downloaded from the National Cancer Institute Genomic Data Commons data portal19 on 22 December 2016. Python 3.6.0 was used to parse these files and extract the variables. Demographic data including diagnosis age, sex, and race were extracted from the “clin_shared:age_at_initial_pathologic_diagnosis”, “shared:gender”, and “clin_shared:race” fields. Race was categorised as White, Black (African American), and Other (Asian, American Indian, or Alaska Native). Ethnicity was not included in this analysis due to the large proportion (24%) of cases with missing data for this field. Clinical information was extracted from the “shared_stage:clinical_stage”, “shared_stage:pathological_stage”, “shared_last_contact_days_to”, and “shared_death_days_to” fields. Stage was defined according to American Joint Committee on Cancer (AJCC) staging, which includes categories I, II, III, and IV. Survival months were calculated by dividing the value of the “shared_last_contact_days_to” field (for cases who were still alive) or the “shared_death_days_to” field (for cases who died during the follow-up period) by the number of days in a month (365.24/12).23 Similarly, the demographic and clinical data of the 33 corresponding cancers were extracted from the SEER 18 database using SEER*Stat 8.3.4. Diagnosis age was based on the SEER variable “Age at diagnosis”. Race classifications were based on the “Race recode (W, B, AI, API)” variable and defined the same as above. Stage at diagnosis was defined using the “Derived AJCC Stage Group (7th edition 2010+)” variable. Survival months were defined using the “Survival months” variable. In NAACCR, the race categories were based on the “Race (Includes Hispanic)” variable and defined the same as for TCGA.
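The day-to-month conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' extraction script: the function name `survival_months` and its two arguments are hypothetical stand-ins for the already-parsed values of the two follow-up fields, and only the divisor (365.24/12) comes from the text.

```python
# Days per month as stated in the text (365.24 / 12).
DAYS_PER_MONTH = 365.24 / 12


def survival_months(days_to_last_contact, days_to_death):
    """Convert follow-up time in days to months for a single case.

    Hypothetical helper: days_to_death is used when the case died during
    follow-up; otherwise days_to_last_contact is used. Either may be None.
    """
    if days_to_death is not None:
        days = days_to_death
    else:
        days = days_to_last_contact
    if days is None:
        return None  # no follow-up information recorded for this case
    return days / DAYS_PER_MONTH


# A case last seen alive after 730.48 days has about 24 months of follow-up.
alive_case = survival_months(730.48, None)
# A case that died after 100 days contributes about 3.3 months.
deceased_case = survival_months(900, 100)
```

Returning `None` for cases without either value keeps missing follow-up distinct from zero follow-up, which matters when counting missing survival data later.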

Statistical analysis

Stata version 14 was used for all analyses. Student’s t-test was used to test for differences in mean diagnosis age, and Cohen’s d, a measure of effect size, was used to quantify them. Cohen’s d is calculated as the difference between two means divided by the pooled standard deviation.24 By convention, Cohen’s d ≥ 0.3 indicates at least a moderate effect size. Ordinary least squares regression was used to estimate the overall mean difference in diagnosis age between TCGA and SEER cases with adjustment for cancer type. Chi-square and Fisher’s exact tests were used to identify differences in the proportions of sex, race, and stage categories. Additionally, for race and stage comparisons, adjusted residuals were used to determine the categories with the largest difference relative to sample size. An adjusted residual ≥ 2.0 indicates a significantly greater proportion of a particular race or stage category among TCGA cases than in the comparison population (i.e., NAACCR or SEER), while an adjusted residual ≤ −2.0 indicates a significantly lower proportion. We also quantified the mean all-cause survival months using restricted mean survival time (RMST) analysis,25 with 12 months as the end point to ensure that all included TCGA cases had the same window of observation. Since all TCGA cases were diagnosed prior to 2014, all had at least 12 months of follow-up time except for the cases who died during this period. For cases with over 12 survival months, the survival months were truncated at 12. The RMST approach is valid for any distribution of time to event.25,26,27,28,29 The between-group difference in mean survival, with corresponding 95% confidence intervals (CIs), was estimated at the 12-month horizon using a generalised linear model with robust standard errors, adjusting for diagnosis age, sex, race, and stage where available for a specific cancer type.
Statistical tests for all analyses were two-tailed tests and the critical value for alpha for all tests was 0.05.
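As an illustration of the effect size defined above, the following Python sketch computes Cohen’s d as the difference in means divided by the pooled standard deviation. It is not the Stata code used in the study; the function name and the toy ages are hypothetical, and the use of sample variances (n − 1 denominator) in the pooling is an assumption.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation.

    Assumption: variances are sample variances (n - 1 denominator), pooled
    with weights (n_a - 1) and (n_b - 1).
    """
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical diagnosis ages, for illustration only.
tcga_ages = [50.0, 60.0, 70.0]
seer_ages = [60.0, 70.0, 80.0]
d = cohens_d(tcga_ages, seer_ages)  # negative: first group is younger
```

With the first argument taken as the TCGA cases, a negative d corresponds to a younger mean diagnosis age in TCGA than in the comparison group.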

Results

Of 11,160 TCGA cases with 33 cancer types diagnosed between 1978 and 2013, 1097 cases were diagnosed with breast invasive carcinoma (BRCA) followed by glioblastoma multiforme (GBM, n = 596), ovarian serous cystadenocarcinoma (OV, n = 587), uterine corpus endometrial carcinoma (UCEC, n = 548), kidney renal clear cell carcinoma (KIRC, n = 537), head and neck squamous cell carcinoma (HNSC, n = 528), lung adenocarcinoma (LUAD, n = 522), and brain lower grade glioma (LGG, n = 515). Six cancers including adrenocortical carcinoma (ACC), cholangiocarcinoma (CHOL), lymphoid neoplasm diffuse large B-cell lymphoma (DLBC), mesothelioma (MESO), uterine carcinosarcoma (UCS), and uveal melanoma (UVM) had < 100 cases each. Among the corresponding 33 diagnoses in SEER, the number of cases ranged from 164 (UCS) to 203,828 (BRCA). In NAACCR, the number of cases ranged from 15,705 (MESO) to 1,085,443 (BRCA). Demographic and clinical characteristics of TCGA and SEER cases are shown in Table 1. The race distribution of TCGA and NAACCR cases is shown in Table 2.

Table 1 Demographic and clinical characteristics of TCGA and SEER cases
Table 2 Race distribution of TCGA and NAACCR cases

Age at diagnosis

Overall, the mean diagnosis age of TCGA cases was 3.9 years younger (95% CI: 1.7–6.2, P < 0.001) than that of SEER cases after adjusting for cancer type (data not shown). The mean diagnosis age of TCGA cases was not significantly different from that of SEER cases for a minority of cancers (CHOL, colon adenocarcinoma (COAD), KIRC, kidney renal papillary cell carcinoma (KIRP), pheochromocytoma and paraganglioma (PCPG), sarcoma (SARC), stomach adenocarcinoma (STAD), thymoma (THYM), and UVM). In contrast, for most cancer types (24/33), there were statistically significant differences in the mean diagnosis age. Among these, the majority (20/24) had a significantly younger mean diagnosis age in TCGA; the exceptions were LGG, rectum adenocarcinoma (READ), UCEC, and UCS, for which TCGA cases had a significantly older mean diagnosis age than SEER cases (Fig. 1). The difference in the mean diagnosis age was especially pronounced for DLBC (8.4 ± 2.4 years younger in TCGA), oesophageal carcinoma (ESCA, 3.8 ± 0.9 years younger), kidney chromophobe (KICH, 7.4 ± 1.3 years younger), LGG (7.5 ± 0.9 years older), liver hepatocellular carcinoma (LIHC, 3.6 ± 0.7 years younger), MESO (8.4 ± 1.3 years younger), prostate adenocarcinoma (PRAD, 4.7 ± 0.4 years younger), and UCS (4.3 ± 1.5 years older) cases, where the absolute effect size for the diagnosis age difference (Cohen’s d) was ≥ 0.3 (Table 3).

Fig. 1

Age at diagnosis difference between TCGA and SEER cases. Filled diamonds indicate a statistically significant difference (P < 0.05). The y-axis shows the effect size in terms of Cohen’s d, calculated for each cancer as the difference between the mean diagnosis age of TCGA and SEER cases divided by the pooled standard deviation.24 An absolute Cohen’s d > 0.3 indicates at least a moderate effect size. Cohen’s d < 0 indicates TCGA cases with a younger mean age than SEER cases

Table 3 Differences in the distribution of demographic and clinical characteristics among TCGA, SEER, and NAACCR cases

Sex

For most cancer types (22/27), the observed sex distribution for TCGA cases was similar to that for SEER cases. Lung squamous cell carcinoma (LUSC), skin cutaneous melanoma (SKCM), and thyroid carcinoma (THCA) had a significantly higher proportion of male cases (74.0% vs. 62.4%, 61.7% vs. 56.6%, and 26.8% vs. 22.8%, respectively), while LIHC and SARC had an excess of female cases (32.4% vs. 22.6% and 54.4% vs. 46.7%, respectively) in TCGA vs. SEER (Tables 1 and 3).

Race

Overall, compared to the NAACCR cases, individuals whose reported race was Other (Asian, American Indian, or Alaska Native) were over-represented in TCGA. The observed race distribution differed significantly between TCGA and NAACCR for 13/18 cancer types (Fig. 2a). Among the 13 cancers, eight (bladder urothelial carcinoma (BLCA), BRCA, ESCA, LIHC, pancreatic adenocarcinoma (PAAD), SKCM, STAD, and THCA) had a significantly higher percentage (adjusted residuals ≥ 2) of individuals with reported Other race and eight (cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), ESCA, LIHC, OV, PAAD, PRAD, SARC, and STAD) had a significantly lower percentage (adjusted residuals ≤ −2) of reported Black race in TCGA vs. NAACCR (Table 3).
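The adjusted residuals used to flag over- and under-represented categories can be illustrated with a short Python sketch. It assumes the standard definition for a contingency table, (observed − expected) / sqrt(expected × (1 − row share) × (1 − column share)); the function name and the counts are hypothetical and are not taken from Table 2 or Table 3.

```python
import math


def adjusted_residuals(table):
    """Adjusted standardised residuals for an r x c table of counts.

    Assumed definition: (O - E) / sqrt(E * (1 - row_total/N) * (1 - col_total/N)),
    where E = row_total * col_total / N.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            se = math.sqrt(
                expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            )
            out_row.append((observed - expected) / se)
        residuals.append(out_row)
    return residuals


# Hypothetical counts: rows are datasets, columns are two race categories.
counts = [
    [30, 70],  # first dataset (e.g. the genomic cohort)
    [20, 80],  # second dataset (e.g. the registry comparison)
]
res = adjusted_residuals(counts)
# res[0][0] is about 1.63, below the 2.0 threshold used in the text.
```

In a 2 × 2 table every cell has the same absolute residual, and its square equals the Pearson chi-square statistic, which gives a quick sanity check on the calculation.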

Fig. 2

Race and stage proportion difference. a Race proportion of TCGA and NAACCR cases. b Stage proportion of TCGA and SEER cases. The stars indicate that the difference was statistically significant (P < 0.05)

Stage at diagnosis

For the 26 TCGA cancer types with stage information, evidence for stage dissimilarities was observed for most cancer types (25/26) (Fig. 2b). Specifically, compared to SEER cases, 16 cancers had a significantly lower proportion of stage I in the TCGA cohort, 19 cancers had a significantly higher proportion of stage II, 12 cancers had a significantly higher proportion of stage III, and 14 cancers had a significantly lower proportion of stage IV (Table 3).

Survival months

Using 12 months as an end point, the adjusted mean all-cause survival months were significantly longer for cases with 27/33 cancer types in TCGA relative to SEER. For the remaining six cancer types (CESC, KICH, KIRC, OV, testicular germ cell tumours (TGCT), and UVM), no statistically significant difference was found (Fig. 3). It is noteworthy that for CHOL and SARC, TCGA cases lived an average of over 2 months (2.35 and 2.47 months, respectively) longer than SEER cases within the 12-month follow-up window (Table 3).
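To make the 12-month restricted mean concrete, the sketch below shows a simplified, unadjusted version in Python. It assumes, as stated in the Methods for TCGA, that every case is either followed for at least 12 months or dies earlier, so the restricted mean reduces to the average of survival months truncated at 12. The reported estimates were additionally adjusted for covariates, which this sketch does not attempt; the function name and survival times are hypothetical.

```python
def restricted_mean_survival(months, horizon=12.0):
    """Mean survival time restricted to `horizon` months.

    Simplifying assumption: no case is censored before the horizon, so each
    survival time can simply be truncated at the horizon and averaged.
    """
    truncated = [min(m, horizon) for m in months]
    return sum(truncated) / len(truncated)


# Hypothetical survival months, for illustration only.
cohort_a = [6.0, 20.0, 15.0]
cohort_b = [3.0, 12.0, 30.0, 9.0]
difference = restricted_mean_survival(cohort_a) - restricted_mean_survival(cohort_b)
# difference == 1.0 month in favour of cohort_a over the 12-month window
```

Because every value is capped at the horizon, the largest possible between-group difference is 12 months, which helps put differences of about 2 months in context.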

Fig. 3

Survival months difference between TCGA and SEER cases. Mean survival months difference at 12 months of follow-up with corresponding 95% CIs. Dots indicate the mean survival months difference and lines represent its 95% CI. Values below zero on the y-axis indicate TCGA cases with a shorter survival time than SEER cases

Discussion

In this study, we observed that despite an approximately equal sex distribution for most cancer types included in TCGA vs. SEER data, differences exist in mean diagnosis age, race, stage at diagnosis distributions, and mean survival months. Generally, our analysis indicates that TCGA cases are younger and survive longer than those from SEER.

A previous study comparing the characteristics of TCGA cases to the U.S. general population was conducted by Spratt et al.15 The authors reported that TCGA cases with 10 cancer types compared to the U.S. population were 77% vs. 64% White, 12% vs. 12% Black, 3% vs. 5% Asian, 3% vs. 16% Hispanic, and 0.5% vs. 1–2% Native Hawaiian, Pacific Islander, Alaskan Native, or American Indian descent. White cases were over-represented and Asian and Hispanic cases were under-represented compared to the general population. However, the Spratt et al. study used the general U.S. population as the comparator, whose composition differs from that of U.S. cancer patients, who are among the prime beneficiaries of TCGA results.

Another more recent study compared the age distribution of TCGA cases to that of SEER cases for nine cancer types.16 Similar to our study, the age distributions for cases in the SEER database were skewed older than those in the TCGA data for nearly all cancer types examined. Specifically, TCGA cases < 70 years were well represented across most tumour types, but cases aged 80–99 years were under-represented for all cancers. These data are also consistent with those from clinical trials.30 TCGA specimens are primarily from U.S. academic institutions,3,15 suggesting that younger patients are more likely to be seen at academic centres and to participate in research where the samples were acquired. A systematic review on the recruitment of older cancer patients to clinical trials reported that age is a significant barrier to recruitment.31 For example, Kemeny et al. found that 68% of younger stage II breast cancer patients were offered a trial vs. 34% of the older patients (P < 0.001).32 It is presumed that older patients may need extra time and resources to access available clinical trials or that they are often excluded because they do not meet eligibility criteria.31 Our results emphasise the importance of increasing access of older cancer patients to cancer genomic projects to increase the applicability of the findings to these patients.

Racial disparities in cancer incidence and survival have been well documented among various cancers. Although socioeconomic and cultural differences between racial groups can explain some of the disparities, recent progress in cancer genomic sequencing allows for a molecular understanding.33,34 Genomic landscape differences that co-vary by race, a marker of ancestry, may influence cancer treatment. For example, one study reported that even after adjusting for smoking status and sex, race was still significantly associated with EGFR mutations.35 EGFR mutations were highly prevalent in Asians at 30% vs. 7% in Whites.36 In addition, results from a meta-analysis of randomised controlled trials have reported that compared with Caucasians, Asians have a higher survival and response rate to chemotherapy.37 In our study, the race distribution was notably dissimilar for 13/18 cancers, with 8/13 cancers showing under-representation of individuals of Black race; the distinct genomic landscape of these individuals may therefore be under-represented for many cancer types. Notably, 8 of these 13 cancers had higher representation of individuals of Other race (Asian, American Indian, or Alaska Native). This over-representation may reflect the small sample sizes of some TCGA cancers, where even a few cases in the Other population can make up a relatively large proportion.

Stage is a well-established predictor of cancer prognosis and survival.38 Studies have also reported notable genetic variation in cancers by stage.39,40 In our study, stage dissimilarities existed for almost all cancer types (25/26). However, these differences between datasets may be due to the fact that only individuals who had a resection procedure were included in TCGA.14 Individuals with unresectable cancers, such as cancers with advanced stage or metastatic cancer,41 did not meet the inclusion criteria of the program, which likely led to a lower stage distribution of the cases in TCGA compared to SEER. In addition, other sample eligibility requirements may have contributed to stage differences, including that only untreated, fresh-frozen first primary tumour samples were accepted.14

To our knowledge, this is the largest study to compare clinical characteristics of TCGA cancer cases to a sample of the general population of U.S. cancer cases. However, our study has limitations. No specific diagnosis criteria for each cancer type have been published for TCGA to our knowledge. Thus, the corresponding cancers in SEER were matched by cancer site and histology, and identified by ICD-O-3 primary site and histology/behavior code. Moreover, cases with certain cancers had missing race, stage, and survival months information. Particularly, 6/33 TCGA cancer types (THCA, LUSC, PRAD, COAD, READ, and UVM) had over 15% missing data on race, and 5/26 SEER cancers (BLCA, LIHC, MESO, CHOL, and UVM) had over 15% missing data on stage. In addition, for the race comparison, only 18 cancers in NAACCR were identified with sites matching to those of TCGA cases.

In conclusion, we found dissimilarities in the distributions of demographic and clinical characteristics between TCGA and general population cancer cases for the majority of cancers. Increased awareness of under-represented groups by researchers conducting cancer genomic research will allow for targeted efforts that increase the representativeness of genomic data that is important for precision medicine.