SEER and Gene Expression Data Analysis Deciphers Racial Disparity Patterns in Prostate Cancer Mortality and the Public Health Implication

A major racial disparity in prostate cancer (PCa) is that African American (AA) patients have a higher mortality rate than European American (EA) patients. We filtered the SEER 2009–2011 records and divided them into four groups regarding patient races and cancer grades. On such a partition, we performed a series of statistical analyses to further clarify the aforementioned disparity. Molecular evidence for a primary result of the epidemiological analysis was obtained from gene expression data. The results include: (1) Based on the registry-specific measures, a significant linear regression of total mortality rate (as well as PCa specific mortality rate) on the percentage of (Gleason pattern-based) high-grade cancers (PHG) is demonstrated in EAs (p < 0.01) but not in AAs; (2) PHG and its racial disparity are differentiated across ages and the groups defined by patient outcomes; (3) For patients with cancers in the same grade category, i.e. the high or low grade, the survival stratification between races is not significant in most geographical areas; and (4) The genes differentially expressed between AAs’ and EAs’ tumors of the same grade category are relatively rare. The perception that prostate tumors are more lethal in AAs than in EAs is reasonable regarding AAs’ higher PHG, while high grade alone could not imply aggressiveness. However, this perception is questionable when the comparison is focused on cases within the same grade category. Supporting observations for this conclusion hold a remarkable implication for erasing racial disparity in PCa. That is, “Equal grade, equal outcomes” is not only a verifiable hypothesis but also an achievable public health goal.


Materials and Methods
Study design. In this study, we used three sources of data, i.e., the SEER data, the TCGA digital gene expression data (Ex-1), and an aggregation of three GEO microarray gene expression datasets (Ex-2). The samples in all of these datasets compose patients of multiple races, facilitating disparity research, especially the test of the proposed hypothesis. Our statistical analysis focused on the samples from non-Latin European American (EA) and African Americans (AA) populations. The integration of the SEER data with the gene expression data was achieved by an approximate alignment of the cancer grades via Gleason patterns (GP). That is, the high-grade or low-grade cancers in the SEER data correspond to the GP-4 or GP-3 specimens in the gene expression data. Specifically, we partitioned the SEER samples into four groups, i.e., AA-LG, AA-HG, EA-LG and EA-HG, where HG and LG denote the high-grade and low-grade cancer (patient), respectively. Our primary results or conclusions for a specific research question were derived by performing pairwise comparisons of those four groups. The gene expression data were further analyzed to test if a biological basis, inferred from the stratification of gene expression levels across populations and Gleason patterns, exists for the SEER data-based results about mortality disparity patterns.
SeeR data. We retrieved prostate cancer SEER data from the 2009-2011 database, and then refined the data to generate a working dataset (SEER-WD), containing 86996 AA and non-Hispanic EA cancer cases collected by 48 registries (including 59 patients whose survival or follow-up times are unknown). Each registry represents a county or a parish in California, Louisiana and other five states. These registries are selected as each of them has at least 100 AA or EA patients documented during the studied time period (Table 1). Because the patients in the selected data entered into the survey during a short time span (i.e. three years), we can assume that the individuals of a specific race in each registry constitute a cohort whose socio-economic relevance is relatively strong. Moreover, the potential influence of the Hurricane Katrina occurring in Louisiana in 2005 could be largely alleviated regarding the studied period.
The SEER determines the stage of a PCa patient according to the histologic grade of his disease tissue. In the grading system, the codes I, II, III and IV denote "well differentiated", "moderately differentiated", "poorly differentiated" and "undifferentiated; anaplastic", respectively. The cases in these four grades respectively account for 1.1%, 42.3%, 56.3% and 2.8% of the total records in SEER-WD. About 88% of the cases have the survival or follow-up times falling into the interval of 36 and 72 months ( Supplementary Fig. S1). According to the SEER Program Coding and Staging Manual 2012 15 , the cancers coded with I, II and III have Gleason Scores (GS) ranking from 2 to 4, 5 to 6 and 7 to 10, respectively (GS corresponding to the code IV is missed in the Manual but it should be over 8). In this study, we combined I and II into the low-grade category (LG), and combined III and IV into the high-grade category (HG). The main difference between a HG patient and a LG patient is that a representative tumor specimen from the former but not from the latter contains Gleason patterns 4 or 5 (GP-4 or -5) as the primary or second prevalent ones. It is well known that tumorous cells in GP-4 and GP-5 are more aggressive than those in GP-3 [16][17][18] . Therefore, HG and LG could be considered as two prostate cancer "subtypes" with inequivalent lethality. calculation of mortality metrics. Total mortality rate (TMR) and prostate cancer specific mortality rate (PSMR) are calculated using the formula TMR = M/T and PSMR = M1/T, respectively. Here, T is the total number of the patients; M is the number of patients with the values in the "Vital status recode (study cutoff used)" column of the SEER data being "Dead"; and M1 is the number of patients with the values in the "SEER cause-specific death classification" column of the SEER data being "Dead (attributable to this cancer dx)". Digital gene expression data (Ex-1). Ex-1 contains the log2 transformed level-3 digital (RNA-seq) gene expression profiling of 333 TCGA prostate adenocarcinomas samples whose Gleason patterns (GPs) were reviewed/corrected by two pathologists 19 . A refined version of this dataset, which includes 65 GS-6(3 + 3), 102 GS-7(3 + 4), 78 GS-7(4 + 3) and 44 GS-8(4 + 4) tumors, is focused on in our study. TCGA quantified and normalized the gene expression levels using the RSEM (RNASeq by Expectation Maximization) method 20,21 . Microarray gene expression data (Ex-2). Ex-2 is a composite dataset containing the clinical and gene expression information of primary prostate cancers in three cohorts: a section of GSE21034 22 , GSE62667 23 , and GSE72291 24 . These three cohorts are respectively consisted of 131, 182 and 139 samples. Together, the numbers of GS-5, -6, -7, -8, -9, and -10 tumors are 1, 86, 265, 54, 42 and 2, respectively. There are two tumors whose GSs are missing in the clinical data. In order to reduce unnecessary complexities in statistical analysis and succinctly present the results, we further ignored the difference between GS-5 and GS-6 tumors (i.e. treat both of them as "GS-6") and the difference among GS-8, GS-9 and GS-10 ones (i.e. treat all of them as "GS-8"). The gene expression levels of all these tumors were measured by the same platform, i.e. Affymetrix Human Exon 1.0 ST Arrays. We first downloaded the raw data from the GEO database and then used the frozen Robust Multi-array Analysis www.nature.com/scientificreports www.nature.com/scientificreports/ algorithm 25 to summarize and normalize the transcriptomic quantities to generate gene-level expression profiling of each individual sample. Finally, the quantile normalization was repeatedly applied to the expression matrix of the 452 samples. www.nature.com/scientificreports www.nature.com/scientificreports/ Data augmentation. While Ex-1 and Ex-2 are the largest datasets available to us for studying racial disparities in gene expression of PCa tissues, our research is still subject to a challenge arising from data limitations. That is, the numbers of samples from the interested minority population, i.e. AA, is not sufficiently large. In both datasets, only 12-13% of tumor samples are from AA patients. This situation is further complicated as these tumors are distributed among four to five Gleason Grade-based categories. In particular, it is inappropriate to use the gene expression information of GS-7 tumors, which account for approximately a half of the samples, in a disparity analysis. The reason is that, for a GS-7 tumor, the Gleason pattern (GP-3 or GP-4) of the specimen used in the RNA-seq or microarray experiment is uncertain. This is different from a GS-6 or GS-8 tumor whose experimental specimen can be heuristically assigned to GP-3 or GP-4 category, respectively. On the other hand, we have identified a strong expression signature of 288 genes for distinguishing GP-4 specimens from GP-3 specimens in another work 26 . As a result, before performing the advanced analysis, we used a machine learning method (see the Statistical Method subsection) to partition the GS-7 tumor samples into the GP-3 and GP-4 specimens. In this way, we substantially augmented the clinical information of the gene expression datasets, which facilitates the comparison of gene expression programs between high-grade and low-grade cancers.
Statistical methods. We used linear regression to model the relationship between the proportion of high-grade cancers and mortality rate. The differences of survival time between two patient groups were evaluated by Cox-PH regression. T-test was used to identify the differentially expressed genes between two sample categories. The hierarchical clustering analysis was performed using Ward's method and the Manhattan distance. Fisher's exact test was adopted to test the independence of two different sample partitions and to evaluate the functional enrichment of significant genes. The employed software includes the relevant functions in R packages "stats", "gplot" and "survival", the David tool 27 , and an on-line available R function heatmap.3() (http://www. biostars.org/p/18211/).
We used the Support Vector Machine (SVM) (9) to predict the actual Gleason pattern (GP-3 or GP-4) of the specimen from a GS-7 tumor. A SVM model was trained on a dataset consisting of GS-6 and GS-8 samples. The svm() function in the R package "e1071" was implemented with the default parameters except for the class weights and kernel type. We specified the class weights as the reciprocals of the fractions of the GP-4 and GP-3 samples in the training set. The pattern of each GS-7 tumor was predicted twice using the linear and sigmoid kernels, respectively. Only the tumors with consistent predictions were kept for further analysis. Our preliminary study showed that such a double-kernel prediction and filtering can warrant the sensitivity and specificity being over 0.9.
All the data used in this study have been previously published or can be freely obtained from public resources. All analyses were performed using standard statistical methods in accordance with relevant guidelines and regulations. All protocols of experiments and information collection were approved by the National Institute of Health (NIH) when the original owners of the data carried out their studies. The gene expression data meet the minimum information standards. The informed consent was obtained from all prostate cancer patients, who were over 18.

Results
Disparity in the regression of tMR on pHG. Our results show that the nation-wide TMR and PHG for the EA population are 0.126 and 55.9%, respectively. The corresponding quantities for the AA population are 0.157 and 60.4%, respectively. Registry-specific TMR and PHG, called R-TMR and R-PHG hereafter, are calculated with respect to each individual registry as the "experimental" unit. For EAs, the R-TMRs range from 0.08 to 0.181 and R-PHGs range from 42.7% to 71.0%. For AAs, the R-TMRs range from 0.065 to 0.23 and R-PHGs range from 43.6% to 74.7%. According to a paired t-test, the difference between AAs and EAs is extremely significant in both R-TMR (p = 0.003) and R-PHG (p < 0.000001). As expected, a significant linear regression of R-TMR on R-PHG is observed in EAs (p < 0.01) and the Pearson correlation (r) between these two metrics is up to 0.38. However, such a relationship is not observed in AAs (Fig. 1).

Disparity in the regression of pSMR on pHG.
Our results show that the nation-wide PSMRs are 0.03 and 0.04 for EA and AA populations, respectively. Registry-specific PSMRs (R-PSMRs) for EAs and AAs range from 0.004 to 0.067 and 0.012 to 0.08, respectively. According to a paired t-test, the difference between AAs and EAs is significant (p < 0.01). Similar to the situation for R-TMR, a significant linear regression of R-PSMR on R-PHG is observed in EAs (r = 0.42, p < 0.003) but not in AAs (Fig. 2). patient age related disparity in pHG. The domain of patient ages at the initial diagnosis dates are partitioned into ten age segments (A-S), i.e. <45, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80-84, and > = 85 years (See Supplementary Fig. 2A, B for the distributions). Race and A-S specific PHGs are calculated and are depicted by a scatter plot. A smooth spline method is used to generate two curves for AA and EA populations, respectively. As shown in Fig. 3A, the area between these two curves assembles a "dolphin" shape with AA curve at the top, EA curve at the bottom, <45 A-S points at the head end and > = 85 A-S points at the tail end. This indicates that PHG is consistently higher in AAs than in EAs but the differences are subtle for the early-onset patients and are almost ignorable for the later-onset patients.
Patient age and tumor grade-related disparity in cancer mortality. Two subsets are extracted from SEER-WD. The first (WD-1) contains 10077 patients who died during the follow-up periods regardless of the cause of death. The second (WD-2) contains 2825 patients whose deaths were attributable to prostate cancer. Using the same method employed in the previous section, we graphically depict the association patterns between PHG and patient age. The patterns obtained from WD-1 and WD-2 are demonstrated in Fig. 3B, C, respectively.
In particular, open circles whose diameters are proportional to the race and A-S specific quantities of two mortality metrics (i.e. TMR and PSMR) are added to these two plots, respectively. The graphics provide us with the Scientific RepoRtS | (2020) 10:6820 | https://doi.org/10.1038/s41598-020-63764-4 www.nature.com/scientificreports www.nature.com/scientificreports/ following epidemiological implications. First, in the EA population, the patients in the middle age segments (spanning 50-74 years) have a lower PHG than those in the end age segments (<50 or > =75 years), regardless of the death causes. But this is not the case for the AA population, in which the PHG quantities of the patients who died of prostate cancer are almost consistent across the entire age span except for a very low value being observed at the < =45 A-S. Second, disparity regarding the PSMR metric is substantial and consistent over patient ages. Third, as expected, the quantities of these two mortality metrics increase with patient ages. The death cases are mainly observed in the patients whose ages at the initial diagnosis are over 70 years.
Demographic disparity patterns in grade-related survival stratification. Among the registries included in SEER-WD, sixteen have at least 280 patients in both EA and AA populations. Geographical  www.nature.com/scientificreports www.nature.com/scientificreports/ area-specific survival analysis is performed on the data of these registries. The statistical significance of the survival stratification of the AA-HG and EA-HG groups, as well as the stratification in AA-LG and EA-LG groups, is evaluated. The results for all-causes mortality are shown in Fig. 4, in which the individual plot displays the Kaplan Meier curves of the patients in a registry. The p-values obtained from Cox-PH regression analysis, in which the age of a patient at the initial diagnosis is included as a covariate, are also printed in the plots. With the criterion of p-value <0.05, the results can be divided into four classes, denoted by I, II, III and IV, respectively. I: Racial survival disparity is observed in patients of both HG and LG (cancer) groups, which is the pattern present in Georgia::Fulton County, Louisiana::East Baton Rouge Parish and other three registries. II: The disparity is observed only in patients of HG group, which is the pattern present in California::Alameda County, New Jersey::Essex County and other three registries. III: The disparity is observed only in patients of LG group, which is the pattern in Louisiana::Caddo Parish. IV: The disparity is not observed in patients of both groups, which is the pattern present in California::Riverside County, Georgia::DeKalb County and other two registries.
The results of survival analysis for prostate cancer specific mortality are shown in Supplementary Fig. S3, in which the patients who died from other causes or whose dead cause categories could not be determined from the data are considered as "censored" cases. The registry-specific survival stratification patterns in Supplementary  Fig. S3 are largely consistent with those in Fig. 4, but significant racial disparities (p1 or p2 < 0.05) are only found in 4 (out of 16) registries. The results obtained from the analysis of dataset Ex-1 are shown in Fig. 5A-D. With the cutoffs of fold change being larger than 2 and p-values being less than 0.01, only 49 and 51 significant genes are identified in AA&GP-3 versus EA&GP-3 and AA&GP-4 versus EA&GP-4, respectively. No KEGG pathway is over-represented by those genes. The lists (i.e. L3 and L4) of significant genes for AA&GP-3 versus AA&GP-4 and EA&GP-3 versus EA&GP-4 are much longer, containing 348 and 335 genes, respectively. Three KEGG pathways are over-represented (p < 0.05, Fold Enrichment> 8) by the 105 common genes of L3 and L4. They are "hsa04978: Mineral absorption", "hsa04974: Protein digestion and absorption" and "hsa04512: ECM-receptor interaction". A main difference between L3 and L4 is that the former, but not the latter, is enriched (p < 0.001 and Fold Enrichment> 2) with the genes involved in "hsa04151:PI3K-Akt signaling pathway".

contributions of grade and race factors to gene expression variability. This set of analyses is sep
Apparently, the genes with different expression levels between AA and EA specimens of the same Gleason pattern are rare, and the contribution of the Gleason patterns to the variability of gene expression activity is much Figure 3. Patient age-related disparity patterns in PHG and cancer mortality metrics. The domain of patient ages at the initial diagnosis is divided into ten segments (i.e. <45, 45-49, 50-54, 55-59,60-64, 65-69, 70-74, 75-79, 80-84 and > =85 years). Race-specific PHG and mortality metrics are calculated for individual age segments. Red and black points/lines represent the EA and AA populations, respectively. In plot A, the information of all patients is used in calculating PHG. In plot B, PHG is calculated with the information of all the patients who died during the follow-up periods; and the diameter of the circle at a data point is proportional to the corresponding TMR (total mortality rate of patients) with the reference being printed at the top right corner. In plot C, PHG is calculated with the information of the patients who died of prostate cancers during the follow-up periods; and the diameter of the circle at a data point is proportional to the corresponding PSMR (prostate cancer specific mortality rate) with the reference being printed at the top right corner. (2020) 10:6820 | https://doi.org/10.1038/s41598-020-63764-4 www.nature.com/scientificreports www.nature.com/scientificreports/ more significant than patient races. This conclusion is verified by the same comparison analysis performed on the composite microarray gene expression data ( Supplementary Fig. 4A-D). It is also supported by the results of clustering analysis (Fig. 6, Supplementary Fig. 5), which show that the tumor clusters established on the expression profiling of 1000 genes with top expression variability are associated with Gleason patterns but not patient races.

Discussion
The primary conclusion of this study is that, prostate tumors being more lethal in AAs than in EAs is reasonable regarding AAs' higher PHG statistic, while high grade alone could not imply aggressiveness. However, this notion is questionable when the comparison is concentrated on tumor samples within the same grade category. The following are the supporting observations and evidence for this conclusion. First, based on the records of 48 www.nature.com/scientificreports www.nature.com/scientificreports/ representative registries (i.e. counties and parishes), the nation-wide PHG in AAs is ~1.2 times of that in EAs; Second, for the patients with tumors in the same grade category (HG or LG), the survival stratification between races is not significant in most counties (or parishes) located in different states; and third, the genes with different expression levels between AA and EA patients with cancers in the same grade category are very limited in number. These results, especially the second one, hold a remarkable implication for erasing racial disparities in prostate cancer. That is, "Equal grade, equal outcomes" is not only a verifiable hypothesis but also an achievable public health goal. In some areas, such as DeKalb County in Georgia and Jefferson Parish in Louisiana, AA and EA patients have the similar outcomes. The experiences of these regions in patient surveillance and therapy could be useful to some other areas, such as Fulton County in Georgia and Wayne County in Michigan, where the mortality disparity is substantial. The reason is that, while the between-race survival differences of same-grade patients observed in the latter counties or parishes reflect the findings of several publications 28,29 , they may purely be an artifact of imbalances in treatment.
Our opinion about the achievability of the "Equal grade, equal outcomes" goal is largely supported by the results of a recent publication 30 , which suggests that racial inequality in prostate cancer outcome is mainly due to socioeconomic imbalances rather than biological factors. Based on a comprehensive analysis of the data of three cohorts, i.e. SEER, a pool of four randomized clinical trials and an equal access health care system, Dess et al. found that prostate cancer-specific mortality by race did not appear to differ in the equal-access health care system, and the outcomes were even better for black than for white patients in well-designed and well-conducted clinical trials. Regarding the analysis performed on the SEER data, Dess et al. used a propensity model to adjust the statistics of patient outcomes for socioeconomic imbalances and clinical factors such as biopsy Gleason score (corresponding to "grade" in our study). As a result, they did not identify substantial racial disparity in prostate cancer specific mortality.
On the other hand, our study and results are subject to the limitations of using the SEER data and some conclusions could be weakened. Especially, the causes of ~20% of death cases are missing in the data but the relevance of prostate cancers in those deaths cannot be excluded. This could lead to an underestimate of PSMR and temper the benefit of a PSMR-based survival analysis. Moreover, SEER does not have a mechanism to guarantee the www.nature.com/scientificreports www.nature.com/scientificreports/ consistence of the protocols in measuring the Gleason scores (GS) of tumors across hospitals and, therefore, miscoding for GS-based grades of tumors is unavoidable. This potentially compromises the accuracy of the disparity evaluation performed in this study.
There are other points relevant to the aforementioned conclusion worthy of further discussion. The first is the biological functions of the genes that show "significant" differences between the AAs' and EAs' tumors. Using the expression dataset Ex-1 (or Ex-2), we identify 49 and 51 (or 50 and 74) differentially-expressed genes in the contrasts, AA&GP-3 versus EA&GP-3 and AA&GP-4 versus EA&GP-4. The functional enrichment analysis of those genes does not show any relevance to tumor mortality. Those genes also hardly overlap with the 46 significant genes identified in a previous study on racial disparity of gene expression 10 , which are enriched with a few gene ontology terms related to immune response. The second is whether there is a disparity in the mutation spectrum of tumor cells. Because of the lack of a predominant cancer gene which is widely mutated over prostate tumors 31 and the limited sample size of AA patients, our preliminary analysis of the TCGA somatic mutation data does not identify any other genes with significantly different mutation frequencies between AAs and EAs. Although ETS fusions, predominated by TMPRSS2-ERG fusions, are two-times popular in EAs' tumors compared to AAs' tumors [32][33][34][35] , their effect on patient outcomes is controversial [36][37][38] . The third is the possibility of attributing the observed disparity in mortality rate to the potential disparity in cancer progression. While, to our knowledge, this point has been barely addressed by previous studies, the possibility is small in our opinion. The reason is that, in general, a GP-3 tumor cannot directly progress into a GP-4 tumor 16,39 .
Due to the complex demographical and socio-economic factors as well as the measurement errors arising from the limited sizes of AA patients in some registries, it is no wonder that there is a substantial variability in R-PHGs and R-TMRs, i.e. registry-specific PHG and TMR measures. However, a positive correlation between the R-PHGs and T-TMRs is expected if there is no demographical disparity in patient surveillance and therapy. Here, we do observe such a relationship in EAs but not in AAs. This represents a unique inequality. The underlying reason may be that, among AA communities, there is substantial variability in access to effective surveillance and therapy that masks the effect of cancer grade on patient survival. Therefore, this disparity could be attributed to AAs' poor socio-economic situation in some communities.
It is well known that PCa patients with tumors diagnosed before the age of 50 years are rare and that such early-onset patients are more frequent in AAs than in EAs. According to the records of the working dataset SEER-WD, 2030 EA men and 1105 AA men are within this age category (i.e. <50 years). They account for 3.0% and 6.1% of the EA patients and AA patients, respectively. Importantly, we find a relatively lower PHG in those early-onset AA patients whose death is attributable to prostate tumors. As shown in Fig. 3C, while the PHG quantity of those AA patients is up to 83% but this level is still lower than that of the patients in other age segments. This pattern sharply contrasts with that observed in EAs, where nearly all of the early-onset patients who died of prostate cancer have the high-grade tumors. The EA's pattern would be expected because a low-grade prostate The rows represent genes and the columns represent tumors. The "score-bar" indicates the Gleason patterns of tumor specimens with red being GP-4 and orange being GP-3. The "race-bar" indicates the races of patients with black being AAs and white being EAs. The printed Odds ratio and p-value are for the associations between the tumor clusters (k = 2) and Gleason pattern categories (GP-3 and GP-4). The associations between the tumor clusters and patient races are not significant. As such, those statistics are not reported.