Robust and interpretable PAM50 reclassification exhibits survival advantage for myoepithelial and immune phenotypes

We introduce a classification of breast tumors into seven classes which are more clearly defined by interpretable mRNA signatures along the PAM50 gene set than the five traditional PAM50 intrinsic subtypes. Each intrinsic subtype is partially concordant with one of our classes, and the two additional classes correspond to division of the classes concordant with the Luminal B and the Normal intrinsic subtypes along expression of the Her2 gene group. Our Normal class shows similarity with the myoepithelial mammary cell phenotype, including TP63 expression (specificity: 80.8% and sensitivity: 82.8%), and exhibits the best overall survival (89.6% at 5 years). Though Luminal A tumors are traditionally considered the least aggressive, our analysis shows that only the Luminal A tumors which are now classified as myoepithelial have this phenotype, while tumors in our luminal class (concordant with Luminal A) may be more aggressive than previously thought. We also find that patients with basal tumors surviving to 48 months exhibit favorable continued survival rates when certain markers for B lymphocytes are present and poor survival rates when they are absent, which is consistent with recent findings.


INTRODUCTION
Multiparametric genetic tests such as the PAM50/Prosigna Risk of Recurrence (ROR) for breast cancer prognostication are becoming commonplace. 1,2 However, due to limited accuracy and poor concordance with biological phenotypes, their clinical utility is still under investigation. 3 In this paper, we address these issues in the context of one of the most prevalent assays, the PAM50 ROR, which is mainly driven by an intrinsic subtype classification along a 50-gene mRNA expression profile. We reclassify these profiles using topological data analysis, incorporating prior knowledge of biological phenotype (basal/luminal stratification). Unlike the five traditional PAM50 intrinsic subtypes, our seven classes are accurately defined by clear patterns of activation and inactivation of gene groups directly interpretable in terms of specific normal mammary cell types: basal, luminal/ER, myoepithelial, and Her2related gene groups.
The basal/luminal terminology refers to mammary cell differentiation from basal-epithelial cells near the basement membrane to the more differentiated luminal-epithelial cells near the lumen or ducts. It was the basis for the systematic molecular classification of breast cancer initiated by Perou et al. 4 Myoepithelial refers to a mammary cell type playing a key role in breast duct secretion. 5,6 Overexpression of Her2 (ERBB2) and a group of related genes marks the Her2 + cohort wellknown since the 1990s for highly favorable response to the drug trastuzumab (herceptin). Figure 1 summarizes the history of the molecular classification and our contribution. Table 1 lists the new classes.

Clearly defined 50-gene signatures
The signature classes we defined show partial concordance with the PAM50 subtypes, with a Normalized Mutual Information (NMI) of 0.19 (29.1 times the maximum NMI found in 10,000 random permutation bootstrapping trials) (Fig. 2). However, our classes show tighter clustering along the 50-gene profile: the k-mean for the PAM50 subtypes is 87.9% of the total variance, and for our classification is only 82.7% (both using the L1 norm). To assess the quality of the signatures themselves, we consider the average silhouette width 7 of each class. The silhouette width is the average distance a(i) between a sample i and the cluster to which it belongs subtracted from the smallest average distance b(i) between i and the other clusters, normalized by max (a(i), b(i)). The average silhouette width over a given cluster (abbreviated SW) measures the tightness of the cluster with respect to the clustering scheme, with larger SW (closer to 1) indicating a good cluster and smaller SW (closer to −1) indicating a poor cluster.
Our Luminal class SW = 0.151 is greater than the PAM50 Luminal A SW by 0.107; Luminal/Basal SW = 0.131 is greater than the PAM50 Luminal B SW by 0.112; Myo/Luminal SW = 0.0422 is greater than the PAM50 Normal SW by 0.0432 (silhouette widths range from −1 to 1). The SWs of our Her2 and Basal/Myo SWs are very close to the SW of the PAM50 Her2 and Basal subtypes.
As shown in Fig. 2  History of the molecular classification of breast cancer. Names are shown at the chronological level at which they were introduced. The Her2+ breast tumors were already well-known in the 1990s for highly favorable response to the drug trastuzumab (herceptin), which was approved by the FDA for metastatic Her2+ breast cancer in 1998. The hierarchical clustering of Perou et al. 4 used genes whose expression differentiates between samples from different tumors better than between samples from the same tumor, finding four main classes: ERBB2+ (or Her2+), Basal, Luminal, and Normal breast-like. Sorlie et al. 26 explicitly incorporated clinically relevant outcome data such as overall survival, uncovering three Luminal subtypes, Luminal A, B, and C. Luminal A has higher overall survival than Luminal B, and Luminal B has higher overall survival than Luminal C. Later investigators found only two Luminal subtypes to be sufficiently robust. Parker et al. 27 introduced the 50 gene set that became known as the PAM50 (Prediction Analysis of Microarray) and introduced a straightforward centroid-based classifier for breast tumor RNA expression patterns along the PAM50 with five classes: Basal, Her2, Luminal A, Luminal B, and Normal. The authors used this classification as a key component in the model that became the Prosigna predictor of Risk Of Relapse (ROR). Prat and Perou 9 introduced the Claudin-low subtype carved largely out of the Basal group. The authors find that the Claudin-low subtype has poor prognosis compared to Luminal A, but no worse than the other subtypes. The Topological Data Analysis of Nicolau et al. 17 confirmed the distinction between more luminal, more basal, and more normal-like subtypes along branches of a graph structure modeling the distribution of breast tumor samples. They found a subgroup of patients exhibiting a very high survival rate, largely characterized by expression of MYB. Our proposed classification uses the method of Nicolau et al. 17 and incorporates gene sets and priors (e.g., the basal-to-luminal stratification) known to be relevant to breast cancer biology. (Below right) Our proposed system with seven classes defined by four elementary phenotypes (see also Figs 2, 7)  (Fig. 4).
To investigate the Myo/Luminal class further, we drew upon the classification of normal mammary cell types of Santagata et al. 5 in terms of marker genes/proteins ESR1, AR, VDR, KRT5, MKI67, KRT18, MME, SMN1, and TP63. Figure 5 shows the Mapper analysis of the 290 normal breast tissue samples of the GTEx RNA expression database. 8 We found normal tissue expression patterns were similar to one of our class' signatures along the PAM50, and also similar to one of the cell type patterns of Santagata et al. 5 along their marker genes. One of the clearest patterns was activation of only the basal gene group along the normal cell type denoted L1, characterized by expression of the proliferation marker MKI67. In addition, a clear subset of samples, displaying a superposition of the pattern of normal myoepithelial cell-type M2 and normal cell-type L7 (KRT5+/VDR+), also displayed the signature Myo/Luminal/Her2. The main characteristic of M2 is expression of TP63. We found that TP63 expression can be used as a single marker for the Myo/Luminal class (specificity: 80.8%, sensitivity: 82.8%).
Basal/myoepithelial (triple-negative) subclass with immunerelated survival advantage Since the Myo/Luminal class is heterogeneous with respect to FOXC1, MIA, and PHGDH expression, we expected that FOXC1 +/MIA+/PHGDH+ would be associated with a more aggressive phenotype (Fig. 6). After all, these genes are highly expressed in the PAM50 Basal subtype (Basal/Myo). We found that while this is true for the first 48 months after diagnosis, the FOXC1+, MIA+, and PHGDH+ phenotypes all showed very favorable survival rates contingent on survival to 48 months (Fig. 6). We hypothesized that this phenomenon might generalize to the PAM50 Basal subtype. To test this, we sought genes from the set of 18,543 genes available for the METABRIC cohort which would separate the longterm and short-term survivors in the FOXC1+/MIA+/PHGDH+ group. The 100 most significant genes with respect to the t test for difference of mean expression (−log 10 (p) value > 6.7) included the genes coding for the B-cell antigen receptor complex-associated protein alpha and beta chains, the B-cell-specific coactivator OBF-1, the pre-B-lymphocyte-specific protein-2, and B-cell maturation factor (CD79A, CD79B, POU2AF1, IGLL1, and TNFRSF17), as well as CD38, expressed by many immune cells. (In fact, CD79A is one of the major positive expression markers for the Claudin-low subtype introduced by Prat and Perou. 9 The Claudin-low subtype and our CD79A+/CD38+/IGLL1+ type are both subgroups of the Basal group).   TP63 is one of the myoepithelial markers in the work of Santagata et al. 5 and also a key marker for our Myo/Luminal class. From the Kaplan-Meier analysis in Fig. 4, we conclude that TP63 expression confers a survival advantage even greater than the well-known survival advantage conferred by PGR expression across the whole METABRIC cohort.
The PAM50 subtype most resembling the Myo/Luminal class is Normal-like. The status of the Normal-like subtype has been uncertain since its introduction by Perou et al. 4 It is often thought to represent non-cancer tissue which is incidentally present in bulk tissue samples. For example, the PAM50 classifier uses actual normal tissue samples to train the centroid of the Normal class. We found a significantly lower death rate after 4 years for patients with basal tumors expressing key B-lymphocyte-related markers CD79A, CD79B, POU2AF1, IGLL1, and TNFRSF17. This group is 80.3% of all patients with basal tumors surviving to 4 years. We conclude that the remaining 19.7% of these patients, with basal tumors lacking these markers, are still at high risk of mortality. This observation is consistent with the finding of Rueda et al. 13 that a certain subgroup of triple-negative breast cancers can be defined which rarely recurs after 5 years.
Regarding future work, responses to specific drugs or therapies should be investigated to decide whether some patients with Luminal but not Myo/Luminal tumors are under treated. Moreover, future work should address the question of why the four main gene groups appear. One possible explanation is that the four prototypical expression patterns Luminal, Basal, Myoepithelial, and Her2-related represent types of clones derived from an original transformation, and the combinations of these prototypes correspond to a certain clonal mixture. Another possibility is that the observed expression patterns are superpositions of actual tumor expression, expression of tumor microenvironmental normal cells with types related to the four prototypes, or expression patterns similar to original normal ancestor cells.
New techniques of single-cell sequencing, potentially in conjunction with tumor-level spatial mapping, may provide answers to these questions. Finally, the differential prognosis among triple-negative tumors observed with respect to the B-lymphocyte-related stratification suggests that the immune systems of~51% of patients with triplenegative tumors can naturally and reliably mount a successful response to the tumor. If this hypothesis is correct, a longitudinal study monitoring the immune system of triple-negative patients should be able to discover exactly what response is mounted, which could lead to potential new therapies that induce this natural response.

Topological data analysis
Topological data analysis (TDA) methods, employing ideas from the mathematical field of topology, have gained popularity in recent years. More precisely, discrete algorithmic counterparts of topological concepts have emerged in response to the availability of large data sets harboring hidden structures. Mapper, 14 a discrete analogue of a Morse-theoretic analysis of a manifold with respect to a height function, or "filter" function, has received particular attention with regards to both its theoretical foundations 15,16 and, following Nicolau et al., 17 its application to cancer CD79A (immunoglobulin-alpha), CD38, and IGLL1 (immunoglobulin lambda-like polypeptide 1). FOXC1+/MIA+/PHGDH+ is also observed in the PAM50 Basal subtype. Within the Basal subtype, CD79A+, CD38+, and IGLL1+ confer a significant survival advantage after 48 months genomics. [18][19][20] Mapper builds a graphical summary of a given sample set with respect to a chosen stratification (filter) function.
We use three sample sets: TCGA, METABRIC, 21,22 and GTEx. 8 The 1082 TCGA and 1904 METABRIC mRNA expression z-score data sets along the PAM50 gene set were retrieved from cBioPortal. 23,24 The 290 GTEx normal breast data set was downloaded from the GTEx portal; metadata supporting data files may be found in. 25 Due to the retrospective nature of this study using only publicly available data, ethics approval for the study was not required.

Filter function
The "filter function" or initial stratification is taken to be a basal-luminal epithelial differentiation score, calculated as the average expression zscore of luminal-epithelial markers (XBP1, FOXA1, GATA3, ESR1, and ANXA9) minus the average expression z-score of basal-epithelial markers (KRT17, KRT5, DST, ITGB4, LAMC2, CDH3, LAD1, and ITGA7). Selected largely on the basis of Perou et al., 4 the basal markers are all associated with anchorage of epithelial cell layers to the basement membrane, while the luminal markers are all expressed in well-differentiated or mature luminal epithelial cells.
The Mapper graph and 50-gene signatures determined from the METABRIC breast tumor samples are shown in Fig. 7. Correlation-based clustering along small contiguous subsets with respect to the graph yielded the five main gene groups.
A simple classifier was constructed from the table of observed signatures (see Fig. 7) as follows: For a given sample and a given signature or profile, the average values for each gene group are calculated, then added together with the signature signs as weights. The resulting number is a similarity score between the sample and the signature. The sample is assigned to the highest-scoring signature.
Finally, the classes and gene groups shown in Fig. 2 were adjusted: The two myoepithelial gene groups were merged, the Myo/Luminal A and Myo/Luminal B classes were merged as a result, and Luminal expression was used to delineate classes Basal/Her2 and Basal/Luminal/Her2.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

DATA AVAILABILITY
The data generated and analysed during this study are publicly available in GitHub as described in the following data record: https://doi.org/10.6084/m9. figshare.9199289. 25 The study used three publicly available datasets and the raw data can be accessed from cBioPortal at https://identifiers.org/cbioportal: brca_metabric (METABRIC data) and at https://identifiers.org/cbioportal: brca_tcga_pan_can_atlas_2018 (TCGA data). The raw normal breast data can be accessed from the GTEx portal at https://gtexportal.org/home/datasets. 8,21,22

CODE AVAILABILITY
The code written in Python and R is available upon request. The analysis methodology is described in detail in the Supplementary Information file. Fig. 7 Mapper analysis of the 1904 METABRIC breast tumor samples, along the PAM50 gene set, using the basal-luminal score as filter function. The circular nodes represent clusters in the strata or bins defined by the filter function at the chosen level of resolution. For example, there are three clusters in the stratum shown in yellow; two clusters shown higher and labeled with unsupervised signature number [4], and one cluster shown lower labeled with unsupervised signature number [5]. All three have the same basal-luminal score range, indicated by color. In Table 1, the salient signatures are recorded. These signatures differ slightly, in two ways, from the seven classes we finally propose as in Fig. 1. First, for the sake of simplicity we merge the two myoepithelial-related gene groups (b, c) into a single-gene group, consequently merging Myo/Luminal A and Myo/Luminal B into Myo/Luminal. Second, on account of the salient signatures observed in the heatmaps in Fig.  2, we split Her2/Basal [5] into Her2/Basal and Luminal/Basal/Her2. Where blanks appear, the corresponding gene group (a-e) is neither uniformly positively nor uniformly negatively expressed