Machine learning for RNA sequencing-based intrinsic subtyping of breast cancer

Stratification of breast cancer (BC) into molecular subtypes by multigene expression assays is of demonstrated clinical utility. In principle, global RNA sequencing (RNA-seq) should enable reconstructing existing transcriptional classifications of BC samples. Yet, it is not clear whether adapting to RNA-seq classifiers originally developed for PCR or microarrays is preferable to rebuilding them through machine learning (ML). Hence, we focused on the robustness and portability of PAM50, a nearest-centroid classifier developed on microarray data to identify five BC "intrinsic subtypes". We found that standard PAM50 is profoundly affected by the composition of the sample cohort used for reference construction, and we propose a strategy, named AWCA, to mitigate this issue, improving classification robustness (over 90% concordance) and prognostic ability; we also show that AWCA-based PAM50 can even be applied as a single-sample method. Furthermore, we explored five supervised learners to build robust, single-sample intrinsic subtype callers via RNA-seq. From our ML-based survey, regularized multiclass logistic regression (mLR) displayed the best performance, further increased by ad hoc gene selection on the global transcriptome. On external test sets, mLR classifications reached 90% concordance with PAM50-based calls without the need for a reference sample; mLR's proven robustness and prognostic ability make it an equally valuable single-sample method to strengthen BC subtyping.


S1.1 TCGA - Breast Invasive Carcinoma
The expression dataset of interest includes 817 RNA-seq V2 RSEM profiles of breast invasive carcinoma samples, subject to upper-quartile normalization and log2 transformation. According to the supplementary materials of Ciriello et al. (2015), RNA sequencing was performed at the University of North Carolina at Chapel Hill on the Illumina HiSeq 2000; the RNA-seq profiles under study were obtained from levels of transcripts sequenced genome-wide with the Illumina mRNA-seq method and processed as follows. Sequencing reads were aligned to the human hg19 genome assembly using MapSplice. Gene expression was quantified for the transcript models corresponding to the TCGA GAF 2.1 using RSEM and normalized within samples to a fixed upper quartile (UQ normalization). For our analyses, upper-quartile-normalized RSEM data were log2-transformed. Genes with a value of zero following log2 transformation were set to missing, and genes with missing values in more than 20% of samples were excluded from downstream analyses, consistent with the procedure of Ciriello et al. (2015). Thus, we ultimately used a dataset comprising 19,737 genes per sample. Furthermore, Ciriello et al. performed a standard PAM50-based classification of the samples in the dataset, whose subtype calls are reported in Table S1.
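The filtering step just described can be sketched in a few lines of Python (a minimal illustration assuming a pandas genes-by-samples matrix; function and variable names are ours):

```python
import numpy as np
import pandas as pd

def preprocess_rsem(rsem_uq, max_missing_frac=0.20):
    """Log2-transform UQ-normalized RSEM values and filter genes,
    mirroring the preprocessing described for the TCGA dataset.
    `rsem_uq`: DataFrame, genes x samples, upper-quartile-normalized RSEM."""
    log2_expr = np.log2(rsem_uq)
    # Values of zero after log2 transformation are set to missing (NA).
    log2_expr = log2_expr.mask(log2_expr == 0)
    # Genes missing in more than 20% of samples are excluded.
    keep = log2_expr.isna().mean(axis=1) <= max_missing_frac
    return log2_expr.loc[keep]
```

Applying this to the full RSEM matrix would retain the 19,737-gene set used downstream.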

S1.2 GEO dataset GSE96058
Illumina paired-end mRNA sequencing and expression estimation were performed for a cohort of 3,273 breast cancer samples from the Multicenter Sweden Cancerome Analysis Network-Breast Initiative (Brueffer et al., 2018) and collected under GEO dataset accession number GSE96058. Gene expression data consist of FPKM profiles generated using the Cufflinks 2.2.1 software. The resulting data were post-processed by collapsing onto 30,865 unique gene symbols (summing the FPKM values of all matching transcripts), adding 0.1 FPKM to each expression measurement, and performing a log2 transformation. PAM50 subtyping was performed by Brueffer et al. according to the standard PAM50-based classification, using a fixed reference selected to match the original cohort used in Parker et al. (2009). The distribution of subtype calls is reported in Table S1.
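The post-processing of the FPKM profiles (collapse to gene symbols, pseudocount, log2) can be sketched as follows (a pandas-based illustration; function and variable names are ours):

```python
import numpy as np
import pandas as pd

def postprocess_fpkm(transcript_fpkm, symbol_of):
    """Collapse transcript-level FPKM to gene symbols (summing the FPKM
    values of all matching transcripts), add a 0.1 FPKM pseudocount and
    log2-transform, as described for GSE96058.
    `transcript_fpkm`: DataFrame, transcripts x samples;
    `symbol_of`: dict mapping transcript IDs to gene symbols."""
    gene_fpkm = transcript_fpkm.groupby(
        transcript_fpkm.index.map(symbol_of)).sum()
    return np.log2(gene_fpkm + 0.1)
```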

S1.3 PanCA dataset and GEO dataset GSE81538
The PanCA dataset includes 236 breast cancer mRNA-seq profiles selected from the Pan-Cancer Atlas, without overlap with the samples already included in the TCGA dataset used here. They were processed with the RSEM pipeline, UQ-normalized, and log2-transformed. The GSE81538 dataset, instead, contains 405 breast cancer mRNA-seq profiles subject to FPKM normalization and log2 transformation, collected under GEO dataset accession number GSE81538. For both datasets, published subtype calls, assigned according to standard PAM50-based classifications by the original collectors, are available and reported in Table S1. Figures S1 and S2 show the use of the described datasets in the computational paths illustrated below.

Figure S1. Emulation and machine learning paths. Parallel workflows over the TCGA dataset.

Figure S2. PAM50-based and machine learning paths. Roles of all the available datasets using their disclosed subtype calls and target labels.

S2 PAM50 emulation

S2.1 Missing gene expressions
For our PAM50 emulation, we needed to restrict the gene set of interest to the PAM50 panel. Consequently, whenever we faced missing gene expression data, we handled it in log2-space as follows.
Missing log2-values were excluded from reference sample estimation by defining modified functions that compute mean, median and standard deviation over a restricted subset of the available values. In this way, whenever a specific gene had an NA (not available) value in one of the samples under examination, the contribution of that sample for that gene was discarded from the computation of the overall mean, median, or standard deviation. On the contrary, to compute the correlations between a sample profile and each subtype centroid, we substituted missing values with zeros, since each NA value was in turn caused by the absence of the absolute gene expression value for that gene in the involved sample. This replacement is standard practice in other pipelines for the treatment of RNA-seq counts and in NanoString data handling, where it is suggested to substitute absolute gene expression values that are null or below 1 with a unit constant (i.e., a null log2-value).
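The two missing-value policies just described can be sketched as follows (a NumPy illustration; function names are ours):

```python
import numpy as np

def reference_stats(expr):
    """NA-aware per-gene summary statistics for reference construction:
    missing log2-values are simply excluded from each gene's mean, median
    and standard deviation.
    `expr`: 2D array, genes x samples (NaN = missing)."""
    return (np.nanmean(expr, axis=1),
            np.nanmedian(expr, axis=1),
            np.nanstd(expr, axis=1, ddof=1))

def fill_for_correlation(profile):
    """Before correlating a sample profile with the subtype centroids,
    missing log2-values are replaced with zero (i.e., a unit absolute
    expression), as in NanoString-style data handling."""
    return np.nan_to_num(profile, nan=0.0)
```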
S2.2 Subtyping on the TCGA dataset
PAM50 classifications were performed using the class centroids developed by Parker et al. (2009) and reported in Table S2. Performing PAM50 classification using ten references computed as medians of ten random subsets with a 60% ER+ / 40% ER− proportion did not achieve complete, nor even 90%, concordance with the subtype calls of Ciriello et al. (2015) (84%–87%). On the contrary, using the medians of ten random subsets with the same balanced ER+/ER− proportion as the subset of samples employed by Ciriello et al. (2015), we found an average concordance of 95%. Finally, using a subset built by including 262 samples from the specific sample set selected by Ciriello et al. (2015) brought nearly perfect concordance (99.3%), although we had to exclude from the median computation 52 samples used by Ciriello et al. (2015) but not available among the 817 under study.
Then, ten additional PAM50 classifications were performed with references built as an average of within-class averages (AWCAs), using each time as starting class assignments the subtype calls obtained from the previous PAM50 classifications involving the random subsets with 60% ER+ / 40% ER−. The process is schematized in Figure S3. This reference-building procedure aims to increase robustness through averaging, and effectively led to significantly more stable and accurate results (91%–93%) in the corresponding PAM50 classifications, even without any care for the subset originally chosen by Ciriello et al. (2015). Eventually, we built other random 60% ER+ / 40% ER− subsets with progressive size halving (from 400 down to 25 samples), to investigate the robustness of AWCA reference construction. For each size, from ten random subsets we computed the corresponding median-based PAM50 classifications, and we used them to build ten new AWCA references. For every size, PAM50 classifications based on the newly generated AWCA references showed much lower dispersion and were approximately 5% more concordant with the already published calls than the corresponding median-based classifications (Figure S4). Furthermore, AWCA references brought robust results with subsets of 200 samples and even with only 50 samples; conversely, median-based PAM50 classifications needed a subset of 400 samples to reach stable results, while subsets of only 50–25 samples became critical. Considering that the starting subtype assignments do not meaningfully affect the final PAM50 classification under the AWCA strategy, we eventually built two average references within the TCGA dataset, according to the published subtype calls of Ciriello et al. (2015): one excluding the Normal-like class and one including it. These AWCAs were used for two other PAM50 classifications of the TCGA dataset.
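The AWCA reference construction can be sketched compactly (a pandas-based illustration of the average of within-class averages; function and variable names are ours):

```python
import numpy as np
import pandas as pd

def awca_reference(expr, subtype_calls, exclude=()):
    """Average of within-class averages (AWCA): average the log2 gene
    expressions within each subtype class, then average these class
    means, so that each intrinsic subtype contributes equally regardless
    of its prevalence in the cohort.
    `expr`: DataFrame, genes x samples;
    `subtype_calls`: Series mapping sample -> subtype label;
    `exclude`: labels to leave out (e.g., the Normal-like class)."""
    classes = [c for c in subtype_calls.unique() if c not in exclude]
    class_means = [expr.loc[:, subtype_calls == c].mean(axis=1)
                   for c in classes]
    return pd.concat(class_means, axis=1).mean(axis=1)
```

Passing `exclude=("Normal",)` would reproduce the variant that leaves out the Normal-like class.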
Although the reference built as the average of only the four intrinsic subtype classes led to higher classification accuracy, in both cases the overall error rate fluctuated between 7% and 9%, maintaining over 90% concordance. Non-concordant calls primarily involved the Luminal A (LumA) and Luminal B (LumB) classes, which share similar molecular traits as Luminal tumors; furthermore, most of the disagreements coincided with those already found in the previous PAM50 classifications based on the subsets under study. Confusion matrices are reported in Figure S5. All previous results confirmed that the samples used for building the reference may meaningfully affect the subsequent PAM50 classification and may lead to some inconsistencies across multiple instances of subtype calling. However, most of the inconsistencies were widely recurrent across reference changes, and discordant classifications typically involved samples showing comparable correlations with more than one subtype, as clearly emerges from Figure S6. Notice that the average correlations are here computed considering the ten PAM50 classifications using as references the medians of different 60% ER+ / 40% ER− subsets (subset size 400), whereas the cases with at least one discordant subtype call are evaluated considering also the target label, i.e., the Ciriello et al. subtype.

Figure S6. Discordant subtype calls in multiple PAM50 classifications mainly emerge in case of comparable average correlations with more than one subtype.

Figure S7 shows the distribution of subtypes when concordant calls are observed between 10 PAM50 classifications and the Ciriello et al. subtypes, once the adopted strategy of reference building is fixed. Specifically, the concordant calls with the Ciriello et al. subtypes for the ten PAM50 classifications using as references the medians of different 60% ER+ / 40% ER− subsets are reported in blue, whereas the concordances of the PAM50 classifications using AWCA references are reported in orange. Finally, without considering the concordance with the Ciriello et al. subtype calls, Figure S8 reports the maximum number of concordant calls within 10 PAM50 classifications, when the strategy of reference building is fixed.

Figure S8. Amount of concordant calls in 10 PAM50 classifications, using two different strategies for reference computation.
In conclusion, discordant classifications mainly involved samples having comparable correlations with multiple subtypes; for some samples, the boundary between two or more classes appeared labile, possibly due to the coexistence of mixed traits.
This non-separability among subtypes also emerged clearly from the Principal Component Analysis that we performed independently on the RSEM values of the TCGA dataset and on the FPKM profiles of the GSE96058 dataset. As the plots of the first two principal components in Figure S9 show, RSEM and FPKM data do not mix well. This compelled us to learn independently the parameters of the models dealing with RSEM or with FPKM data, as discussed later in the supervised learning context.

S3 In-silico Prosigna emulation and Risk of Recurrence estimation

S3.1 Prosigna test and Risk of Recurrence models
The PAM50 assay was also converted into an alternative FDA-approved predictive test called Prosigna (Wallden et al., 2015), working on expression data profiled by NanoString nCounter® technology and implementing an alternative Nearest Shrunken Centroid classification (Tibshirani et al., 2002). The Prosigna test is used to define a category of ten-year metastatic risk in women undergoing surgery for invasive BC and focuses on a subset of the PAM50 panel called NANO46. Eventually, the test provides two distinct but correlated pieces of information: the BC intrinsic subtype and the category of risk of recurrence (low, medium or high), the latter further differentiated based on lymph node involvement. The assay exploits a trademarked technology and a proprietary algorithm for breast cancer intrinsic subtyping, with the following peculiarities:
• The normalization procedures are specifically designed for expression data collected and processed on NanoString nCounter platforms
• The reference included in the Prosigna kit consists of in-vitro transcribed RNA targets, to be processed together with the sample under study
• The classification predicted by Prosigna excludes the Normal-like class, which is instead present in other PAM50-based approaches
• The Nearest Shrunken Centroid approach uses Pearson linear correlation as distance metric to assign one of the four intrinsic subtypes.
Then, the ROR score of a patient is computed from a weighted sum of the NANO46 Pearson correlations to each intrinsic subtype centroid, a proliferation score and tumor size parameters. The Prosigna predictive model was trained on NanoString profiles over the nCounter® platform and its accuracy was confirmed by subsequent analytical and clinical validation studies (Wallden et al., 2015). However, it is based on the former PAM50-based ROR-C predictor of Parker et al., a Cox regression model that estimates the risk-of-recurrence score (ROR-C) as a weighted sum of the Spearman correlations with the subtype centroids and a tumor size parameter:

ROR-C = 0.05·ρBasal + 0.11·ρHer2 − 0.23·ρLumA + 0.09·ρLumB + 0.17·T

where ρk is the Spearman correlation with the centroid of subtype k and T is the tumor size parameter. In the following, we discuss our emulation on TCGA RNA-seq data of a Prosigna-inspired intrinsic subtyping and, eventually, we compare the so-obtained results with the ones found using the original PAM50 approach with the same AWCA reference.
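The ROR-C model above reduces to a one-line function (a sketch based on the coefficients reported in the text; the correlation values and the tumor size encoding are assumed inputs):

```python
def ror_c(rho_basal, rho_her2, rho_luma, rho_lumb, tumor_size):
    """PAM50-based ROR-C score of Parker et al.: a weighted sum of the
    Spearman correlations with the four subtype centroids plus a tumor
    size term, with the coefficients reported in the text above."""
    return (0.05 * rho_basal + 0.11 * rho_her2
            - 0.23 * rho_luma + 0.09 * rho_lumb
            + 0.17 * tumor_size)
```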

S3.1.1 Data pre-processing and normalization
After discarding Normal-like samples and solving the missing-value issue as previously described, we focused not only on the PAM50 panel but also on eight housekeeping genes, needed to replicate the Prosigna normalization steps. The Prosigna test first requires dividing each PAM50 sample profile by the geometric mean of the housekeeper absolute expression values. Since we worked in log2-space, we equivalently computed the arithmetic mean of the housekeeper log2-values and subtracted it from each log2-transformed sample profile. However, in our in-silico emulation on RNA-seq profiles, this normalization is a mere translation in logarithmic space, which did not bring any advantage: a fixed-value subtraction affects neither the Pearson correlation to be computed nor the subsequent subtyping procedure. Furthermore, all the profiles were already UQ-normalized within-sample. Housekeeper normalization is instead required for the Prosigna test on the NanoString platform to correct input variability.
Next, we calculated the Log2ratios with respect to a reference sample. Although each real Prosigna assay run includes as reference the in-vitro transcribed RNAs of all 58 target genes, to be processed in the NanoString platform together with the sample, in our emulation we were forced to use a fixed reference. Thus, we used an AWCA-based reference, built by averaging in log2-space the gene expressions inside each subtype class and then taking the average of these within-class averages, to equate the contributions of the intrinsic subtypes despite their unbalanced proportions. In this case, however, the Normal-like class was a priori excluded from the AWCA computation. Our reference sample was subtracted from the normalized profiles to obtain the corresponding Log2ratios. This step is the most relevant for the subsequent subtype calls, since it provides the differential alteration of each gene in each profile under study. Furthermore, in this case we noticed that any sweep of the reference value would greatly affect the following subtype calls, even more than for the PAM50 method, showing the lack of robustness of a Prosigna-inspired approach in the absence of its in-vitro reference.
Lastly, we computed the z-scores from the Log2ratio values of each sample under study, as required by the Prosigna test to uncouple each gene from the sample mean and variance of a single realization. However, for our subtyping of RNA-seq profiles this step appeared not meaningful, probably because each profile came from the same, already normalized, experimental cohort.
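The three normalization steps of S3.1.1 (housekeeper mean subtraction, Log2ratio against the fixed reference, within-sample z-scoring) can be sketched as one function (a NumPy illustration; function and variable names are ours):

```python
import numpy as np

def prosigna_normalize(log2_profile, housekeeper_idx, reference):
    """Prosigna-inspired normalization in log2-space, as emulated in the
    text: (1) subtract the arithmetic mean of the housekeeper log2-values
    (equivalent to dividing by their geometric mean in linear space);
    (2) subtract the fixed AWCA reference to obtain Log2ratios;
    (3) z-score the Log2ratios within the sample.
    `log2_profile`, `reference`: 1D arrays over the gene panel;
    `housekeeper_idx`: indices of the eight housekeeping genes."""
    hk_norm = log2_profile - log2_profile[housekeeper_idx].mean()
    log2ratio = hk_norm - reference
    return (log2ratio - log2ratio.mean()) / log2ratio.std(ddof=1)
```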

S3.1.2 Prosigna classification with the Nearest Shrunken Centroid method
Once all the normalization steps were concluded, we performed the subtyping procedure according to the Prosigna implementation of the Nearest Shrunken Centroid method. Test guidelines specify to calculate the Pearson correlations between a sample z-score and each centroid of the four intrinsic subtypes. Prosigna centroids are, however, disclosed only for the NANO46 gene list (a subset of the PAM50 panel). Thus, we faced an additional missing-value problem. At first, we replaced the four missing genes in the Prosigna centroids with the mean normalized expression of each of the four genes among all the 792 samples under study. Yet, this choice cancels from the summation the contribution of the standard scores of each of the four genes of interest, as can easily be seen from the following expression of the Pearson correlation:

r = (1/(n−1)) · Σ_i [(x_i − mean(x))/s_x] · [(y_i − mean(y))/s_y]

In this way, each Pearson coefficient was obtained from the division by n−1 with n equal to 50, because of the 50 genes of the PAM50 panel. Thus, the only difference with respect to completely discarding the expression of the four genes without centroid values was a reduced magnitude of the obtained correlations. Since the Prosigna guidelines indicate the NANO46 genes as the only ones significantly implied in clinical outcomes, we narrowed our analysis down to these genes only. For each sample we calculated the Pearson correlations between the z-score vector restricted to the NANO46 subset and all the disclosed Prosigna centroids, as required also for the subsequent computation of the ROR score. Then, each sample was assigned to the subtype with the highest correlation coefficient.
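The final assignment step can be sketched as follows (a NumPy illustration of correlation-based nearest-centroid calling over the NANO46 genes; function and variable names are ours):

```python
import numpy as np

def call_subtype(zscores, centroids):
    """Assign the subtype whose centroid has the highest Pearson
    correlation with the sample z-score vector (restricted to the
    genes shared with the disclosed centroids).
    `centroids`: dict mapping subtype name -> centroid vector."""
    corr = {name: np.corrcoef(zscores, c)[0, 1]
            for name, c in centroids.items()}
    return max(corr, key=corr.get), corr
```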

S3.1.3 Prosigna classification of the TCGA dataset
To summarize, for the TCGA dataset we discarded the 25 Normal-like specimens, whereas for the remaining 792 samples we performed the Prosigna subtyping using the average of within-class averages as reference.

S3.2 Conclusions from the Prosigna emulation and Risk of recurrence comparative analysis
Prosigna seemed an appealing candidate to work on RNA-seq data, since it is designed for highly sensitive and reproducible NanoString profiles, able to quantify very low RNA concentrations similarly to RNA-seq. Nonetheless, we had to exclude Normal-like samples from our in-silico emulation, since the Prosigna test does not include the Normal-like class. After replicating all the normalization steps required by the Prosigna test, we performed the Prosigna implementation of the Nearest Shrunken Centroid subtyping method. However, it is relevant to stress that we were again forced to use the AWCA reference sample, computed within the dataset as the average of within-class averages, whereas the real Prosigna test provides an in-vitro reference to be processed with the sample under study directly inside the proprietary platform. Eventually, comparing our Prosigna-emulation subtyping results with the published PAM50 subtype calls of Ciriello et al. (2015), we found an overall concordance of 86% and noticed a slightly pessimistic prediction trend, both with respect to the published subtype calls and to the predictions obtained with the PAM50 method using the same AWCA reference sample, as shown in Figure S11.
This pessimistic trend could be caused by the reference choice or, partially, by how the Prosigna test and its centroids were designed, to handle false-positive low-risk cases in subsequent clinical outcome predictions. Ultimately, it is plausible that the Prosigna normalization steps and in-vitro references are all intended to favour linearity between the Prosigna centroids and each transformed profile under study, allowing the use of Pearson correlation as distance metric, but penalizing the portability of this approach to expression data different from those processed on its proprietary platform.
Finally, we also performed a comparative analysis of the Risk of Recurrence (ROR) scores computed downstream of either the AWCA-based PAM50 classification, the standard PAM50 technical replica (strictly emulating the published PAM50 classification), or the Prosigna subtyping. The ROR scores of the PAM50 and Prosigna assays were obtained according to the predictive models presented in S3.1. They were then tested against the 10-year overall survival annotations of the TCGA dataset to compare their prognostic ability. Prosigna ROR scores proved the weakest in distinguishing good and poor long-term clinical outcomes.
Conversely, the ROR scores from the standard and AWCA-based PAM50 classifications were highly correlated with each other, with the AWCA-based ROR scores achieving the most statistically significant p-value in discriminating good and poor prognosis cases based on the 10-year overall survival annotations. In conclusion, the graphs in Figure S12 clearly show some remarkable differences between the centroid values of the original PAM50 subtypes (in blue) and the Prosigna centroids (in orange). This comparison involves all the NANO46 genes, the subset of PAM50 genes for which the Prosigna centroids were made available.
Consequently, we went on with the original PAM50 method and its centroids, moving towards the analysis of the other datasets at our disposal, as described in the main paper. All the relevant results of this wide investigation are reported in Section S4, here below.

S4 PAM50 classifications of the GSE96058 dataset
Two PAM50 classifications of the GSE96058 dataset were performed using average of within-class averages (AWCA) sample references, alternatively including or excluding the Normal-like class, as previously done for the TCGA dataset. This pair of AWCA references was built within the GSE96058 dataset itself according to its published subtype calls, whose distribution is summarized in Figure S13. Both PAM50 classifications of the GSE96058 dataset confirmed the effectiveness of AWCA references in replicating a PAM50 classification with high accuracy even in the absence of the exact reference originally used, as shown in Figure S14. The AWCA references built using the RSEM values of the TCGA dataset were then used for subtyping the RSEM PanCA dataset, and the AWCA references obtained from the FPKM values of the GSE96058 dataset for subtyping the FPKM GSE81538 profiles (Figure S15). The resulting accuracies were high, comparable both to the accuracies reached in classifying the TCGA and GSE96058 datasets (i.e., the datasets involved in the reference computations) and to the accuracies reached in classifying the PanCA and GSE81538 datasets using their inner AWCA references. Notably, the high concordances found with the published subtype calls showed that it is possible to use even an external reference to center RNA-seq data for robust, single-sample PAM50 classification, provided that this external reference was built with RNA-seq data subject to the same normalization (RSEM or FPKM) as the data under PAM50 analysis. Furthermore, this hints at the possibility of building and validating pre-defined reference samples able to assure future repeatability of classification.
The best-performing AWCA references for RSEM and FPKM data (built on TCGA excluding the Normal-like class, or on GSE96058 including Normal-like samples) are provided in Table S3 here below.

S5 Supervised learning for intrinsic subtyping
To perform breast cancer (BC) intrinsic subtyping, different supervised methods were considered and compared. As always in machine learning problems, the "no free lunch" theorem suggested investigating multiple classification models to find a learner whose generalization accuracy, assessed on unseen samples, properly suited our specific task. Additionally, we tried alternative feature selection approaches to find a set of genes relevant to our subtyping task, to be used as feature space to improve the performance of the machine learning model under evaluation. In this perspective, most of the assessed learners were trained with embedded regularizations.
S5.1 Embedded regularization methods
Embedded regularization methods are often used to learn which features best contribute to the accuracy of the model while the model is being fitted. These canonical feature selection methods introduce additional penalization constraints into the optimization function of a predictive algorithm, so as to shrink some parameters towards zero, or even push the model toward lower complexity by driving some parameters exactly to zero. Consequently, they reduce variance and the risk of overfitting. Practically, if we consider a specific task to be accomplished through a model with parameter vector θ, and we take as optimization objective a suitable loss function L(θ), we can add a regularization term R(θ) to control overfitting, such that the total loss function to be minimized takes the form:

L_tot(θ) = L(θ) + λ·R(θ)

where λ is the regularization hyperparameter in charge of weighing the whole penalization term. Since λ controls the strength of shrinkage and feature selection, it must be properly tuned.
In the case of weight decay (or Ridge, L2-regularization), the penalization term is:

R(θ) = ‖θ‖₂²

It provides parameter shrinkage by reducing the overall squared Euclidean norm of the parameters and, as suggested by the name, in a sequential learning algorithm it encourages parameter values (weights) to decay toward zero. Nonetheless, no parameter, and consequently no feature, is effectively driven exactly to zero.
In the case of Lasso (or L1-regularization), the penalization term is:

R(θ) = ‖θ‖₁

where ‖θ‖₁ is simply the sum of the absolute values of all the model parameters. This regularization, for λ sufficiently large, drives some parameters to zero, leading to a sparse model in which the corresponding variables play no role. Nonetheless, Lasso fails to perform grouped selection, tending to select only one variable from a group of correlated ones and ignoring the others. This can compromise robustness and stability in case of high feature variability. Furthermore, when the number of parameters is larger than the number of samples, Lasso regularization selects at most n features, where n is the number of samples. Thus, in a genomic dataset, the number of selected genes is bounded by the number of samples, introducing additional bias.
Finally, in the case of elastic net (or combined L1-L2 regularization), the penalization term becomes:

λ₁‖θ‖₁ + λ₂‖θ‖₂²

where λ₁ and λ₂ are the Lasso and Ridge regularization hyperparameters, respectively. This combined regularization overcomes the flaws of the previous ones, since the L1 part of the penalty generates a sparse model with effective feature selection, whereas the quadratic L2 part removes the limitation on the number of selected features, encourages a grouping effect and stabilizes the L1 regularization path.
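The three penalization terms can be written compactly as (a NumPy sketch with our own function names):

```python
import numpy as np

def ridge_penalty(theta, lam):
    """Weight decay / L2: lam * ||theta||_2^2 (shrinks, never zeroes)."""
    return lam * np.sum(theta ** 2)

def lasso_penalty(theta, lam):
    """Lasso / L1: lam * ||theta||_1 (drives some parameters to zero)."""
    return lam * np.sum(np.abs(theta))

def elastic_net_penalty(theta, lam1, lam2):
    """Elastic net: lam1 * ||theta||_1 + lam2 * ||theta||_2^2."""
    return lam1 * np.sum(np.abs(theta)) + lam2 * np.sum(theta ** 2)
```

In practice these penalties are not computed by hand but selected through a library option (e.g., scikit-learn's `LogisticRegression` exposes `penalty="l2"`, `"l1"` or `"elasticnet"`).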

S5.2 Classifiers under evaluation
The following subsections summarize the characteristics of the methods we implemented and tested.

S5.2.1 Ensemble methods: Multiclass Decision Forest and Decision Jungle
Ensemble models often provide better coverage and accuracy than a single classifier over complex tasks, combining several learners to manage and improve the bias-variance trade-off. In our Decision Forest algorithm, bagging is implemented by training each decision tree in the ensemble on a randomly drawn subset of the training set. Furthermore, each decision tree restricts to a fixed number the features randomly selected for splitting each node. This kind of approach, where random feature selection is combined with bagging, is also known as "random forest", from the trademarked algorithm of Leo Breiman and Adele Cutler. However, we refer to this approach as multiclass Decision Forest with bagging. The algorithm works by building a classification decision forest with multiple decision trees, such that each tree outputs a non-normalized frequency histogram of labels, and the most popular output class is then obtained by voting. The voting aggregation process sums these histograms and normalizes the result to get the "probability" of each label.
The trees that have high prediction confidence have a greater weight in the final decision of the ensemble to achieve higher classification accuracy. Decision trees can represent non-linear decision boundaries and are non-parametric models, thus the sample population is not required to fit any parametrized distribution. Furthermore, they are efficient in computation and memory usage and perform integrated feature selection that makes them resilient in the presence of noisy features. All these peculiarities are shared with Decision Jungle, used as an alternative bagging-based ensemble method. Unlike conventional decision trees that only allow one path to every node, Decision Jungles are compact and powerful discriminative models for classification, where directed acyclic graphs (DAGs) allow multiple paths from the root to each leaf. Specifically, decision jungles use node merging as well as node splitting algorithms to jointly optimize both the features and the structure of the DAGs efficiently, while the weighted sum of entropies at the leaves is minimized during training. In the end, compared to decision forests, decision jungles require dramatically less memory while, usually, considerably improving generalization.
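For illustration, the bagging-plus-random-feature-selection scheme described above corresponds to a random forest; a minimal sketch with scikit-learn follows (toy data and variable names are ours; scikit-learn offers no decision jungle, so only the forest part is shown):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy expression matrix and labels, standing in for real subtyping data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# bootstrap=True gives bagging; max_features limits the features
# randomly considered at each split, as in the description above.
forest = RandomForestClassifier(n_estimators=200,
                                max_features="sqrt",
                                bootstrap=True,
                                random_state=0).fit(X, y)

# Aggregated, normalized per-tree label histograms ("probabilities").
proba = forest.predict_proba(X)
```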

S5.2.2 Multiclass Logistic Regression
Logistic Regression is a classification method that uses the logistic sigmoid function on a linear combination of the features x, weighted by a parameter vector w, to estimate the posterior probability that a sample belongs to a class C:

p(C|x) = σ(wᵀx) = 1 / (1 + exp(−wᵀx))

This simple approach is designed for binary classification, where the alternative class probability is just the complement of the one found. The multiclass version, instead, is also known as softmax Logistic Regression, since the posterior probabilities are given by a softmax transformation of the activations:

p(C_k|x) = exp(a_k) / Σ_j exp(a_j)

where each activation a_k = w_kᵀx is a linear combination of the features x, weighted by a parameter vector w_k specific to each class C_k. The softmax transformation squeezes the largest exponential toward one, while pushing the smallest ones toward zero. Finally, the predicted class is the one with the highest probability. For an N-dimensional feature space, this model has N adjustable parameters for each class, to be computed through maximum likelihood estimation by minimizing the following cross-entropy error function for the multivariate case:

E(W) = − Σ_n Σ_k t_nk · ln(y_nk)

where y_nk is the probability that the n-th training sample belongs to the k-th class, according to the logistic function estimate; T is, instead, the matrix of true labels, such that t_nk is zero for each class except the one the n-th training sample belongs to, whose value is one. We used this method with a combined L1-L2 regularization (elastic net) and therefore, to train the model, the following additional penalization term was added to the reported loss function.
P(w) = λ1 ‖w‖_1 + λ2 ‖w‖_2^2

S5.2.3 Multiclass Neural Network

A Feed-Forward artificial Neural Network (FFNN) is a non-linear model characterized by the number of neurons (nodes), their topology, their activation functions and the values of synaptic weights and biases. It includes a set of interconnected layers. The inputs form the first layer and are connected to an output layer by an acyclic graph comprised of weighted edges and nodes, i.e., neurons. Multiple hidden layers can be inserted between the input and output layers but, theoretically, for inputs from a compact set, a single hidden layer with enough non-linear activation functions can model almost any non-linear function. In fact, a FFNN provides a linear combination of non-linear activation functions for each neuron along the feed-forward path, since all nodes in a layer are connected by weighted edges to the nodes in the next layer (Figure S16). Let us consider D inputs, M hidden neurons (on a single hidden layer) and K outputs, with f as activation function for the hidden neurons and g as activation function for the output neurons. Let w_11 … w_DM be the input-hidden layer weights, and let v_11 … v_MK be the hidden-output layer weights.
The generic k-th output will be:

y_k = g( Σ_{j=1..M} v_jk f( Σ_{i=1..D} w_ij x_i ) )

Figure S16. Multi-output Feed-Forward Neural Network.

Most predictive tasks can be accomplished easily with at most two hidden layers, whereas deep neural networks with many layers can be very effective in complex tasks, including automated feature extraction steps. However, in our case, we decided to evaluate a fully connected Feed-Forward Neural Network with a single hidden layer and sigmoid activation functions, both for the hidden and output layers. Each output neuron is associated with a single class to evaluate the probability of belonging to that class; then, the class with the highest probability is the predicted one for the classification task. The relationship between inputs and outputs is learned by training the FFNN on normalized input data. The cross-entropy error function E(W) is adopted as loss function, with weight decay (L2-regularization), and backpropagation with learning rate η and momentum α is used as the rule to iteratively learn the parameters of the model, i.e., the weights associated with each edge of the network. In particular, the momentum term is added to avoid oscillation and the risk of local minima. This means for a generic parameter w that its update at time k+1 is:

Δw(k+1) = -η ∂E/∂w + α Δw(k)

Cross-validation is employed as tuning technique for model selection, to find a proper number of hidden neurons, as well as a suitable L2-regularization hyperparameter, learning rate η, and momentum α.

S5.2.4 One-vs-All multiclass Support Vector Machine

Support Vector Machines (SVMs) have been extensively used in the classification of gene expression data with thousands of features and limited sets of samples, even if they were originally designed for binary classification tasks. In fact, the basic concept behind SVM binary classification for linearly separable data is finding, among the infinite linear boundaries (called hyperplanes) that can separate the data, the optimal hyperplane that maximizes the margin from the nearest points of each class.
Nevertheless, not every dataset is linearly separable, and SVMs are able to classify also complex and nonlinear data by taking advantage of the "Kernel Trick" to move toward higher dimensional spaces. Specifically, an SVM classifier can use a kernel function to nonlinearly transform the data into a higher dimension in which the problem is reduced to the linear case. Commonly used kernel functions include Polynomial, Gaussian radial basis function (RBF), or sigmoid functions. Hence, when the target values t_i are in {-1, 1} and S is the set of indexes of the support vectors with their associated parameters α_i, the class prediction for an unseen sample x is computed as:

y(x) = sign( Σ_{i∈S} α_i t_i K(x_i, x) + b )

where Σ_{i∈S} α_i t_i K(x_i, x) + b = 0 is the equation of the hyperplane able to separate the data accurately.
However, selecting the kernel function alongside its parameterization can be challenging during the model selection phase. Furthermore, handling noisy data requires defining a non-perfectly separating hyperplane that nonetheless minimizes the classification error, without increasing model complexity and introducing the risk of overfitting. Eventually, SVMs can be extended to deal with multi-class classification problems, as we did, i.e., using one SVM for each class in combination with an approach called One-vs-All. For each SVM, a given class is fitted against the rest of the classes, all combined together. A prediction is then performed by running all these binary classifiers and choosing the prediction with the highest confidence score. In our implementation, each SVM also includes Lasso regularization, to reduce the risk of overfitting by performing embedded feature selection.
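The One-vs-All scheme with L1-regularized linear SVMs can be sketched as follows (a minimal scikit-learn illustration on synthetic data; the actual models were built in Azure Machine Learning Studio, so names and parameters here are assumptions for demonstration only):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for the expression data: 300 samples, 40 features, 5 classes
X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           n_classes=5, random_state=0)

# One binary SVM per class (that class vs. all others combined);
# Lasso (L1) regularization performs embedded feature selection
ova = OneVsRestClassifier(LinearSVC(penalty="l1", dual=False, C=1.0,
                                    max_iter=10000))
ova.fit(X, y)

# Prediction picks the binary classifier with the highest confidence score
scores = ova.decision_function(X[:1])
print(scores.shape)  # one confidence score per class
```
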

S5.3 Training phase and classifier survey
For the training phase, the TCGA training set was split into 10 folds, randomly drawn and balanced with respect to subtypes. Given the generic classifier under study, each specific model, tuned with a specific combination of hyperparameters, was trained with 10-fold cross-validation, i.e., 10 times and each time over a different set of 9 folds, leaving the 10th fold out as validation set. Thus, each tuned specific model learned its optimal parameter values over 10 slightly different training sets and could estimate its generalization accuracy, each time over a set of samples held out from that round of training. We repeated this training phase for each family of classifiers under study and found the best-trained models, i.e., the models whose hyperparameter setting and learned parameter values led to the best generalization accuracy, estimated through cross-validation. Further details about the machine learning approaches under study are summarized in Table S4, whereas Figures S17-S21 report the performances on the TCGA test set of each assessed best-trained model from each family of classifiers.
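In scikit-learn terms, this tuning-plus-training scheme corresponds to a stratified 10-fold cross-validated grid search (a sketch on synthetic data, using elastic-net logistic regression as the example classifier; the original survey ran in Azure Machine Learning Studio, and the hyperparameter grid below is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           n_classes=5, random_state=0)

# Folds randomly drawn but balanced with respect to class labels
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Each hyperparameter combination is trained 10 times, leaving one fold
# out each time to estimate generalization accuracy
grid = GridSearchCV(
    LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000),
    param_grid={"C": [0.1, 1.0], "l1_ratio": [0.2, 0.8]},
    cv=cv, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```
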

S5.4 Feature selection strategies
Starting from the most promising supervised method, i.e., regularized multiclass Logistic Regression, several additional feature selection strategies were assessed, beyond simply considering the PAM50 genes already involved in PAM50-based subtyping. The aim was to reduce the feature space of interest compared to the whole set of profiled genes and possibly improve the performance of the learner in the intrinsic subtyping task. First, these approaches were applied to the RSEM profiles of the TCGA dataset, for both training and testing. Then, some relevant feature spaces were also employed to train and test the multiclass Logistic Regression on the FPKM data of the GSE96058 dataset. Eventually, both the PanCa and GSE81538 datasets were used as external test sets for the corresponding models trained, respectively, on the RSEM profiles of the TCGA dataset or on the FPKM profiles of the GSE96058 dataset. Further details about all the assessed feature selection methods are briefly discussed in the following subsections.

S5.4.1 Blind approach
At first, we tried a typical feature selection approach for genomic data that works completely blindly with respect to the predictive task and depends only on the magnitude and variability of the collected expression values. This approach simply excluded the lowest-expressed genes and the genes with the lowest variability, since they often represent less reproducible or potentially less useful features when working across multiple sequencing experiments. Therefore, specifically, we reduced the initial dimensionality of the TCGA dataset by discarding:
• Genes whose average expression across the 817 samples was in the 1st quartile of the distribution of the mean expressions of all the sequenced genes (exclusion of low-expressing genes);
• Genes whose standard deviation, relative to the mean expression of the gene in the dataset, fell within the 1st quartile of the standard deviations of all the sequenced genes (exclusion of genes with low variability).
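The two quartile-based filters above can be sketched with NumPy (a minimal illustration on a hypothetical expression matrix; whether the variability filter used the plain standard deviation or the deviation relative to the mean is ambiguous in the description, so the plain standard deviation is assumed here):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical log2 expression matrix: 100 samples x 2000 genes
expr = rng.gamma(shape=2.0, scale=2.0, size=(100, 2000))

means = expr.mean(axis=0)
stds = expr.std(axis=0)

# Keep genes above the 1st quartile of mean expression (drops the
# low-expressing genes) AND above the 1st quartile of standard
# deviation (drops the low-variability genes)
keep = (means > np.quantile(means, 0.25)) & (stds > np.quantile(stds, 0.25))
filtered = expr[:, keep]
print(filtered.shape[1], "genes kept of", expr.shape[1])
```
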
After this feature selection step, the genes for each sample were almost halved, leaving 10,606 features in the training set (instead of 19,737). However, after tuning and training (exactly as in the previous analysis) a regularized multiclass Logistic Regression over this reduced training set, the cross-validated accuracy was lower than the previous estimate. We then verified that the new best-trained model also performed worse on the test set, with an accuracy of around 0.81 compared with the 0.85 achieved by the best model trained on the entire original feature space. Hence, this choice led to disappointing results. Indeed, the high sensitivity and accuracy of RNA-sequencing quantitative measurements may make low-expressed genes relevant for our subtyping task as well, although such genes are excluded from classification methods based on the expertise gained with older sequencing technologies.

S5.4.2 Filter based approaches
Among the assessed external strategies, not involving Logistic Regression during the feature selection phase, we considered several filter-based methods, since they are efficient in computation time and robust to overfitting. Filter-based methods use a statistical measure to assign scores to features, which are then ranked accordingly and either kept or removed from the feature space. In supervised tasks, each feature is evaluated with respect to the target to be predicted, but regardless of the chosen model and without considering relationships among features, with some risk of redundancy. We thus filtered out a number of features presumed to be less meaningful, or even irrelevant, for our subtyping task according to the following scoring metrics: 1) Fisher scores; 2) Mutual Information; 3) Chi-squared scores; 4) Spearman Correlation. In detail, we proceeded with the following workflow for each scoring metric. We started from the complete TCGA training set and collected 5 wide random samplings with replacement, stratified with respect to subtypes. For each sampling, we used the given metric to score all the 19,737 original genes independently of the others, where the score of each gene estimates its relationship with the Ciriello et al. (2015) target subtype to be predicted. Hence, for each scoring metric we obtained 5 rankings (one for each sampling) and, consequently, 5 scores for each gene with the given metric. The next step was simply to sum, into a single overall gene score, the 5 scores of each gene within a given metric, and then to build the overall ranking for that metric. Notice that the overall ranking of each scoring metric is independent of the others. Using an overall ranking improved the robustness of the genes that emerged as the most relevant ones for a given metric, since the first positions include only genes meaningful across all the 5 samplings scored with that metric.
Subsequently, we used the top 1000 genes of each overall ranking (i.e., of each scoring metric) as alternative reduced feature spaces, since the non-null weights (and thus non-null features) learned by the regularized multiclass Logistic Regression trained on the complete TCGA training set of 19,737 features/genes numbered approximately one thousand.
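The resampling-and-ranking workflow above can be sketched with scikit-learn, using Mutual Information as the example scoring metric (synthetic data; the sampling sizes, seeds, and the final cutoff of 20 genes are illustrative stand-ins for the 5 wide samplings and the top-1000 cutoff of the actual study):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.utils import resample

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           n_classes=5, random_state=0)

total = np.zeros(X.shape[1])
# 5 wide random samplings with replacement, stratified by class
for seed in range(5):
    Xb, yb = resample(X, y, replace=True, stratify=y, random_state=seed)
    # Score every gene independently against the target subtype,
    # then accumulate into a single overall score per gene
    total += mutual_info_classif(Xb, yb, random_state=seed)

# Overall ranking: genes meaningful across all 5 samplings rise to the top
ranking = np.argsort(total)[::-1]
top_k = ranking[:20]
print(top_k[:5])
```
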

S5.4.3 DEG-based approach: limma analysis
We analyzed the whole original feature space of 19,737 genes using limma, an R package for the analysis of gene expression data (from microarray or RNA-seq), whose core capability is the use of linear models to assess differential expression in the context of multifactor designed experiments. Specifically, by setting only the Ciriello et al. (2015) target subtype as variable in the experimental design, we noticed that the Basal class dominates, because its expression profiles are more easily recognizable and different from those of the other classes, even more than the profiles of the other subtypes differ from each other. Thus, using the F statistic to rank the genes would mainly favor the distinction of the Basal class from the other subtypes. To overcome this issue, we defined all the 10 possible contrasts between the 5 Breast Cancer subtypes, making pairwise differential analyses. Then, having fixed an integer N, we selected genes according to the following criterion: for each contrast (i.e., for each pair of subtypes), given M genes differentially expressed according to the Sequencing Quality Control Consortium (SEQC) criterion (SEQC Consortium, 2014) (i.e., |log(Fold Change)| > 1 and p-value < 0.01), the top N genes according to the p-value were chosen if M > N, otherwise all M differentially expressed genes were selected. In this way, we obtained 10 lists, each one including the top N genes (or all the M genes, when M < N) differentially expressed in a given pairwise contrast. For each considered choice of N, through the union of the corresponding 10 lists, we obtained a complete set of genes to be selected as feature space to train our regularized Logistic Regression. This criterion guaranteed that each possible so-called limmaN feature space certainly includes the first N (or all the M) genes that emerged as differentially expressed for each contrast, and therefore relevant to distinguish at least one pair of subtypes.
Notice that, as before, we tuned and trained our regularized Logistic Regression on the alternative feature spaces obtained for 11 different N values, ranging from 10 to 1000 (i.e., 10, 30, 50, 60, 70, 80, 90, 100, 200, 500, 1000).
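The limmaN selection criterion can be sketched in Python, assuming the per-contrast differential-expression tables (log fold change and p-value per gene) have already been produced, e.g., by limma in R; the table below is a hypothetical toy example, not real data:

```python
import pandas as pd

def limmaN_signature(contrast_tables, n):
    """Union, over all pairwise contrasts, of the top-n DEGs per contrast.

    contrast_tables: dict mapping a contrast name to a DataFrame with
    columns 'gene', 'logFC', 'pval' (one row per gene).
    """
    selected = set()
    for name, tab in contrast_tables.items():
        # SEQC criterion: |log(Fold Change)| > 1 and p-value < 0.01
        degs = tab[(tab["logFC"].abs() > 1) & (tab["pval"] < 0.01)]
        # Top n genes by p-value (all of them when fewer than n pass)
        selected |= set(degs.nsmallest(n, "pval")["gene"])
    return selected

# Hypothetical table for one of the 10 pairwise subtype contrasts
tab = pd.DataFrame({"gene": ["g1", "g2", "g3", "g4"],
                    "logFC": [2.1, -1.5, 0.3, 1.2],
                    "pval": [0.001, 0.004, 0.2, 0.05]})
sig = limmaN_signature({"Basal_vs_LumA": tab}, n=2)
print(sorted(sig))  # ['g1', 'g2']
```
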

S5.4.4 Feature selection strategy with a wrapper method
As the last step of our study, we assessed whether the prediction performances of the regularized multiclass Logistic Regression could be further improved by additionally using a wrapper method. Wrapper methods treat feature selection as a search problem where different combinations of features are evaluated and compared based on the performances of the chosen model, also detecting possible feature interactions to avoid redundancy. However, the main limitation of these methods is their high computational cost, both in terms of time and memory, which makes them computationally unaffordable for feature spaces as wide as those of RNA-seq datasets. Therefore, we implemented a strategy involving a wrapper method, but starting from some promising reduced feature spaces, like the ones previously obtained from Chi-squared based or DEG-based filtering. Sequential backward elimination was chosen as the heuristic method to find reduced subsets of relevant features, since it appeared to us less prone to underfitting than a forward selection of features, considering what we experienced in multiple assessments of the same task performed over smaller trial feature spaces. The backward elimination algorithm starts by estimating the performance of the learner trained over the whole N-dimensional training set. Then, it considers and estimates the accuracy of all the subsets with N-1 features.
After choosing the subset whose estimated accuracy is the highest, it accordingly excludes the feature not belonging to that subset. In case of ties, the elimination choice is sequential. Thus, the method iteratively discards one gene at a time until no further feature elimination improves the accuracy of the regularized multiclass Logistic Regression beyond a fixed gain threshold. In our study, we first applied our wrapper method over the top 1000 Chi-squared based genes; although multiple runs were unfeasible for computational reasons, this still-coarse approach, combining regularized Logistic Regression, filter-based feature selection and the additional backward elimination, appeared worthy of further investigation, with a generalization accuracy estimated through 5-fold cross-validation of about 0.95. Therefore, also considering that the number of genes preserved by this first trial was about one hundred, we chose as starting feature spaces for further attempts of the combined strategy the top 200 and top 500 Chi-squared-based genes and our limma50 signature. For each alternative starting feature space, we carried out in parallel ten independent runs of backward elimination, performing each run with randomized feature order to mitigate the bias introduced by the sequential gene scrolling (solvable only through an unfeasible exhaustive search). Since the genes kept in each run were not robust (also because of the needed feature shuffling), we eventually combined all the genes kept in at least one run, with a sort of downstream preservation strategy, to address the lack of robustness observed across runs, while trying to increase the subtyping capability of the collected subsets. Each gene signature preserved from the three corresponding starting feature spaces was used in turn as a reduced feature space.

Implementation of the wrapper method
To implement the strategy involving a wrapper method, we could not customize the Azure Machine Learning Studio workflow with R scripts, due to the complexity of the investigation. Therefore, we ran the R code implementing the wrapper method on a server, extracting the data of interest through the AzureML library for R. Furthermore, we took advantage of several functions provided by LiblineaR, a package for the estimation of predictive regularized linear models for classification and regression, and by mlr, a complete framework for machine learning experiments in R. We used the function makeFeatSelWrapper from the mlr library, providing as parameters: 1) a regularized multiclass Logistic Regression as learner for the subtyping task (defined using LiblineaR library functions); 2) 5-fold cross-validation as the approach for evaluating the performances over the investigated subsets of features; 3) the overall accuracy as the measure for performance assessment; and 4) the feature selection strategy. The latter was specified through the makeFeatSelControlSequential function, together with its parameter indicating the choice of a backward elimination strategy. According to backward elimination, the feature selection algorithm starts by estimating the performances of our learner trained over the whole-feature training set, i.e., the top N genes, and iteratively discards one gene at a time, provided that our learner, trained over a subset excluding that gene, shows higher estimated accuracy.
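An analogous wrapper setup can be expressed with scikit-learn's SequentialFeatureSelector (the actual implementation used mlr and LiblineaR in R, as described above; this Python sketch is only an illustration of the backward-elimination wrapper, and its fixed target of 10 features stands in for the threshold-based stopping rule of the real workflow):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=6,
                           n_classes=5, random_state=0)

# Regularized multiclass logistic regression as the wrapped learner
learner = LogisticRegression(max_iter=2000)

# Backward elimination: start from all features and iteratively drop the
# one whose removal best preserves 5-fold cross-validated accuracy
sfs = SequentialFeatureSelector(learner, direction="backward",
                                n_features_to_select=10,
                                scoring="accuracy", cv=5)
sfs.fit(X, y)
print(sfs.get_support().sum(), "features kept")
```
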

S6 Performances of the main Logistic Regression models under evaluation
To ease access to and comprehension of the relevant collected results, all the confusion matrices concerning the different regularized multiclass Logistic Regression (mLR) models trained on the TCGA training set, or on the GSE96058 training set, and evaluated on the available internal and external test sets (PanCa and GSE81538) are reported below in Figures S22-S36, together with some other relevant plots. Finally, Tables S5 and S6 summarize the performances reached in cross-validation and internal testing by the mLRs using the most relevant emerged signatures as feature spaces, meant respectively for RSEM or FPKM data.
Notice that for each confusion matrix the following are clearly specified: the training set on which the model was trained, the feature space of interest, the kind of data it is intended for (RSEM or FPKM), and the test set on which the reported subtyping results were evaluated. The feature spaces of interest range from the whole sets of profiled genes to relevant signatures found with our feature selection study. Additionally, as mentioned in the main paper, we also used the already known PAM50 genes as feature space for mLRs; the so-obtained results were useful both for comparison with the AWCA-based PAM50 classifications and as a benchmark to better interpret and evaluate the performances reached with our alternative gene signatures of interest.
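Results of this kind can be reproduced from subtype calls with a standard confusion matrix (a minimal scikit-learn sketch; the calls below are hypothetical, not taken from the study's datasets):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

subtypes = ["Basal", "Her2", "LumA", "LumB", "Normal"]
# Hypothetical published calls vs. mLR calls for six samples
published = ["Basal", "LumA", "LumA", "Her2", "LumB", "Normal"]
predicted = ["Basal", "LumA", "LumB", "Her2", "LumB", "Normal"]

# Rows: published subtype; columns: predicted subtype
cm = confusion_matrix(published, predicted, labels=subtypes)
acc = accuracy_score(published, predicted)
print(cm)
print(round(acc, 3))  # 5 of 6 calls concordant
```
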
Although regularized mLRs reached slightly lower accuracies compared to the here proposed AWCA version (which includes the Normal-like class) of the PAM50 method, the latter is biased by the PAM50 nature of the published subtypes, assigned via PAM50 assays and thus using the same genes of interest and the same original centroids developed by Parker et al. (2009). Using PAM50 genes with mLR models brought valuable performances, despite the emerged ambiguity of the PAM50 labels themselves, and set benchmark results, considering the same signature involved in the original PAM50 method. Finally, even if slightly inferior, the performances of the limma50 and limma50_BWE based regularized mLRs are also interesting and worthy of further investigation and evaluation.
In particular, note in Table S8 that, for both the limma50 and limma50_BWE based mLRs intended for RSEM gene expression data, the accuracies reached on the internal TCGA test set and on the corresponding PanCa dataset used for external testing are uniform; the pair of mLR classifiers developed for FPKM data showed instead higher results overall, with slightly lower accuracies on the external GSE81538 test set than on the internal GSE96058 test set. Such only marginal degradation on the unknown and independent samples of the external test set is usual for a good classifier, which is in any case able to tackle the overfitting risk during training. Finally, with each of the mLRs we experienced meaningful and quite comparable performances, which confirm the quality of the single-sample intrinsic subtype callers developed with regularized Logistic Regression models and their reliability when dealing with unknown samples.
Notably, at https://github.com/DEIB-GECO/BC_Intrinsic_subtyping, we provide the R code needed to use the limma50 and limma50_BWE based classifiers to classify external samples subject to RSEM or FPKM normalization.

S7.1 Robustness and concordance evaluation between main intrinsic subtyping approaches

Up to this point, we mainly used the accuracy with respect to the published calls to evaluate our subtyping approaches; nonetheless, although these concordances with the published calls show classification stability, other assessments must also be highlighted to demonstrate more clearly the robustness of the here proposed single-sample approaches, i.e., the AWCA-based PAM50 classification and the most promising mLR classifiers. In particular, Table S9 clearly highlights the increased robustness of the AWCA-based PAM50 compared with the standard one; conversely, for the mLR approaches, which are perfectly repeatable, Table S10 reports the mean pairwise concordance of the classifications provided by the PAM50, limma50 and limma50_BWE mLRs among themselves and with the AWCA-based PAM50 one in each internal/external test set. The three mLR classifiers provided high classification robustness in each test set, despite considering different feature spaces. Moreover, comparing these three mLR-based classifications with the AWCA-based PAM50 subtyping, we found several cases of full agreement among them that are discordant with the published subtypes; hence, mLRs appear robust also in classifying ambiguous cases, even though the published labels were used for their training. Furthermore, mLRs may be further enhanced in the future through a greater number of training samples and more reliable labels, such as those from AWCA-based subtyping.
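Mean pairwise concordance between classifiers can be computed directly from their subtype calls (a minimal sketch with hypothetical calls for five samples; the classifier names mirror those discussed above but the values are invented for illustration):

```python
import itertools

# Hypothetical subtype calls from three classifiers on the same five samples
calls = {
    "mLR_PAM50":       ["LumA", "LumB", "Basal", "Her2", "LumA"],
    "mLR_limma50":     ["LumA", "LumB", "Basal", "Her2", "LumB"],
    "mLR_limma50_BWE": ["LumA", "LumB", "Basal", "Her2", "LumA"],
}

def concordance(a, b):
    # Fraction of samples on which two classifiers agree
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Average concordance over all classifier pairs
pairs = list(itertools.combinations(calls, 2))
mean_pc = sum(concordance(calls[a], calls[b]) for a, b in pairs) / len(pairs)
print(round(mean_pc, 3))
```
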

S7.2 Prognostic assessment
Finally, we performed further analyses to evaluate the improved or reduced ability of the proposed classifiers to identify cases with better/worse prognosis, considering the 10-year overall survival annotations available for the datasets at our disposal. Specifically, we considered that each subtype call discordance between Luminal A and another subtype (or vice versa) implies a different expected prognosis. Figure S37 shows the increased prognostic ability of our single-sample classifiers in such prognosis-related discordant cases compared with the published calls. Thus, in case of discordances, probably due to the already mentioned ambiguity of subtype calls for samples of more difficult attribution, these clinical assessments provided an unbiased criterion to evaluate alternative classifications. In particular, the emerging results confirmed the reliability of the approaches proposed in this work, in the light of their better capability of recognizing good and poor long-term clinical outcomes.
Table S12 summarizes the results of using AWCA references built on RSEM data as external AWCA references for PAM50 classification of FPKM data, and vice versa. Comparing these results with those in Table S8 for the AWCA reference with the same normalization as the classified data, we notice the decreased stability of the so-obtained classifications. Thus, for the single-sample AWCA-based PAM50 method we strongly encourage the use of the correspondingly normalized AWCA reference, so as not to undermine the robustness gain provided by the AWCA approach. Such coherence becomes crucial for the already trained logistic regression models, as clearly exemplified in Table S13; we also evaluated strategies to scale and/or transform the differently normalized public data at our disposal, but performances did not improve, probably because of the introduced approximations and bias.

AWCA-based PAM50: an additional use-case on microarray data

We tested the AWCA-based PAM50 method on microarray expression data from Affymetrix U133A-B chips. Our test was easily performed using the R code we made available on GitHub (https://github.com/DEIB-GECO/BC_Intrinsic_subtyping); accordingly, we also provide a useful conversion table to trace the PAM50 genes in U133A-B and also in U133 PLUS 2.0 GeneChips. For this additional analysis, we used two public GEO datasets (GSE4922 and GSE1456). For both datasets, raw data were normalized using the global mean method. Probe-set signal values were natural-log transformed and scaled by adjusting the mean intensity to a target signal value. We simply converted natural-log values into log2-transformed values and selected, for each PAM50 gene, the most specific available corresponding probe. In case of ties, we selected the probe with the highest mean signal and variance. The GSE4922 dataset includes 249 samples, without any subtype annotation, and is unbalanced in terms of ER status distribution (211 ER+/34 ER-).
Therefore, we compared AWCA-based classifications with the corresponding standard PAM50 classifications using 60% ER+/40% ER- subsets of 50 samples. Concordance evaluations show that our AWCA-based approach outperforms the standard one, with a mean concordance of 96% versus 88% and a lower standard deviation of only 2.0% versus 5.7%. The GSE1456 dataset includes 139 samples, 119 of which have an assigned intrinsic subtype. Nonetheless, no ER status annotation is available to repeat the comparison of standard PAM50 classifications with corresponding AWCA-based PAM50 classifications. Therefore, we exported an AWCA reference built on the GSE4922 dataset to the GSE1456 dataset, to be used as external reference for the subtyping of the GSE1456 dataset. This brought lower concordance with respect to the original calls (80/119); nonetheless, from a careful evaluation of our subtyping results against the available 7-year disease-free survival annotations, we found that our calls were much more reliable in predicting the medium-to-long-term clinical outcome than the published calls. Indeed, among all cases of discordance implying changes in expected prognosis, the original classification correctly predicts only 11 cases, whereas our external AWCA-based subtyping provides 37 prognostically reliable subtype calls. Thus, for ambiguous/discordant samples, AWCA-based PAM50 proved to be more than 3 times better at recognizing good or poor prognoses within 7 years. Overall, both internal and external AWCA-based PAM50 classifications provided higher stability and reliability of subtyping, as well as a substantial increase in prognostic ability, compared with the standard PAM50, confirming the strength of our AWCA approach also on gene expression data from microarray technology.