Accurate prediction of breast cancer survival through coherent voting networks with gene expression profiling

For a patient affected by breast cancer, it is necessary to decide after tumor removal which adjuvant therapy can prevent tumor relapse and the formation of metastases. Predicting the outcome of adjuvant therapy tailored to the individual patient is hard, due to the heterogeneous nature of the disease. We devised a methodology for predicting 5-year survival based on the new machine learning paradigm of coherent voting networks, with improved accuracy over state-of-the-art prediction methods. The 'coherent voting communities' metaphor provides a certificate justifying the survival prediction for an individual patient, thus facilitating its acceptability in practice, in the vein of explainable Artificial Intelligence. The method we propose is quite flexible and applicable to other types of cancer.

Overlap of our fingerprints with previously published multi-gene fingerprints
A comparison of our gene panel with the 33 known multigene panels listed in [1] shows a minimal overlap of our fingerprints with previously described fingerprints for breast cancer. We have 1 gene overlapping with the 70 genes of the MammaPrint fingerprint (Mamma), and 1 gene overlapping with the 16 genes of the Oncotype DX fingerprint (RS).

Features of independent cohorts
Many BC data sets are available on the NCBI GEO repository. We selected 4 data sets using the following criteria. The data set should have a sufficient number of patients (at least 30 patients overall). The data set should have survival data (preferably overall survival) recorded explicitly for each patient. The data set should have an explicit adjuvant therapy record for each patient, or the same treatment should be declared common to all patients in the data set. We excluded data sets for which the treatment record for each patient could in principle be inferred from other records (e.g. hormonal status) but is not declared explicitly. Data from TCGA (The Cancer Genome Atlas, https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga) is not used, since the treatment status is known only for a few patients, and the TCGA records do not follow the same taxonomy as the METABRIC records.

Measures of performance
Evaluation of prediction performance. For a set of patients whose survival data is known, and for which a prediction has been computed, we have a confusion matrix with entries: 1. TP is the number of patients with actual short survival whose prediction is short survival.
2. TN is the number of patients with actual long survival whose prediction is long survival.
3. FP is the number of patients with actual long survival whose prediction is short survival.
4. FN is the number of patients with actual short survival whose prediction is long survival.
5. NR is the number of patients for which the method does not produce a prediction (no response).
Note that patients for which no prediction is given are not counted in any of the four standard categories (TP, TN, FP, FN); thus the number of "no response" cases (NR) is a further performance parameter to be considered. For the total number of patients examined TOT it holds:

TOT = TP + TN + FP + FN + NR.

The following performance functions are considered.

(Relative) Accuracy (Acc):

Acc = (TP + TN) / (TP + TN + FP + FN).

In situations where the number of patients is fixed and the number of "no answers" is within a small range, we can also use directly the absolute accuracy (TP + TN) as a figure of merit for ranking solutions.

Positive predictive value (PPV):

PPV = TP / (TP + FP).

Odds ratio (OR):

OR = ((TP + 1/2)(TN + 1/2)) / ((FP + 1/2)(FN + 1/2)).

Note that this formulation is always well defined even when FP or FN is zero. For OR we also report the 95% confidence interval and the corresponding Fisher's exact test p-value.

Cohen's Kappa (κ):

κ = (p_o − p_e) / (1 − p_e),
where p_o is the observed relative accuracy of the method, and p_e is the expected relative accuracy of a random predictor that uses the marginal probabilities. Specifically,

p_e = [(TP + FP)(TP + FN) + (TN + FN)(TN + FP)] / (TP + TN + FP + FN)².

Note that the value is undefined when p_e = 1, but this case can be attained only trivially, when the set of patients has only one class of survival. Cohen's Kappa, when positive, measures the gain in performance of the proposed classifier over a random classifier using the marginal probabilities, as measured by its expected accuracy.

We give here the details of the methodology for building, training, and using a novel classifier, which we name "Coherent Voting Networks" (CVN). For concreteness, we describe the technique in the context of the prognostic problem for post-surgery breast cancer patients, although its description and scope can be cast in more generic terms. Given a set of n patients P = {p1, ..., pn} and a set of k genes (mRNA) G = {g1, ..., gk}, we have at our disposal gene expression measurements (typically via high-throughput transcriptomics) for each gene and each patient, organized as an n × k matrix M(P, G). For simplicity, we assume a full matrix; however, imputation of missing values is not needed, as missing entries are just missing edges in the graph representation of the matrix. For each patient we also have the survival function S : P → {low-risk, high-risk}, where high-risk indicates survival below five years, and low-risk indicates survival above five years. The classification problem we solve is as follows: use the matrix M and the survival function S to train a classifier, so that for a new patient p′ for which the gene expression levels for a fingerprint G′ ⊂ G of the genes are known, we can infer the value of the survival function S(p′) with high accuracy.
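As a concrete illustration, the confusion-matrix measures defined above can be sketched in a few lines of Python. This is a minimal sketch, not the paper's code; the +0.5 Haldane-style correction on the odds ratio is one standard choice that keeps OR defined when FP or FN is zero, matching the note above.

```python
def cvn_metrics(tp, tn, fp, fn, nr=0):
    """Performance measures from the confusion-matrix counts.

    Patients with no prediction (nr) are excluded from the four standard
    counts, so the relative accuracy is taken over answered patients only."""
    answered = tp + tn + fp + fn
    tot = answered + nr                              # TOT = TP+TN+FP+FN+NR
    acc = (tp + tn) / answered                       # relative accuracy
    ppv = tp / (tp + fp)                             # positive predictive value
    # odds ratio with a +0.5 correction in each cell
    odds = ((tp + 0.5) * (tn + 0.5)) / ((fp + 0.5) * (fn + 0.5))
    p_o = acc                                        # observed accuracy
    # expected accuracy of a random predictor with the same marginals
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / answered ** 2
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else float("nan")
    return {"TOT": tot, "Acc": acc, "PPV": ppv, "OR": odds, "kappa": kappa}
```

For balanced counts such as TP = TN = 40 and FP = FN = 10 this gives Acc = PPV = 0.8 and κ = 0.6, illustrating the gain over the random predictor (p_e = 0.5).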
Our objective is to obtain simultaneously (a) a fingerprint G′ of small size, and (b) a good performance of the classifier, as measured in a train-validation-test evaluation pipeline. A train-validation-test pipeline, although different from ours, was used as the evaluation protocol of the Sage Bionetworks-DREAM Breast Cancer Prognosis Challenge (BCC) conducted between July 2012 and December 2012.

Coherent voting networks
The new classification method relies on the hypothesis that the input data (P, G, M, S) admits a combinatorial construction called a "coherent voting network". Such a construction is in general not unique; thus, among the possible constructions, we perform several optimizations and filtering steps to attain goals (a) and (b). Definition of a voting network. A "voting network" is composed of (i) a collection C of subsets of P, (ii) a first-level voting function F1 that maps any restriction of S to a value in {low-risk, high-risk, ⊥}, where ⊥ stands for an undefined label, and (iii) a secondary voting function F2, applied to the output of F1. For a patient p ∈ P, let C(p) be the sub-collection of sets in C that contain p (i.e. C(p) = {C ∈ C | p ∈ C}). Given p, for each set C ∈ C(p), the voting function F1 is evaluated on the restriction of S to C \ {p} (we do not consider p in the input, as a patient cannot vote for herself).
The prediction Pr(p) for p is obtained with the secondary voting function F2, applied to the values F1(C) for all C ∈ C(p).
The simplest implementation has both F1 and F2 as the majority function (returning ⊥ in case of a tie). However, more complex voting schemes are discussed below.
The voting network (C, F) is coherent for p if Pr(p) = S(p). We have a fully coherent voting network when (C, F) is coherent for each p ∈ P. We have an α-coherent voting network when (C, F) is coherent for at least a fraction α of the |P| patients. The reason why this construction is of interest is that, since the vote for a patient p does not depend on S(p), we can take Pr(p) as the prediction for p when S(p) is not known (or not disclosed to the algorithm). A voting network that is coherent on its labeled nodes, for a high value of α, has a good chance of being coherent also for its unlabeled nodes (there will be at most one unlabeled node at any time).
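A minimal sketch may help fix the ideas, with both F1 and F2 instantiated as the majority function as in the simplest implementation above (names are illustrative, not taken from the paper's code):

```python
from collections import Counter

UNDEF = None  # stands for the undefined label ⊥

def majority(labels):
    """Majority vote over the defined labels; returns ⊥ (None) on a tie
    or when no defined vote is present."""
    counts = Counter(l for l in labels if l is not UNDEF)
    if not counts:
        return UNDEF
    top = counts.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return UNDEF  # tie
    return top[0][0]

def predict(p, cover, survival):
    """Pr(p): apply F2 to the F1 votes of the communities containing p.
    p's own label is withheld, since a patient cannot vote for herself."""
    votes = [majority([survival[q] for q in c if q != p])
             for c in cover if p in c]
    return majority(votes)

def coherence(cover, survival):
    """Fraction alpha of labeled patients on which the network is coherent."""
    labeled = [p for p in survival if survival[p] is not UNDEF]
    hits = sum(predict(p, cover, survival) == survival[p] for p in labeled)
    return hits / len(labeled)
```

On a toy cover of two pure communities, every patient's withheld label is recovered by the votes of her community, so the network is fully coherent (α = 1).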

Construction of a Voting network from the input matrix
The first task is to remove from the matrix M the genes that do not have sufficient discriminative power, since for these genes random fluctuations and experimental noise (technical or biological) cover the main signal we wish to exploit. Initial statistical filter on the gene set. For a gene gi in G, consider the vector Vi(low-risk) of entries M(gi, p) such that S(p) = low-risk, and the vector Vi(high-risk) of entries M(gi, p) such that S(p) = high-risk. We apply to Vi(high-risk) and Vi(low-risk) a combination of three elementary statistical tests (t-test, Kolmogorov-Smirnov test, and Mann-Whitney U test), and a fold change test, with given fold change and p-value thresholds. Genes passing the tests are retained in the filtered set G′ ⊂ G, and a reduced matrix M(P, G′) is obtained. This phase requires the setting of a few parameters, including the type of test performed, the maximum p-value for accepting the test, and the accepted threshold on the fold change. Techniques for Differential Gene Expression analysis could be used here, such as ANOVA, or tools like DESeq2 [2] and edgeR [3]. However, this initial screening phase is intentionally quite light, as subsequent phases take care of further refining the gene set to attain the final fingerprint.
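The filtering step can be sketched as follows, using the three tests named above through scipy. This is an illustration only: the thresholds, and the fold change computed on group means, are placeholders for the paper's actual parameter settings.

```python
import numpy as np
from scipy import stats

def filter_genes(M, low_idx, high_idx, p_max=0.05, min_fold=1.5):
    """Keep genes whose expression separates the two survival groups.

    M is the (patients x genes) expression matrix; low_idx / high_idx
    index the low-risk and high-risk patients.  A gene is retained when
    all three tests (t, Kolmogorov-Smirnov, Mann-Whitney U) reach
    p <= p_max and the fold change of the group means is >= min_fold."""
    keep = []
    for j in range(M.shape[1]):
        lo, hi = M[low_idx, j], M[high_idx, j]
        pvals = (stats.ttest_ind(lo, hi, equal_var=False).pvalue,
                 stats.ks_2samp(lo, hi).pvalue,
                 stats.mannwhitneyu(lo, hi).pvalue)
        fold = max(lo.mean(), hi.mean()) / max(min(lo.mean(), hi.mean()), 1e-12)
        if max(pvals) <= p_max and fold >= min_fold:
            keep.append(j)
    return keep
```

A gene with well-separated group means passes all tests, while a gene whose means differ by only a few percent is rejected by the fold-change criterion regardless of its p-values.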
Gene expression level quantization. Gene expression measurements have a large dynamic range of values, and it is unlikely that two patients have the same expression level for a gene, also due to measuring errors. To define a discrete structure, we quantize the dynamic range of values of each gene into intervals, and we replace each entry in M with the corresponding interval. Intervals are induced by cut points on the real line. We set a minimum and a maximum number of cut points, and a minimum and a maximum percentage of patients in each partition induced by the cut points. We recursively apply cut-point selection methods derived from [4,5] until the limit on the number of cut points is reached, or some internal stopping criterion is met. This phase requires the setting of a few parameters, including the type of quantization objective function, the minimum and the maximum number of cut points selected, the minimum and maximum percentage of values in either interval generated by a split point, and the number of significant digits to be considered in the gene expression measurements. Also, such quantization functions need to be able to handle patients for which the survival function is undefined. Patient-gene-interval bipartite graph. After the quantization phase, each entry of the matrix M(P, G) is a pair (gene, interval on the real line). We thus build a bipartite graph in which there is a first class of vertices VP, labeled with the patients' ids. The second class of vertices is VG,I, where each node is labeled with a pair of identifiers: the first identifier indicates a gene in G, the second identifier an interval I. We set an arc from node p ∈ VP to node (g, i) ∈ VG,I if and only if M[p, g] = i. We call this bipartite graph BG(P, G, I).
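A compact sketch of quantization followed by bipartite-graph construction is given below. Equal-frequency quantile cuts stand in for the supervised cut-point selection methods of [4,5], which are not reproduced here; all names are illustrative.

```python
import numpy as np

def quantile_cut_points(values, n_cuts=2):
    """Illustrative stand-in for the cut-point selection of [4,5]:
    equal-frequency cut points on one gene's expression values."""
    qs = np.linspace(0, 1, n_cuts + 2)[1:-1]
    return sorted(set(np.quantile(values, qs)))

def bipartite_graph(M, patients, genes, n_cuts=2):
    """Build BG(P, G, I): an edge joins patient p to node (g, i) iff
    p's quantized expression of gene g falls in interval i."""
    edges = set()
    for j, g in enumerate(genes):
        cuts = quantile_cut_points(M[:, j], n_cuts)
        for k, p in enumerate(patients):
            # interval index = number of cut points below the value
            interval = int(np.searchsorted(cuts, M[k, j], side="right"))
            edges.add((p, (g, interval)))
    return edges
```

Each patient contributes exactly one edge per gene, so the graph has |P| × |G| edges, with the interval index encoding the quantized expression level.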
Dense community detection in bipartite graphs. Similarly to [6], we define a partial dense cover of radius 2 for the bipartite graph G(V1, V2) = BG(P, G, I). A partial dense cover is a collection of dense subgraphs of G with density above a threshold δ, with a number of nodes of type 1 above a minimum threshold q1 and a number of nodes of type 2 above a minimum threshold q2. We compute this dense cover using a variant of the Core & Peel algorithm in [6]. The Core & Peel algorithm is described in [6] for a general graph G; thus we adapt it to handle bipartite graphs properly, by changing accordingly the notion of density, and by imposing that a set in the cover has a minimum number of nodes in class V1 and a minimum number of nodes in class V2. In our application we have V1 = VP and V2 = VG,I in the bipartite graph BG(P, G, I). Note that each set of vertices in the cover will have both patient and gene-interval nodes. Let D = {D1, ..., Dm} be this dense cover. This phase requires the setting of the density threshold (as the percentage of required edges over the maximum possible number of edges in a complete bipartite graph with the same nodes; typical values range from 0.6 to 0.9), as well as the minimum thresholds q1 and q2 (usually set to 3). Determination of a voting network from the partial dense cover. We can obtain from D a voting network by removing from each Di ∈ D the non-patient nodes (equivalently, by projecting the sets in D onto the patient component).
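The bipartite density notion and a drastically simplified peeling step can be sketched as below. This is not the Core & Peel algorithm of [6], only an illustration of the density measure and of the minimum-size constraints q1 and q2:

```python
def bip_density(nodes1, nodes2, edges):
    """Density of a bipartite subgraph: present edges over the maximum
    possible |V1| * |V2| edges between the two sides."""
    if not nodes1 or not nodes2:
        return 0.0
    e = sum(1 for (u, v) in edges if u in nodes1 and v in nodes2)
    return e / (len(nodes1) * len(nodes2))

def peel_dense(nodes1, nodes2, edges, delta=0.6, q1=3, q2=3):
    """Simplified peel step: repeatedly drop the lowest-degree node until
    the density threshold delta is met; fail (None) if a side falls
    below its minimum size q1 / q2."""
    n1, n2 = set(nodes1), set(nodes2)
    while bip_density(n1, n2, edges) < delta:
        deg = {x: 0 for x in n1 | n2}
        for (u, v) in edges:
            if u in n1 and v in n2:
                deg[u] += 1
                deg[v] += 1
        victim = min(deg, key=deg.get)
        (n1 if victim in n1 else n2).discard(victim)
        if len(n1) < q1 or len(n2) < q2:
            return None
    return n1, n2
```

For example, a complete 3 × 3 bipartite core plus one pendant patient node has density 10/12 ≈ 0.83; at δ = 0.9 the pendant node is peeled away and the dense core remains.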

Sparsification of a partial dense cover
Community sparsification. Our objective in this phase of the algorithm is to obtain from BG a bipartite graph BG′ admitting a partial dense cover D′ that is similar to D on the patient component, but uses a much smaller set G″ of genes. To this end, we obtain from D a gene set system GSS by removing from each set in D the patient nodes and dropping the interval component of the remaining nodes. Next, we select a covering threshold c and apply a greedy set multi-cover algorithm [7] to GSS; this algorithm produces a set of genes G″ with the property that for each g ∈ GSS, either g ⊆ G″ or |g ∩ G″| ≥ c. Moreover, it has been proved that the number of genes in G″ is within a logarithmic factor of the smallest subset of G with this property. This phase requires the setting of one main parameter, the covering number c, typically a small number chosen between 3 and 7.
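The greedy set multi-cover step admits a very short sketch; `gss` is the gene set system and `c` the covering threshold (a generic greedy heuristic in the spirit of [7], not the paper's implementation):

```python
from collections import Counter

def greedy_multicover(gss, c=3):
    """Greedy set multi-cover: pick genes until every set in GSS is either
    fully contained in the picked set or meets it in at least c genes."""
    chosen = set()

    def satisfied(s):
        return s <= chosen or len(s & chosen) >= c

    while not all(satisfied(s) for s in gss):
        # count, for each unchosen gene, how many unsatisfied sets it helps
        gain = Counter(g for s in gss if not satisfied(s) for g in s - chosen)
        chosen.add(gain.most_common(1)[0][0])
    return chosen
```

Genes shared by many communities are picked first, which is what shrinks the gene set while preserving the cover on the patient side.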
Reduced patient-gene expression interval bipartite graph. We now compute a reduced bipartite graph BG(P, G″, I) using the previously described procedure on the reduced set G″, and we re-apply the community detection procedure to obtain the dense cover D′ and the corresponding voting network. We then test the coherence of the voting network to assess its property of being an α-coherent voting network (for one of the voting functions). If this is verified, this voting network can be used for classification purposes.
How to turn an α-coherent voting network into a classifier. Intuitively, we assume that the algorithm A that produces (C, F) starting from (P, G″, M, S) and a fixed vector of parameters is sufficiently stable under small perturbations of the input. For a new patient p, we apply the same construction to (P ∪ {p}, G″, M(P ∪ {p}, G″), S ∪ {p → ⊥}). We then check the coherence of the resulting output (C′, F′): the level of coherence α′ can be computed on all patients in P. We require α′ ≈ α in order to accept Pr(p) as the prediction for the new patient p, based on the coherence of the network on the primitive set of patients.
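The acceptance test can be sketched as follows; `build_cvn` is a hypothetical stand-in for the whole construction pipeline (statistical filter, quantization, dense cover, sparsification), assumed to return a prediction function together with the coherence level α on the labeled patients:

```python
def classify_new_patient(build_cvn, data, p_new, alpha_tol=0.05):
    """Add p_new with an undefined label, rebuild the network, and accept
    the prediction only if coherence on the original patients is preserved.

    build_cvn (hypothetical) maps a {patient: label} dataset to a pair
    (predict_fn, alpha): predict_fn(p) is the network's vote for p, and
    alpha the coherence on the labeled patients."""
    _, alpha = build_cvn(data)
    extended = dict(data)
    extended[p_new] = None                # S(p_new) = ⊥, not disclosed
    predict_fn, alpha_new = build_cvn(extended)
    if abs(alpha - alpha_new) <= alpha_tol:
        return predict_fn(p_new)          # accept Pr(p_new)
    return None                           # stability check failed: no response
```

The tolerance `alpha_tol` encodes the requirement α′ ≈ α; when the check fails, the patient is counted among the "no response" (NR) cases.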

Extending the repertoire of voting schemes
The primary voting function F1 is chosen among the following standard voting functions:
a) "unanimity": the voting function returns label X ∈ {low-risk, high-risk} if there is at least one patient with label X and all patients for which the label is not undefined have the same label X.
b) "qualified majority at threshold t": the voting function returns label X ∈ {low-risk, high-risk} if at least a fraction t of the patients with a defined label carry the label X.
c) "simple majority": the voting function returns label X ∈ {low-risk, high-risk} if there is at least one patient with label X and the number of patients with label X is larger than the number of patients with any other defined label.
d) "more than expected": the voting function returns label X ∈ {low-risk, high-risk} if the number of patients with label X is larger than the expected number of patients with label X in a random sample of patients from P of the same size, with p-value p ≤ 0.05. This voting function is important when the training data has unbalanced classes.
e) "weighting": given the survival function S : P → {low-risk, high-risk}, we compute a weighting function on the elements of {low-risk, high-risk} by counting the number of patients with a given label. Formally, the weight of label X is |{p ∈ P | S(p) = X}|.
The secondary voting function F2 is chosen among:
f) "simple majority", when the primary voting function is one of a), b), c), d).
g) "weighted majority": when the primary voting function is e), we sum the weights for each label and return the label with the highest cumulative weight.
h) "unanimity plus weighted majority": we apply the simple majority on the primary vote done with the "unanimity" function; if the result is defined we return it, otherwise we apply the "weighted majority" and return this value.
This phase requires the setting of one main parameter, namely the voting policy.
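Two of the less standard primary voting functions can be sketched as follows; scipy's `binomtest` provides the one-sided test needed by the "more than expected" policy, and the parameter names are illustrative:

```python
from collections import Counter
from scipy.stats import binomtest

def qualified_majority(labels, t=0.75):
    """Policy b): return X if at least a fraction t of the defined votes
    carry label X; ⊥ (None) otherwise."""
    counts = Counter(l for l in labels if l is not None)
    n = sum(counts.values())
    for x, k in counts.items():
        if n and k / n >= t:
            return x
    return None

def more_than_expected(labels, marginal, p_max=0.05):
    """Policy d): return X if the number of votes for X exceeds what a
    random sample drawn with the cohort marginals would give
    (one-sided binomial test, p <= p_max)."""
    counts = Counter(l for l in labels if l is not None)
    n = sum(counts.values())
    for x, k in counts.items():
        if binomtest(k, n, marginal[x], alternative="greater").pvalue <= p_max:
            return x
    return None
```

With unbalanced classes, `more_than_expected` can return the minority label even when it does not hold a majority of the votes, which is exactly its purpose.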

Method in detail: Training and optimization of Coherent Voting Networks
Train-validation-test evaluation pipeline. The METABRIC cohort of roughly 2000 patients is split randomly into three sets with proportions (1/2, 1/4, 1/4): a training set (roughly 1000 patients), a validation set (roughly 500 patients), and a testing set (roughly 500 patients). We remove patients for which the survival class cannot be determined. We further stratify each set of patients by the available therapy information. The adjuvant therapy information is whether the patient has received endocrine therapy (YES/NO), radiation therapy (YES/NO), or chemotherapy (YES/NO). This stratification produces eight subclasses of patients. In each subclass of patients, we check the ratio of the sizes of the two survival classes {high-risk, low-risk}. When the ratio of the largest to the smallest class is greater than 2.5, we use the equalized version of the data. The train-validation pipeline is applied for each class separately, on the corresponding three sets of patients. Let us denote the three sets of patients for a class as Train, Val, and Test.
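The split-and-stratify step can be sketched as below (function and variable names are ours, not from the paper's code; `therapy` maps a patient to its three YES/NO adjuvant-therapy flags):

```python
import random
from collections import Counter

def split_cohort(patients, therapy, seed=0):
    """Random (1/2, 1/4, 1/4) split into Train/Val/Test, then
    stratification by the therapy flags into (up to) 8 subclasses."""
    rng = random.Random(seed)
    ids = sorted(patients)
    rng.shuffle(ids)
    n = len(ids)
    split = {"Train": ids[: n // 2],
             "Val": ids[n // 2: 3 * n // 4],
             "Test": ids[3 * n // 4:]}
    strata = {}
    for name, group in split.items():
        for p in group:
            strata.setdefault((name, therapy[p]), []).append(p)
    return split, strata

def needs_equalization(labels, max_ratio=2.5):
    """True when the larger survival class outnumbers the smaller by more
    than max_ratio, in which case the equalized data version is used."""
    counts = Counter(labels).most_common()
    return len(counts) == 2 and counts[0][1] / counts[1][1] > max_ratio
```

For instance, a subclass with 6 low-risk and 2 high-risk patients (ratio 3 > 2.5) triggers equalization, while a 4-to-2 subclass does not.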
We aim at finding a small fingerprint, optimizing the CVN hyperparameters, and evaluating the performance of the chosen classifier. A configuration C is a pair (G″, Par), where G″ is a set of genes (as obtained by the sparsification procedure) and Par is a vector of parameters that are given as part of the input to the several phases of the algorithm. Each field of Par takes values in a domain of small discrete size. The space of all possible parameter vectors is denoted by PAR. Generation of fingerprints. For a given fixed vector of parameters Par ∈ PAR, we obtain from the input (Train, G, M(Train, G), S(Train)) a voting network and, after sparsification, a panel of genes G″. Repeating this operation for all the vectors of parameters in PAR, we obtain a pool of candidate gene panels (and the corresponding vectors of parameters and voting networks).
Post-training selection of fingerprints. We aim at reducing the number of candidate gene panels by retaining only a few high-quality candidates. We discard a candidate panel if it is too large (we apply a cutoff value of 20 genes). We discard a candidate panel if its quality parameter (α) is below a factor 0.85 of the top value of α among all the generated voting networks. We sort the fingerprints so filtered by the value of α of the associated voting network, and we take at most the top-ranking 30 fingerprints, after removing eventual duplicate candidate panels. Validation and testing. In this phase, we aim at finding a configuration (fingerprint and parameters) that balances the performance on the LOOCV (leave-one-out cross-validation) of the Train data with the performance on the Validation data. d) This reduced set is then re-split into Pareto strata, but the point associated with a configuration is now based on the LOOCV on the Train data of the corresponding configuration, and each stratum is sorted by the LOOCV score value. In a variant of this phase, the points may be constructed as the coordinate-wise arithmetic mean (or geometric mean) of the LOOCV points on the Train data and the corresponding points for the Validation data.
e) Following the order of the strata and the internal order within each stratum, we check the stopping criterion. The number of these checks is the lookahead number, since the stopping criterion involves both Validation and Test data.
f) We stop the lookahead procedure if the Test score value is larger than the Validation score value, or if the Test score value is smaller but within a relative displacement of less than 0.2 from the Validation score value.
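The stopping rule in f) is easy to state in code (a direct transcription, with the 0.2 relative displacement exposed as a parameter):

```python
def stop_lookahead(val_score, test_score, max_rel_gap=0.2):
    """Stopping rule f): stop when the Test score exceeds the Validation
    score, or falls short of it by a relative displacement below
    max_rel_gap."""
    if test_score > val_score:
        return True
    return (val_score - test_score) / val_score < max_rel_gap
```

So a Test score of 0.7 against a Validation score of 0.8 stops the lookahead (relative displacement 0.125), while a Test score of 0.5 does not (0.375).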
It may happen that the pool of filtered configurations is exhausted, or that the stopping criterion in f) never succeeds. For these cases, we resort to a manual selection, choosing via visual inspection a pool of high-performance configurations on the Validation data. We then regularize this set of configurations by taking the Cartesian product of the projections of the configurations on their components. This new set of configurations is then checked on the Test data, and the best-performing result on the Test data is reported. Note that, in this case, the lookahead number is relatively large, in the order of a few tens. However, such manually selected fingerprints are of interest, since they did perform well on some of the independent sets in LOOCV tests.

Avoiding overfitting
When using minimization of the cross-validation error to perform hyper-parameter optimization (hypothesis selection), we may incur a form of "overfitting". With a large number of hypotheses and in the presence of noisy data, the minimum of the cross-validation error may be attained by fitting the noise in the data rather than the underlying signal [8]. This phenomenon is discussed by A. Y. Ng [8], who also proposes a principled way of choosing an alternative (non-minimal) hypothesis from the cross-validation error data that has more chances of a better generalization error. Here we do not implement the method by A. Y. Ng, but we observe that in our specific problem a hypothesis that balances well the cross-validation error and the generalization error is found very close to the top of the ranking we build, so that it can be identified with just a few lookahead operations.