Generating highly accurate prediction hypotheses through collaborative ensemble learning

Ensemble generation is a natural and convenient way of improving the generalization performance of learning algorithms by pooling their predictive capabilities. Here, we develop the idea of ensemble-based learning by combining bagging and boosting for the purpose of binary classification. Since the former improves stability through variance reduction, while the latter mitigates overfitting, a multi-model that combines both strives toward a comprehensive balancing of the bias-variance trade-off. To push this further, we alter the bagged-boosting scheme by introducing collaboration between the multi-model's constituent learners at various levels. This novel stability-guided classification scheme comes in two flavours: collaboration during or after the boosting process. Applied to a crowd of Gentle Boost ensembles, the ability of the two suggested algorithms to generalize is inspected by comparing them against Subbagging and Gentle Boost on various real-world datasets. In both cases, our models obtained a 40% decrease in generalization error. Their true ability to capture detail in data, however, was revealed through their application to protein detection in texture analysis of gel electrophoresis images, where they achieve an improved AUROC of approximately 0.9773, compared to the 0.9574 obtained by an SVM based on recursive feature elimination.


Subset size constraint
The size of each generated subset must correspond to a predefined fraction of the original training set, denoted by η. Accordingly, this constraint can be represented as |X^(j)| = ηN, for j = 1, . . . , S, where η ∈ (0, 1]. Note that when η → 1, overlaps between some of the subsets may occur. More specifically, if η = 1, each X^(j) will be an identical copy of the original training set. Classifiers trained on such subsets will output the same hypotheses, forming a totally ineffective ensemble which essentially performs identically to any one of its constituent classifiers. Moreover, if instances are allowed to be duplicated within a subset, then for a large value of N the subsets X^(j) become bootstrap samples and the whole process amounts to bootstrap sampling rather than to independent sampling. On the contrary, when η is very small, some instances may not be contained in any subset, resulting in a loss of training information. Therefore, η plays a crucial role in the process of generating subsets which contain a sufficient amount of data for the training of each boosting ensemble.
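The effect of η can be sketched as follows; this is a toy illustration (the dataset and helper names are ours, not the paper's): sampling without replacement at η = 1 makes every subset an exact copy of X, while smaller η yields genuinely different subsets.

```python
# Illustration of the subset-size constraint |X^(j)| = eta * N.
# Sampling without replacement with eta = 1 reproduces the training set
# in every subset (an ineffective ensemble, as noted in the text), while
# eta = 0.5 produces distinct half-size subsets.
import random

def make_subsets(X, S, eta, seed=0):
    rng = random.Random(seed)
    n = round(eta * len(X))                  # subset size eta * N
    return [sorted(rng.sample(X, n)) for _ in range(S)]

X = list(range(20))                          # toy training set, N = 20
full = make_subsets(X, S=3, eta=1.0)         # every subset == X
half = make_subsets(X, S=3, eta=0.5)         # subsets of size 10
```

Sampling *with* replacement at full size (`random.choices`) would instead yield the bootstrap samples mentioned above.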

Class distribution constraint
According to this constraint, the class distribution of X^(j) must be preserved in accordance with that of the original training set X, i.e.,

(1/|X^(j)|) ∑_{(x,y)∈X^(j)} 1_{y=c} = (1/N) ∑_{(x,y)∈X} 1_{y=c}, for each class c ∈ {−1, +1},

where 1_A is an indicator function which returns 0 or 1 depending on whether the event A is an impossible event or a sure one, respectively. It is implied that this must hold for all j = 1, . . . , S. The need for introducing this constraint lies in the possibility that some ensembles become biased towards instances from a certain class. By just sampling S times without replacement, some instances may occur in several subsets. Moreover, if the sampling is done entirely at random and without any restrictions, the class distribution of some subsets might turn out to be highly homogeneous. This is usually the case when the class distribution in the original training set is imbalanced. Therefore, this constraint disallows generating a subset whose class distribution is not proportional to that of X. In other words, each X^(j) represents a stratified random sample from the empirical distribution of X.
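A minimal stratified sampler satisfying this constraint might look as follows; the data and function name are toy assumptions, not the paper's implementation:

```python
# Stratified sampling sketch: each subset X^(j) of size eta * N keeps the
# class proportions of the original training set X (labels in {-1, +1}).
import random

def stratified_subset(X, y, eta, rng):
    idx_by_class = {}
    for i, label in enumerate(y):
        idx_by_class.setdefault(label, []).append(i)
    chosen = []
    for label, idx in idx_by_class.items():
        k = round(eta * len(idx))            # per-class quota
        chosen.extend(rng.sample(idx, k))    # without replacement
    return sorted(chosen)

rng = random.Random(1)
y = [+1] * 60 + [-1] * 40                    # 60/40 class imbalance
sub = stratified_subset(list(range(100)), y, eta=0.5, rng=rng)
pos = sum(1 for i in sub if y[i] == +1)      # positives drawn: 30 of 50
```

The per-class quotas guarantee the subset's class ratio matches the 60/40 ratio of X, regardless of η.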
The weighted sum of squared errors (WSSE) takes all instances into account, both correctly classified and misclassified, instead of only the latter as in AdaBoost, which risks overfitting by focusing solely on the weighted raw misclassification rate. The WSSE also implies that all predictions made by the weak learners carry a certain degree of confidence, since the output is real-valued rather than discrete, which makes the algorithm more flexible and intuitive. Empirical evidence by Friedman suggests that Gentle Boost, as a more conservative algorithm, has performance similar to both the Real AdaBoost and Logit Boost algorithms, and often outperforms them, especially when stability is an issue S2 .

4: for t = 1, 2, . . . , T do
5:     Fit the regression function f_{t,X}(x) by weighted least-squares of each y_i to x_i with weights w_{i,t}.
6:     Update F_X(x) ← F_X(x) + f_{t,X}(x).
7:     Update the weights w_{i,t+1} ← w_{i,t} exp[−y_i f_{t,X}(x_i)]/Z_t, where Z_t is a normalization constant that makes ∑_{i=1}^N w_{i,t+1} = 1.
8: end for
9: Output the final classifier F_X(x) = ∑_{t=1}^T f_{t,X}(x) †
10: end procedure
† We slightly changed the definition of the voting output in order to use confidence-based predictions on all levels of the complex ensemble.
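The boosting loop above can be sketched compactly on 1-D toy data. This is our own minimal sketch (dataset, T, and the closed-form stump fit are assumptions for illustration), mirroring lines 4-9 of Algorithm 1:

```python
# Gentle Boost sketch: regression stump f(x) = a*[x > tau] + b fit by
# weighted least squares, additive update of F, exponential reweighting.
import math

def fit_stump(xs, ys, w):
    best = None
    for tau in sorted(xs)[:-1]:              # drop the largest value (see text)
        w1 = sum(wi for wi, xi in zip(w, xs) if xi > tau) or 1e-12
        w0 = sum(wi for wi, xi in zip(w, xs) if xi <= tau) or 1e-12
        m1 = sum(wi * yi for wi, xi, yi in zip(w, xs, ys) if xi > tau) / w1
        m0 = sum(wi * yi for wi, xi, yi in zip(w, xs, ys) if xi <= tau) / w0
        a, b = m1 - m0, m0                   # weighted least-squares solution
        err = sum(wi * (yi - (a * (xi > tau) + b)) ** 2
                  for wi, xi, yi in zip(w, xs, ys))
        if best is None or err < best[0]:
            best = (err, tau, a, b)
    _, tau, a, b = best
    return lambda x: a * (x > tau) + b

def gentle_boost(xs, ys, T=10):
    n = len(xs)
    w = [1.0 / n] * n                        # line 3: uniform initial weights
    stumps = []
    for _ in range(T):                       # line 4
        f = fit_stump(xs, ys, w)             # line 5
        stumps.append(f)                     # line 6: F <- F + f_t
        w = [wi * math.exp(-yi * f(xi)) for wi, xi, yi in zip(w, xs, ys)]
        z = sum(w)                           # line 7: normalize
        w = [wi / z for wi in w]
    return lambda x: sum(f(x) for f in stumps)   # real-valued F_X(x)

xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [-1, -1, -1, -1, +1, +1, +1, +1]
F = gentle_boost(xs, ys, T=5)
```

The returned F is real-valued, matching the confidence-based output noted in the footnote.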

Weak Learning
A binary classifier is represented by a map f : R^d → {−1, 1} which maps an input instance to a class label. Vapnik-Chervonenkis statistical learning theory suggests that an effective classifier meets three conditions: (a) the classifier is trained on an adequate and sufficient amount of data; (b) the classifier demonstrates a low misclassification rate on the training data; and (c) the classifier is simple. The sampling method described in Section S1.1 satisfies only condition (a). Conditions (b) and (c) are met by the hypothesis of weak learning: a classification model is a weak learner if it demonstrates a misclassification rate lower than 1/2 and predicts the class labels more accurately than random guessing, or formally: Supplementary Definition S1 (Weak learner). Assume that a classifier is trained on X = {(x_1, y_1), (x_2, y_2), . . . , (x_N, y_N)} and outputs the hypothesis f(x_i), for each i = 1, . . . , N. In addition, with some small probability, the classifier's training error is slightly below that of a random guesser. Since the expected training error of a random guessing classifier is 1/2, one can say that if

P[f(x_i) ≠ y_i] ≤ 1/2 − γ

holds for some real γ such that 0 < γ ≤ 1/2, then f(x_i) represents a weak hypothesis, while the classifier is referred to as a weak learner.
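Definition S1 can be stated as a small predicate; the hypothesis and labels below are toy stand-ins:

```python
# Weak-learner check: a hypothesis is weak iff its training error is at
# most 1/2 - gamma for some gamma in (0, 1/2].
def is_weak(preds, ys, gamma):
    err = sum(1 for p, y in zip(preds, ys) if p != y) / len(ys)
    return err <= 0.5 - gamma

ys    = [+1, +1, +1, -1, -1, -1, -1, -1]
preds = [+1, +1, -1, -1, -1, -1, -1, +1]     # 2 mistakes out of 8, err = 0.25
weak = is_weak(preds, ys, gamma=0.2)         # 0.25 <= 0.3 -> weak learner
```

Note that the same hypothesis fails the check for γ = 0.3, since 0.25 > 0.2: the advantage over random guessing is what γ quantifies.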

Regression Stump Weak Learner
A regressive classification model possessing the characteristics of a weak learner is the regression stump. It learns an optimal separating line, not a hyperplane, along a single dimension of the input data and predicts the output more accurately than a random guess of the class label. It differs from a single-node decision tree (decision stump) in the error function minimized by the weak learning algorithm. The regression stump, the function f in line 5 of Algorithm 1, is defined as

f(x) = a 1_{x>τ} + b,     (3)

where τ ∈ R is a threshold value, and a and b are unknown real parameters optimized for any specific value of τ, while f itself is fit using the training data X. Here 1_A ∈ {0, 1} is an occurrence indicator for a random event A. For brevity, we leave τ out and use the notation f_X(x). The boosting round t ∈ [1, T] is omitted because it is clear from context in these definitions. Adapting Equation (3) to our case of classifying an instance x, we get

f_X(x) = a 1_{x^(k)>τ} + b,

where the k-th coordinate x^(k) of x is compared to the threshold value τ.
To train a regression stump it is necessary to search for the threshold τ that yields the smallest possible WSSE ε_t. Let there be given a training dataset X ∈ R^{N×d} that contains N d-dimensional vectors (instances). The domain of the threshold τ is a set T of d sorted lists ℓ_1, . . . , ℓ_d, each containing N − 1 elements in ascending order,

ℓ_k = (x^(k)_{α(1)}, x^(k)_{α(2)}, . . . , x^(k)_{α(N−1)}),   with x^(k)_{α(i)} ≤ x^(k)_{α(i+1)},

where α : N → N is the sort map over the instance indices. The largest element along each dimension of X is deliberately removed in order to avoid the special case of a trivial regression stump that always predicts only one class. For each τ ∈ T it is necessary to optimize the values of the regression coefficients a and b, i.e., the optimal â_τ and b̂_τ are chosen such that

(â_τ, b̂_τ) = argmin_{a,b} ∑_{i=1}^N w_i (y_i − a 1_{x_i^(k)>τ} − b)².

This equation is solved by setting the two partial derivatives with respect to a and b to zero. Therefore, the set of solutions S of the obtained non-homogeneous system satisfies |S| = 1, i.e., there exists a unique solution, such that S = {(â_τ, b̂_τ)}. Although it involves standard linear algebra, we provide a brief theorem for completeness and the reader's convenience. We use vector notation, where w = [w_1, w_2, . . . , w_N]^T represents the weights, 1_{k,τ} = [1_{x_1^(k)>τ}, . . . , 1_{x_N^(k)>τ}]^T is a vector of indicators, and y = [y_1, y_2, . . . , y_N]^T. Additionally, ⟨·,·⟩ denotes the dot (scalar) product of two vectors.

Finally, the optimal regression stump threshold τ* along any dimension k of X, accompanied by its corresponding optimal regression coefficients â_{τ*} and b̂_{τ*}, is chosen such that it minimizes ε_t, i.e.,

(k*, τ*) = argmin_{k, τ∈ℓ_k} ∑_{i=1}^N w_i (y_i − â_τ 1_{x_i^(k)>τ} − b̂_τ)².

The general regression stump algorithm is outlined in Algorithm 2.
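The full search over dimensions and thresholds can be sketched as below; this is our own minimal sketch of the closed-form fit (the toy dataset is an assumption), in the spirit of Algorithm 2:

```python
# For each feature k and candidate threshold tau, fit (a, b) in closed
# form (weighted means above/below tau) and keep the pair minimizing the
# weighted sum of squared errors.
def best_stump(X, y, w):
    d = len(X[0])
    best = (float("inf"), None, None, None, None)   # (wsse, k, tau, a, b)
    for k in range(d):
        for tau in sorted({x[k] for x in X})[:-1]:  # drop per-dim maximum
            above = [i for i, x in enumerate(X) if x[k] > tau]
            below = [i for i, x in enumerate(X) if x[k] <= tau]
            w1 = sum(w[i] for i in above) or 1e-12
            w0 = sum(w[i] for i in below) or 1e-12
            b = sum(w[i] * y[i] for i in below) / w0
            a = sum(w[i] * y[i] for i in above) / w1 - b
            wsse = sum(w[i] * (y[i] - (a * (X[i][k] > tau) + b)) ** 2
                       for i in range(len(X)))
            if wsse < best[0]:
                best = (wsse, k, tau, a, b)
    return best

X = [(0.0, 5.0), (0.2, 4.0), (0.8, 5.0), (1.0, 4.0)]
y = [-1, -1, +1, +1]
w = [0.25] * 4
wsse, k, tau, a, b = best_stump(X, y, w)     # feature 0 separates the classes
```

On this toy set only feature 0 separates the classes, so the search settles on k = 0 with zero WSSE.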

S1.3 Collaboration
In this section we present two margin-based collaborative approaches for bagging of boosting ensembles: Weak-Learner Collaboration (W-CLB) and Strong-Learner Collaboration (S-CLB). Collaboration between individual ensembles is realized through information exchange, i.e., the exchange of one or more instances. Both approaches aim to reduce the upper bounds on the generalization error of a subbagged boosting model composed of S Gentle Boost ensembles, but they differ in the stage at which they occur: W-CLB is injected into the training stage of all Gentle Boost ensembles, while S-CLB operates only on prediction-ready ensembles, i.e., it occurs just after all S ensembles have been fully trained. Not only do the two approaches resist overfitting, they also deliver a substantial reduction of the generalization error rate.

S1.3.1 W-CLB
We define W-CLB as a two-phase "data reorganizing process", whose phases we call the Pruning Phase (Phase I) and the Expansion Phase (Phase II). Phases I and II are repeated consecutively, or more precisely alternately, at most n_exc times; an iterator τ = 0, . . . , n_exc − 1 is introduced to describe the process, as provided below.
Pruning (Phase I). The first step of the W-CLB probe injection at an arbitrary boosting round t involves margin pruning. It consists of sorting the instances within each of the training subsets X^(1), . . . , X^(S) (each of size ηN) with respect to their real-valued margins obtained from the regression stumps f_{t,X^(1)}, . . . , f_{t,X^(S)}, respectively. For each subset X^(j), the process yields

y_{α(1)} f_{t,X^(j)}(x_{α(1)}) ≤ y_{α(2)} f_{t,X^(j)}(x_{α(2)}) ≤ . . . ≤ y_{α(ηN)} f_{t,X^(j)}(x_{α(ηN)}),

where α(·) : N → N is the sorting map over the instance indices.

Second, W-CLB operates on positive margins exclusively; negative margins are omitted from consideration. Therefore, before W-CLB resumes, a filter is applied to remove the negative margins from the sorted list above, resulting in a sub-list of correctly classified instances. Expansion (Phase II). Phase II follows Phase I directly. First and foremost, we ensure that all instance weights within each subset remain unchanged, in their original order and their original assignment to specific instance slots within that subset (we consider an ordered set of instances). At iteration τ = 0, we create a List of Removed Instances (LOR), containing S slots for the positions of removed instances, one per subset. When an instance z_{α(p+τ)} ∈ X^(j) at position p + τ is replaced by another one, we keep the original weight at position p + τ, or simply, the new instance receives the weight of the replaced one. Next, after obtaining the LOR, a search procedure is initiated to scan the rest of the subsets X^(k), k ≠ j. As soon as the search comes across an instance z = (x, y) ∈ X^(k), for any k ≠ j, such that the margin y f_{t,X^(k)}(x) exceeds that of the instance being replaced, it terminates. Finally, the new instance z is copied from X^(k) to X^(j) at position LOR[j]_τ, and τ is incremented to τ + 1.
Phase I and Phase II are defined for j ∈ [1, S]; each phase is repeated individually for every j = 1, . . . , S. Afterwards, τ is incremented by 1 and Phase I is initiated again; this re-initiation is performed while τ < n_exc, or until none of the subsets satisfies the conditions for collaborative instance exchange. After W-CLB fully finishes, the Subbagged Gentle Boost algorithm proceeds normally by updating the weights within each subset.
Algorithm 3 provides the steps of W-CLB for an arbitrary round t. We write F_{X^(j)} to denote the j-th Gentle Boost ensemble being trained on X^(j), hence F^(j) ≡ F_{X^(j)}; the same applies to f^(j), for any j ∈ [1, S]. W-CLB selects at most n_exc instances from each training subset and replaces them by counterparts that display greater margins at the source weak learner. W-CLB is injected into Subbagged Gentle Boost, governed by a parameter p_c, the probability of running W-CLB at each iteration. For simplicity, this probability is spread uniformly over [1, T], such that W-CLB is performed once every 1/p_c rounds of boosting. In Algorithm 3, instance exchange is performed iteratively, i.e., a single instance from each subset at a time. If the swap fails at line 14, W-CLB examines the next smallest margin in each subset, and so on. The variable successes keeps track of the number of successful swaps.
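One exchange round of the scheme above can be sketched as follows. This is a deliberately simplified toy (the per-subset scorers f[j] and the two subsets are made-up stand-ins for the trained stumps and the X^(j)), showing only the core move: replace each subset's smallest positive margin by a better-margin instance from another subset.

```python
# W-CLB sketch, single exchange per subset (n_exc = 1): find the
# correctly classified instance with the smallest positive margin in
# each subset and replace it with an instance from another subset whose
# margin under the *source* learner is larger. Weight slots are kept.
def wclb_round(subsets, f):
    swaps = 0
    for j, Xj in enumerate(subsets):
        margins = sorted((y * f[j](x), i) for i, (x, y) in enumerate(Xj))
        positive = [(m, i) for m, i in margins if m > 0]
        if not positive:
            continue                         # nothing correctly classified
        m_min, i_min = positive[0]           # smallest positive margin
        for k, Xk in enumerate(subsets):
            if k == j:
                continue
            for (x, y) in Xk:
                if y * f[k](x) > m_min:      # better-margin counterpart
                    Xj[i_min] = (x, y)       # weight at slot i_min kept
                    swaps += 1
                    break
            else:
                continue
            break
    return swaps

f = [lambda x: x, lambda x: x]               # toy real-valued scorers
A = [(0.1, +1), (-0.5, +1)]                  # margins 0.1 and -0.5
B = [(0.9, +1), (0.8, +1)]                   # margins 0.9 and 0.8
swaps = wclb_round([A, B], f)                # both subsets find a swap
```

Here both subsets succeed: A's weakest positive margin (0.1) is replaced by B's 0.9-margin instance, and vice versa for B.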
Supplementary Definition S2 (Further-trained classifiers). Let X ∈ Z^N, N > 1, be a training set over which W-CLB is injected at round t of boosting, yielding X′ ∈ Z^N. Assume that f is a regression stump and F is a Gentle Boost ensemble. Then a regression stump (respectively a Gentle Boost ensemble) trained according to Algorithm 4 is called a further-trained regression stump (respectively ensemble), denoted f^h_{t,X′} (respectively F^h_{t,X′}), while at iteration t + 1 there are the ordinarily trained f_{t+1,X′} and F_{t+1,X′}.

Margins.
We adopt an existing definition of the margin of an instance x with respect to a classifier f. Since the regression stump outputs real values, more precisely f ∈ [−1, 1], one can think of the margin as the distance of x from the decision boundary represented by f. The margin of a misclassified instance is negative because sign[f(x)] ≠ y, and positive otherwise.
Supplementary Definition S3 (Margins). Let f ∈ R be the real-valued outcome of a classification algorithm trained on some X ∈ Z^N. Then the margin of z = (x, y) with respect to the decision boundary f is defined as

margin(z, f) = y f(x).

The magnitude of the margin represents the confidence of the decision, while its sign indicates its correctness. In Gentle Boost, the output of the real-valued weak learning algorithm (regression stump) falls between −1 and 1, while the output of the ensemble itself falls between −T and T.

Algorithm 2 (regression stump training; fragment): for each dimension k = 1, 2, . . . , d and each candidate threshold τ ∈ ℓ_k, Equation (6) gives the optimal regression coefficients â_τ, b̂_τ; if ε_τ < ε_{τ*}, the running optimum τ* is updated, and a dimension's best ε_{τ*} replaces the previous overall best whenever ε_{τ*} < ε_prev.

The relationship of W-CLB to margin theory. From SVM theory, it is well known that generalization performance is closely related to the increase of margins. The margin of a classifier is defined as the width of the belt region around the decision boundary enclosed by the instances on each side that are closest to the boundary S3 , that is, the two instances having the smallest margin magnitudes. Moreover, according to Breiman's reasoning, larger minimum margins intuitively imply lower generalization error S4 . Therefore, replacing the instances that define the margin belt by ones that eventually increase it implies lower generalization error. This effect can be thought of as margin relaxation, since the new replacements "relax" a tight margin belt around the decision boundary that is optimal in terms of WSSE. Margin relaxation is often referred to as minimum margin maximization, since it increases the smallest margin magnitudes within the training data. An immediate consequence of margin relaxation is that it enables re-adaptation of the decision boundary represented by f. Therefore f, albeit not necessarily, potentially generalizes better after margin relaxation, which ameliorates the risk of overfitting the training data. Geometrically, relaxation enables translation or rotation of the decision boundary. The magnitude |y f(x)| of an instance margin represents the confidence of the prediction.
Algorithm 3 embodies two design decisions that are, to a certain extent, counterintuitive with respect to margin theory, as well as to common knowledge and sense.
-W-CLB focuses on and penalizes correctly classified instances. More precisely, W-CLB does not perform classical margin relaxation, since it replaces the most uncertain correctly predicted samples. These instances do not necessarily define the margin belt around the optimal f_t, hence W-CLB may fail to widen the belt itself, especially when f_t is cluttered by more than n_exc mispredicted samples with negligible confidence. In that case, the margin belt remains unchanged and W-CLB has a heavily diminished relaxation effect. W-CLB thus stands in contrast to approaches that improve the generalization error through minimal margins.
-On the other hand, W-CLB operates contrary to common knowledge in margin theory because it penalizes correct predictions. Common sense suggests it would be best to replace the instance with the most negative margin (that is, the most confident wrong prediction) by an instance that has a much greater, positive margin. In the next section, we stress this phenomenon and provide several reasons that justify the actual W-CLB approach.
Although this seems counterintuitive, Section S2 makes it clear that highly confident correct predictions can be leveraged to fine-tune an optimal weak decision boundary and, most importantly, improve the overall algorithmic stability of Subbagged Gentle Boost.

S1.3.2 S-CLB
This section describes S-CLB. The collaboration procedure is conducted through T consecutive iterations. At the τ-th iteration of S-CLB, the j-th Gentle Boost ensemble initiates a collaboration procedure with its k-th predecessor within the ensemble sequence, forming a collaboration pair (j, k). Each collaboration pair goes through three steps, described in detail below and presented in Algorithm 5.
Collaboration criterion satisfaction (Step I).
Let F^(τ)_{X^(j,τ)} and F^(τ)_{X^(k,τ)} represent the outcomes of the j-th and the k-th Gentle Boost ensemble, respectively, where X^(j,τ) and X^(k,τ) denote the corresponding training sets at the τ-th iteration. First, the instance-label pairs contained in the j-th ensemble's training set that are not used to train its predecessor are selected, forming the relative complement X^(j\k,τ) = X^(j,τ) \ X^(k,τ); the same is done the other way around. Furthermore, the margins of all instance-label pairs in X^(j\k,τ) with respect to F^(τ)_{X^(j,τ)}, as well as those in X^(k\j,τ) with respect to F^(τ)_{X^(k,τ)}, are computed and arranged into sorted margin sequences whose elements essentially correspond to the elements of X^(j\k,τ) and X^(k\j,τ), respectively. Now, these two base ensembles may proceed to collaborate only if both X^(j\k,τ) and X^(k\j,τ) are non-empty; otherwise, there is no training information that can be exchanged between them in the next step. On the other hand, if the number of instance-label pairs learned by just one of the ensembles is larger than the pre-set parameter n_exc, then n_exc is taken as the maximal number of instances allowed to be exchanged in the first step of the τ-th iteration. This number is denoted by n^(τ)_exc. Accordingly, the collaboration procedure regarding the pair (j, k) may continue only if

n^(τ)_exc = min{n_exc, |X^(j\k,τ)|, |X^(k\j,τ)|} > 0.

Probe exchange of training information (Step II). As mentioned previously, a convenient method of exchanging training information between two base Gentle Boost ensembles is one that exchanges training instances. In our case, the exchange of training instances is realized as instance swapping.
By "swapping" it is meant that each time an instance is removed from a given training subset and added to another one, it must be replaced by an instance drawn from the latter subset. This provides consistency in terms of the fact that the cardinality of each training subset will always remain the same, while its contents may vary. As to the instance swapping between the j-th and the k-th Gentle Boost ensemble at the τ-th iteration, some or all of the top n (τ) exc instance-label pairs from X ( j\k,τ) whose margins with respect to F (τ) X ( j,τ) have the smallest values are swapped with the corresponding ones from X (k\ j,τ) having the smallest margins with respect to F (τ) X (k,τ) . Thus, the indices of these instance from both X ( j\k,τ) and X (k\ j,τ) , are separated into different sets I ( j\k,τ) The former contains the indices of all instances which may potentially be swapped with those instances whose indices are contained in latter. But, these instances are swapped according to a certain swapping order. A potential swapping order is defined simply as a set of pairs such that a given pair (u, v) within the set represents the swap between z u ∈ X ( j\k,τ) and z v ∈ X (k\ j,τ) , where u ∈ I ( j\k,τ) and v ∈ I (k\ j,τ) . Accordingly, the set of all potential swapping orders is the following In essence, if P(A) denotes the relative complement of the power set of a given set A with respect to ∅ and g : P(I ( j\k,τ) ) n,p j → P(I (k\ j,τ) ) n,p k is a bijective function between the elements of the p j -th subset of ( P(I ( j\k,τ) ) n ) and the p k -th subset of ( P(I (k\ j,τ) ) n ) , then each constituent subset of S ( j,k,τ) , i.e., each potential swapping order can be defines as to denote a swapping order and (u p,b , v p,b ) to denote a swapping pair, i.e.,S All of these swapping orders are iterated and a probe exchange of instance-label pairs is conducted according to each of them. 
More specifically, a given potential swapping order S^(j,k,τ)_p is used to update the two training sets; the updated sets represent the modifications of X^(j,τ) and X^(k,τ) after the occurrence of the p-th probe exchange. The procedure is repeated for each p = 1, . . . , n^(τ)_swap.

Satisfaction of a criterion for successful information exchange (Step III).
Upon completion of all probe exchanges, the last step of the τ-th iteration evaluates the success of each of them. It is the most important step in terms of the model's performance improvement, which will be discussed later: the decision regarding the model's state change is made within this step. Essentially, after each probe exchange the model's state is modified, but only temporarily. To decide whether the model's state will be permanently modified, the results of all probe exchanges made within Step II are examined and the success of each is measured. For this purpose, a distance metric is evaluated before and after each probe exchange between the j-th and the k-th Gentle Boost ensemble. Afterwards, the measured distances are used to compare the initial model at the beginning of the τ-th iteration against its modification caused by the probe exchange conducted according to the p-th swapping order, for each p = 1, . . . , n^(τ)_swap. If an optimal swapping order is found in terms of the model's performance, the model's state is modified accordingly and the τ-th iteration is referred to as state-changing.
To quantify the distances, first the losses of both F^(τ)_{X^(j,τ)} and F^(τ)_{X^(k,τ)} are calculated with respect to each instance z_i within the original training set X. Next, the empirical errors of both ensembles are calculated before and after the p-th instance exchange. The empirical error differences of the j-th and the k-th Gentle Boost ensemble, denoted diff^(j,τ)_p and diff^(k,τ)_p respectively, are combined into the error distance measure

dist^(j,k,τ)_p = diff^(j,τ)_p + diff^(k,τ)_p.

Afterwards, the sufficiency of the distance value is examined. This is done by defining an indicator variable with respect to the p-th probe exchange as

I^(j,k,τ)_p = 1_{dist^(j,k,τ)_p ≥ 0}.

Note that the value of the distance measure dist^(j,k,τ)_p, as well as the value of the indicator variable I^(j,k,τ)_p, are calculated for each probe exchange of instances, i.e., for each p = 1, . . . , n^(τ)_swap.
At last, the optimal instance exchange, i.e., the optimal swapping order, is the one that maximizes dist^(j,k,τ)_p. Therefore, the optimal swapping order S^(j,k,τ)_{p*} is determined by solving

p* = arg max_{p} dist^(j,k,τ)_p.

Accordingly, the training sets of both Gentle Boost ensembles are updated using S^(j,k,τ)_{p*}. The procedure is repeated at each iteration τ = 0, . . . , T − 1, i.e., for all pairs of type (j, k), where j = 1, . . . , S and k = j − 1, . . . , 1. Accordingly, the training process of the Subbagged Gentle Boost model consists of T = ∑_{j=2}^S (j − 1) = (S − 1)S/2 iterations overall, while T_sc ≤ T of them are state-changing. Its step-by-step algorithmic description is presented below.
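Step III's selection rule can be sketched as below. The error values are made-up numbers standing in for R_emp before and after each probe exchange; the function name and scoring convention (positive difference = improvement) are our assumptions:

```python
# For each probe exchange p, combine the drop in empirical error of
# ensembles j and k into dist_p; keep the exchange maximizing dist_p,
# but only when dist_p >= 0 (the indicator condition).
def best_probe(err_before_j, err_before_k, probes):
    scored = []
    for p, (err_j, err_k) in enumerate(probes):
        diff_j = err_before_j - err_j        # positive = improvement
        diff_k = err_before_k - err_k
        scored.append((diff_j + diff_k, p))  # dist_p
    dist, p_star = max(scored)
    return (p_star, dist) if dist >= 0 else (None, dist)

# (R_emp of j, R_emp of k) after each of three probe exchanges
probes = [(0.30, 0.28), (0.22, 0.25), (0.35, 0.40)]
p_star, dist = best_probe(0.30, 0.30, probes)    # probe 1 wins
```

Probe 2 worsens both ensembles and would be rejected even if it were the only candidate, since its dist is negative.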

Algorithm 5 Strong-Learner Collaboration (S-CLB)
1: procedure S-CLB(maximum number of instances allowed to be exchanged n_exc, collaboration pair (j, k), outcomes F^(τ)_{X^(j,τ)} and F^(τ)_{X^(k,τ)})
3:     Sort instances with respect to their margins
5:     if n^(τ)_exc > 0 then
9:         Generate the set of all potential swapping orders S^(j,k,τ) using I^(j\k,τ) and I^(k\j,τ)
12:        Exchange instances between the j-th and the k-th ensemble according to S^(j,k,τ)_p
       end if
28: end procedure

S1.4 Stability
In this part we provide a strong theoretical background for collaboration. The work presented here is based on existing stability theory and on upper bounds on the generalization errors of bagging and boosting algorithms.

Notation and Preliminaries
The notation and context in this part are mostly adopted from S5 . Let the sets D ⊂ R^d and Y ⊂ R be the input and output space, respectively. For a binary classification problem, Y is constrained to the values {−1, +1}. Further, let Z = D × Y denote a learning space of input-output pairs. We consider the space of training sets Z^N, representing all training sets X of size N drawn i.i.d. from an unknown distribution D; therefore X ∼ D^N. A learning algorithm A is a function which maps a training (learning) set X onto a function A_X from D to Y. It is assumed that the algorithm A is symmetric with respect to X, meaning that it does not depend on the order of elements in the training set. Furthermore, it is assumed that all functions are measurable and all sets are countable S5 .
For a given training set X of size N, two operations are considered for building a modified training set, for all i = 1, . . . , N:

-By removing the i-th element: X^{\i} = {z_1, . . . , z_{i−1}, z_{i+1}, . . . , z_N};

-By replacing the i-th element: X^i = {z_1, . . . , z_{i−1}, z, z_{i+1}, . . . , z_N}, where z ∈ Z is drawn from D and independent of X.

Unless they are clear from context, the random variables over which probabilities or expectations are taken are specified in the subscript. This way, P_X[·] and E_X[·] denote the probability and expectation with respect to a random draw of the training set X of size N according to the distribution D^N S5 .
In order to measure the accuracy of the algorithm A, we need a measure of loss (a cost function) of a hypothesis f with respect to an instance z = (x, y). We denote the loss of a hypothesis f_X (respectively the algorithm A_X) by ℓ(f_X, z) (equivalently ℓ(A_X, z)) and quantify it by a cost function c : Y × Y → R_+, thus

ℓ(f_X, z) = c(f_X(x), y).

For brevity, the distribution D of X is left out, and sometimes, when clear from context, the subscript denoting the training set is also omitted to simplify notation. Although the leave-one-out error is considered an unbiased estimate of the true error of an algorithm, Bousquet and Elisseeff showed in S5 that their upper bounds on the true, i.e., generalization error based on the leave-one-out error are strikingly similar to those based on the training (empirical) error. Formally, the generalization error R(A, X) of A on the training set X is defined as

R(A, X) = E_z[ℓ(A_X, z)].

Unfortunately, R cannot be computed since the distribution D is unknown; we thus have to estimate it from the available data X S5 . As stated previously, there are two widely used estimators of R(A, X): the classical leave-one-out error estimator and the so-called empirical, i.e., training error. Since the results in S5 are strikingly similar for both, in our work we focus on the latter, R_emp, where

R_emp(A, X) = (1/N) ∑_{i=1}^N ℓ(A_X, z_i).
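The empirical estimator can be sketched in a few lines; the classifier and data below are toy stand-ins, with the 0-1 cost as c:

```python
# Empirical error R_emp with the 0-1 cost: the fraction of instances on
# which the real-valued output disagrees in sign with the label.
def r_emp(f, X):
    return sum(1 for (x, y) in X if f(x) * y <= 0) / len(X)

f = lambda x: 1.0 if x > 0 else -1.0         # toy threshold classifier
X = [(-2, -1), (-1, -1), (0.5, 1), (3, 1), (1, -1)]   # one point mislabeled
err = r_emp(f, X)                            # 1 disagreement out of 5
```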

Definitions of Stability
There are many ways to quantify or define an algorithm's stability, but for the reader unversed in statistical/computational learning theory, stability is simply the tolerance, or more precisely the resistance, of an algorithm to small changes in the training data. A stable algorithm demonstrates very similar generalization performance when trained on different, unalike training sets. The notion of stability was first introduced (though not explicitly S5 ) in 1979 by Devroye and Wagner, who analyzed the error variance of local learning algorithms in Distribution-free inequalities for the deleted and holdout error estimates; referring to S5 , Kearns and Ron defined it and gave it a name in 1999 in S6 . We now continue with formal stability definitions and theorems that provide upper bounds on R as the theoretical basis of W-CLB and S-CLB. For each existing theorem or lemma, we specify the original source and theorem or lemma number that the reader can refer to for the complete proof, which is often too long or involved to present here.
Supplementary Definition S4 (Hypothesis stability S5 ). An algorithm A has hypothesis stability β with respect to the loss function ℓ if the following holds:

∀i ∈ {1, . . . , N}:  E_{X,z}[ |ℓ(f_X, z) − ℓ(f_{X^{\i}}, z)| ] ≤ β.

This is equal to the expected L_1 norm with respect to our unknown distribution D.
Supplementary Definition S5 (Pointwise hypothesis stability S7 ). An algorithm A has pointwise hypothesis stability β with respect to the loss function ℓ if the following holds:

∀i ∈ {1, . . . , N}:  E_X[ |ℓ(f_X, z_i) − ℓ(f_{X^{\i}}, z_i)| ] ≤ β.

With pointwise hypothesis stability, we take the pointwise average (expectation) of the loss perturbation measured on the single instance z_i instead of averaging over the whole Z.
Supplementary Definition S6 (Uniform stability S7 ). An algorithm A has uniform stability β with respect to the loss function ℓ if the following holds:

∀X ∈ Z^N, ∀i ∈ {1, . . . , N}:  ‖ℓ(f_X, ·) − ℓ(f_{X^{\i}}, ·)‖_∞ ≤ β.

It is important to note that β is an inversely proportional quantification of stability: as β decreases, stability increases. Moreover, β is a function of N (sometimes denoted β_N), and the case of interest is when β decreases as 1/N, i.e., β = O(1/N) S5 .
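A back-of-envelope numeric check of this O(1/N) behaviour can be made with a deliberately simple "algorithm": the setup below is our own toy (the algorithm returns the mean label, and the loss perturbation on any z is then just the output perturbation), not anything from the paper.

```python
# Toy stability estimate: the "learning algorithm" outputs the mean
# label, so removing one training point perturbs the output, and hence
# the loss on any fixed z, by O(1/N). beta_hat averages the |perturbation|
# over all leave-one-out removals, as in the hypothesis-stability form.
def beta_hat(ys):
    n = len(ys)
    f_full = sum(ys) / n                     # trained on X
    diffs = []
    for i in range(n):
        f_loo = (sum(ys) - ys[i]) / (n - 1)  # trained on X^{\i}
        diffs.append(abs(f_full - f_loo))
    return sum(diffs) / n

b_small = beta_hat([1, -1] * 10)             # N = 20  -> beta = 1/19
b_large = beta_hat([1, -1] * 100)            # N = 200 -> beta = 1/199
```

The estimate shrinks roughly as 1/N, matching the "case of interest" above.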
In some cases, like boosting, the algorithm A works on weighted data, and we thus define an appropriate weighted notion of stability, the so-called L_1-stability, where the input set is modified in terms of the weight distribution instead of element removal/replacement. Supplementary Definition S7 (L_1-stability S8 ). An algorithm A has L_1-stability λ, or is λ-L_1-stable with respect to the loss function ℓ, if for any two distributions p and q on D the following holds:

E_z[ |ℓ(A_p, z) − ℓ(A_q, z)| ] ≤ λ ‖p − q‖_1.

This definition was not originally based on the expectation, but was instead defined in terms of raw absolute differences in Definition 2.11 S8 ; defining it in terms of expectation does not conflict with the existing theory. Interestingly, L_1-stability is related to hypothesis stability (respectively pointwise hypothesis stability), which is necessary for the analysis that follows.
Supplementary Lemma S1 (Lemma 2.12 S8 ). A learning algorithm has L 1 -stability λ if and only if it has pointwise hypothesis stability 2λ /N. Furthermore, if a learning algorithm has L 1 -stability λ , it has hypothesis stability 2λ /N.
Finally, the following part focuses on upper bounds on the error that hold only in certain circumstances, i.e., with some minimum probability. We thus provide a definition of (β, δ)-stability.

Classification Stability. Bousquet and Elisseeff introduce a modified cost function, applicable to real-valued classification algorithms, which considers the so-called "soft margins". According to S5 , if an algorithm is a real-valued classification algorithm which returns the function f_X, then for any γ > 0, the modified cost function for the prediction of a given pair z = (x, y) is defined as c_γ(f_X(x), y) = 1 if y f_X(x) ≤ 0, c_γ(f_X(x), y) = 1 − y f_X(x)/γ if 0 < y f_X(x) ≤ γ, and c_γ(f_X(x), y) = 0 otherwise; the corresponding loss ℓ_γ(f_X, z) = c_γ(f_X(x), y) is called the classification loss. This soft-margin-based definition of the loss function is a suitable choice for evaluating the quality of the decisions made by a real-valued algorithm for two reasons. First, although it can be adapted for regression problems, it is intended for classification ones. Moreover, it is more "flexible" in its ability to distinguish between reliable and unreliable predictions, rather than focusing only on their accuracy. This is achieved through the function's dependency on γ: the loss increases as the value of f_X approaches zero, and the critical closeness to zero is controlled by γ. Now, with the choice of the loss function settled, we can define the stability measures associated with it. The first and more general stability measure, based only on a classifier's output, is the classification stability, while the second one takes the classification loss into consideration. The two measures are related through γ. The formal definition of the former, as well as the lemma connecting the two measures, are presented in the text that follows.
Supplementary Definition S9 (Classification stability S5 ). Let A be a real-valued classification algorithm which returns the function f. The algorithm A has classification stability β if, for any instance-label pair z = (x, y), all training sets X and all i, |f_X(x) − f_{X\i}(x)| ≤ β. Note that X\i denotes the training set X after the removal of its i-th element, i.e., X\i = X \ {z_i}, meaning that f_X(x) and f_{X\i}(x) represent the outputs of a classifier, trained by means of A on X, with and without learning z_i, respectively.
Supplementary Lemma S2 (Lemma 16 S5 ). A real-valued classification algorithm A with classification stability β has uniform stability β/γ with respect to the classification loss function ℓ_γ.
The proof of the lemma presented in S5 states that c_γ is a 1/γ-Lipschitzian function with respect to its argument representing the classifier's output f_X(x) for any instance x. Consequently, ℓ_γ is also 1/γ-Lipschitzian. Thus, for all i, all training sets X, and all z = (x, y), |ℓ_γ(f_X, z) − ℓ_γ(f_{X\i}, z)| ≤ (1/γ) |f_X(x) − f_{X\i}(x)| ≤ β/γ. We can thus see that γ controls the connection between the classification and uniform stability of A; more precisely, it regulates their ratio.
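For concreteness, the piecewise-linear soft-margin cost c_γ described above can be sketched in Python (a minimal illustration; the function name and signature are ours, not part of the original algorithms):

```python
def soft_margin_loss(f_x: float, y: int, gamma: float = 1.0) -> float:
    """Soft-margin classification cost c_gamma.

    Returns 1 when the margin y*f(x) is non-positive, 0 when it exceeds
    gamma, and interpolates linearly in between, which is exactly what
    makes the cost 1/gamma-Lipschitzian in f(x).
    """
    margin = y * f_x
    if margin <= 0:
        return 1.0
    if margin >= gamma:
        return 0.0
    return 1.0 - margin / gamma
```

A small γ makes the cost drop to zero quickly once the margin is positive; a large γ keeps penalizing predictions whose confidence |f(x)| is small.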

Upper Bounds on the Generalization Error
This part is a brief overview of existing proven upper bounds on the generalization error R of learning algorithms. These bounds are based on VC dimensions and were introduced by Bartlett et al. in S9 . Later, Bousquet and Elisseeff S5;S7 presented upper bounds based on stability, applicable to a large class of learning algorithms, including real-valued classification algorithms and penalizing algorithms, also known as regularization algorithms. Our work involves stability-based upper bounds of R. Starting from existing AdaBoost bounds given in S8 , we extend stability notions to Gentle Boost and Subbagged Gentle Boost. The following theorem, which applies to every majority hypothesis F regardless of how it is computed, is from S9 (provided for completeness): Supplementary Theorem S2 (Theorem 1 S9 ). Let X be a sample of N examples chosen independently at random according to D. Assume that the base (weak) hypothesis space H is finite, and let δ > 0. Then with probability at least 1 − δ over the random choice of the training set X, every weighted average function F satisfies, for all θ > 0, P_D[yF(x) ≤ 0] ≤ P_X[yF(x) ≤ θ] + O( (1/√N) ((log N log |H|)/θ² + log(1/δ))^{1/2} ).

More generally, for finite or infinite H with VC dimension d, the following bound holds as well: P_D[yF(x) ≤ 0] ≤ P_X[yF(x) ≤ θ] + O( (1/√N) ((d log²(N/d))/θ² + log(1/δ))^{1/2} ). Note that in the theorem above, P_D[yF(x) ≤ 0] = R for AdaBoost, where the loss function ℓ is taken to be the Heaviside function of −yF(x), i.e., the raw classification loss.

Upper Bounds of Deterministic Algorithms
Switching back to stability theory, there are two major classes of upper bounds on R that are independent of the VC dimension of the base hypotheses: regression-based and classification-based. As mentioned previously, for the latter, the authors in S5 introduce a modified loss function ℓ_γ(A_X, z) over Z for a real-valued classification algorithm A. This is the notion of the so-called soft margins, which will later be used to provide the theoretical background of S-CLB. On the other hand, in the case of W-CLB, we are solely interested in sign[f(x)], and we thus apply the regression case.
Supplementary Theorem S3 (Theorem 12 S5 ). Let A be an algorithm with uniform (resp. hypothesis and pointwise hypothesis) stability β with respect to a loss function ℓ such that 0 ≤ ℓ(A_X, z) ≤ M, for all z ∈ Z and all sets X. Then, for any N ≥ 1 and any δ ∈ (0, 1), the following bounds hold with probability at least 1 − δ over the random draw of the sample X: for uniform stability, R ≤ R_emp + 2β + (4Nβ + M)√(ln(1/δ)/(2N)), and for hypothesis (resp. pointwise hypothesis) stability, R ≤ R_emp + √((M² + 12MNβ)/(2Nδ)). This theorem gives tight bounds when the stability β scales as 1/N S5 . Applying the regression notion to classification introduces neither misconceptions nor erroneous theory, since classification outcomes are simply a coarser notion of regression outcomes.

Upper Bounds of Randomized Algorithms
We now examine the stability and error bounds of randomized algorithms; typical examples include bagging, subbagging and the random forest. This follows up on the work of Elisseeff et al. S7 . We focus on the subbagging variant, whose differences with regard to classical bagging were described earlier in Section S1.1. In a nutshell, a subbagging algorithm uses S subsamples for training, X^(1), …, X^(S) ⊆ X, of size p ≤ N, drawn uniformly and without replacement (no duplicates allowed). In our algorithm we take p = ηN. The base model that is being subbagged is also referred to as the base machine; in the case of our model, Gentle Boost is used as the base machine, regardless of whether W-CLB or S-CLB is injected. The upper bounds on the generalization error R(A, X), where A is a symmetric randomized algorithm that uses a training set X, with respect to its outcome f_X and some loss function ℓ(A_X, z), for all z ∈ Z, are the same as the bounds for deterministic algorithms in Theorem S3; this is ensured by Theorem 6 S7 . Therefore, one is interested only in how the stability of a randomized algorithm relates to the stability of its base machine. Before we proceed, recall that a function f is said to be B-Lipschitzian if |f(u) − f(v)| ≤ B|u − v| for all u, v in its domain. Supplementary Proposition S1 ((Pointwise) Hypothesis stability of subbagging for regression, Proposition 4.4 S7 ). Assume that the loss ℓ is B-Lipschitzian. Let Φ_X be the outcome of a subbagging algorithm whose base machine is symmetric and has hypothesis (resp. pointwise hypothesis) stability β_p with respect to the classification loss ℓ, and subbagging is done by sampling p points without replacement. Then, the random hypothesis (resp. pointwise hypothesis) stability β_N of Φ_X with respect to the loss function ℓ is bounded by β_N ≤ B p β_p / N. An example algorithm that exploits this property was proposed in S10 .
The Heaviside and the squared loss, the latter being used by the regression stump, are both 1-Lipschitzian, so we can safely take B = 1 S7 . The ℓ_{γ=1} loss is 2-Lipschitzian according to the original work, Proposition 4.4 S7 . However, ℓ_1 is never used here, and we thus have β_N ≤ β_p p/N for the W-CLB flavour. More importantly, the bounds in Theorem S3 and all notions of stability depend only on the maximum value M of ℓ and on B, not on the nature of ℓ itself.
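The subsample-drawing step and the stability bound discussed above can be sketched as follows (the function names are ours; the bound β_N ≤ B·p·β_p/N is the one stated in Proposition S1, with B = 1 for the losses used here):

```python
import random

def draw_subsets(X, eta: float, S: int, seed: int = 0):
    """Draw S subsets of size p = eta*N uniformly WITHOUT replacement,
    as required by the subbagging scheme (no duplicates within a subset)."""
    rng = random.Random(seed)
    p = int(eta * len(X))
    return [rng.sample(X, p) for _ in range(S)]

def subbagging_stability_bound(beta_p: float, p: int, N: int, B: float = 1.0) -> float:
    """Upper bound on the random (pointwise) hypothesis stability of the
    subbagged ensemble: beta_N <= B * p * beta_p / N."""
    return B * p * beta_p / N
```

For example, with η = 0.2 and a base machine of stability β_p, the ensemble's stability bound shrinks by the factor p/N = 0.2, illustrating why subbagging a less stable base machine still yields a stable ensemble.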

Supplementary Proposition S2 ((Pointwise) Hypothesis stability of subbagging for classification, Proposition 4.5 S7 ). Let Φ X be the outcome of a subbagging algorithm whose base machine is symmetric and has hypothesis (resp. pointwise hypothesis) stability β p with respect to classification loss, and the subbagging is done by sampling p < |X | = N points without replacement. Then, for the random hypothesis (resp. pointwise hypothesis) stability β N of Φ X with respect to a 1-Lipschitzian loss function ℓ, the following inequality holds

Stability of AdaBoost
This part focuses on the interplay between weakness and stability in AdaBoost. To the best of our knowledge, the only established results on the stability of AdaBoost are those of S8 and S11 . The authors in S8 showed that AdaBoost is almost-everywhere stable. Here, we extend their theory by applying it to Gentle Boost for the purpose of W-CLB and prove that the same stability results prevail. As a corollary, we also infer (and prove) that the stability results of S8 hold regardless of the risk function minimized by the weak learner in boosting. Let H_N = { f_q | q has support of size at most N } be the set of all possible classifiers that can be generated by an algorithm A for each possible q, where q is a weight distribution over the training set, and H is the space of weak classifiers, such as the regression or decision stump.

In addition, an algorithm A is said to be weak with respect to a distribution D if Weak D (A) > 0.
This is the same notion as Definition S1. Therefore, any non-perfect predictor is regarded as weak.

S2.1 Why W-CLB Works
In this section we provide proofs of the effectiveness of W-CLB and describe how it yields a better upper bound on the generalization error. We will often focus on a single Gentle Boost ensemble trained on some subset X^(j) ⊂ X of size p ≤ N; hence we omit the (j) superscript when clear from context, for readability and notational simplicity. We now consider Gentle Boost's stability. To the best of our knowledge, no stability notions have been established for Gentle Boost. Theorem 5.8 S8 holds when the class labels are {0, 1} and each weak learner f_t minimizes the weighted absolute error ε_t = ∑_i w_i |f_t(x_i) − y_i|. However, the stability of Gentle Boost is slightly different and potentially worse.
Supplementary Theorem S5 (Stability of Gentle Boost). Suppose that the weak learner A has L_1-stability λ and let ε* = Weak_D(A)/2 > 0. Then, for sufficiently large N and for all T, Gentle Boost in T rounds is (β, δ)-stable. Our Theorem S5 ultimately states that Gentle Boost is less stable than AdaBoost, as a consequence of the squared loss minimized by each constituent weak learner. The following lemma is an immediate consequence.

Supplementary Lemma S3 (Pointwise hypothesis stability of Gentle Boost). Suppose that for sufficiently large N, for all T, and for some X ∈ Z^N, Gentle Boost in T rounds boosts a weak learner with pointwise hypothesis stability β_w. Then, Gentle Boost has pointwise hypothesis stability β with probability at least 1 − δ over the random draw of X. We now have all the necessary tools to derive new upper bounds on the generalization error R of Subbagged Gentle Boost.
Supplementary Theorem S6 (Generalization error upper bound of Subbagged Gentle Boost). Assume that the loss function ℓ is B-Lipschitzian and 0 ≤ ℓ(Φ_X, z) ≤ M, for all z ∈ Z, where Φ_X is the outcome of a subbagging algorithm whose base machine is Gentle Boost. Next, assume that subbagging is done by sampling S sets of size p < N from some X ∈ Z^N uniformly and without replacement. Now, let the weak learning algorithm A have (pointwise) hypothesis stability β_w with respect to ℓ and let ε* = Weak_D(A)/2 > 0. Then, for sufficiently large p and for all T, the stated bound holds for Subbagged Gentle Boost in T rounds with probability at least 1 − δ over the random draw of X ∼ D^N. Theorem S6 can be applied to other kinds of base machines, regardless of their underlying learning algorithm. Recalling Theorem S6, the upper bound is a sum of two elements: the empirical error and the overhead cost of stability. The ultimate goal of W-CLB is to reduce both through a collaborative exchange of instances that improves pointwise hypothesis stability. We consider the monotonic exponential loss, which is an upper bound of the misclassification loss 1_{−y f(x) ≥ 0} S9 . Moreover, the exponential loss is minimized by Gentle Boost S2 .
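The claim that the exponential loss dominates the misclassification indicator is easy to verify numerically (a self-contained illustration; the helper names are ours, not part of the original algorithms):

```python
import math

def exp_loss(margin: float) -> float:
    """Exponential loss exp(-y*F(x)), expressed as a function of the
    margin y*F(x)."""
    return math.exp(-margin)

def zero_one_loss(margin: float) -> float:
    """Misclassification indicator 1[-y*F(x) >= 0], i.e., 1 when the
    margin is non-positive."""
    return 1.0 if margin <= 0 else 0.0

# The exponential loss upper-bounds the misclassification loss everywhere:
# exp(-m) >= 1 for m <= 0, and exp(-m) > 0 for m > 0.
for m in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert exp_loss(m) >= zero_one_loss(m)
```

This domination is what makes decreases of the empirical exponential loss meaningful: they force down an upper bound on the empirical misclassification rate as well.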
(1) Lower upper bound. The term "lower" refers to the lower empirical error of the complex W-CLB ensemble Φ. The upper bound in Theorem S6 consists of the empirical error plus a numeric term involving stability. In this sense, W-CLB generates an ensemble that has a lower empirical error than classical Subbagged Gentle Boost. The effect of lowering the error comes from the explicit substitution of small-margin instances. In each of the S Gentle Boost ensembles, there are exactly p c T regression stumps having a lower empirical (training) error than before. The final output Φ is the outcome of the subbagging ensemble, constructed from S Gentle Boost base machines F^(1), …, F^(S), and is computed by taking the sign of their average. Let R^{W-CLB}_emp(A_X) denote the empirical error of an algorithm that outputs A on any level of the complex ensemble, trained on X, when W-CLB is injected into the training process, where A is either a regression stump, Gentle Boost or Subbagged Gentle Boost.
Supplementary Theorem S7 (Monotonic minimization of the exponential loss by Gentle Boost). Let t be the current round of boosting, let F(x) be the outcome of a Gentle Boost algorithm from the previous t − 1 rounds of training on a dataset X ∈ Z^N, and assume that f_t(x) is the outcome of a real-valued weak learning algorithm added to the ensemble. Then, with respect to the exponential loss, classical boosting procedures yield a lower exponential loss at each round. When W-CLB replaces some z_i ∈ X by some z′_i at round t of boosting, there are two distinct consequences. First, consider the new empirical error of f_t, trained on X but measured with respect to the modified subset; hence ε′_{t,partial} ≤ ε_t. In addition, it is clear that the further-training Algorithm 4 cannot result in a worse error: if ε′_t is the empirical error of f_{t,X′}, it follows that ε′_t ≤ ε′_{t,partial}. Second, using Equation (22) also results in a lower empirical error with respect to the total exponential loss, i.e., Equation (25). Equation (25) intuitively leads to a lower Gentle Boost error, and we formally prove that in fact it does.
Supplementary Theorem S8 (W-CLB yields almost-everywhere lower empirical exponential loss of Gentle Boost).
Let t be the current round of Gentle Boost with an outcome F_{t,X}(x) = ∑_{s=1}^{t} f_{s,X}(x), and assume that W-CLB is injected after training f_{t,X}(x), i.e., between rounds t and t + 1, yielding f^h_{t,X′} and F^h_{t,X′}, respectively. Then, with a high probability ω, W-CLB yields a lower empirical Gentle Boost error R^{W-CLB}_emp(F_{t+1,X′}) at round t + 1 with respect to the exponential loss ℓ(F_{t+1,X′}, z_i), z_i ∈ X′ (26).

Proposition S3 implies a lower expected value of the absolute loss difference in Definition S5. In essence, it decreases the lower bound of the pointwise hypothesis stability β_w of the weak learner. Although better stability is guaranteed for the replaced instance, the absolute loss differences at the other positions may be perturbed; for this reason we stabilize them with the Further-Training algorithm.
A natural question concerning the seemingly counterintuitive design of W-CLB is "Why not replace negative margins by positive ones?" The answer consists of two parts: Conservatism. The idea comes from the Gentle Boost algorithm itself. Gentle Boost (Algorithm 1) is a conservative version of Logit Boost, which takes large minimization steps in pure regions, or, more precisely, learns the training data more quickly S2 . We apply the same conservative idea here: replacing a negative margin by a positive one has an overwhelming effect on reducing the empirical error, but potentially induces overfitting. Mild improvements such as those of W-CLB are both effective and sufficiently strong to improve the generalization error.
Challenge. All incorrectly classified instances remain in the subset. With this, we force the Gentle Boost ensemble to accept the challenge of learning them, instead of providing it with instances it easily classifies correctly. In fact, W-CLB makes a trade-off between improving knowledge and forcing it, while preventing faked knowledge. This claim is also numerically confirmed. The further-training Algorithm 4, operating immediately after W-CLB, ensures that the regression stump at which W-CLB has been injected either preserves its previous optimal parameters or obtains better ones. Therefore, it ensures a greater or equal sum of the instance margins in the modified training subset at round t. An immediate consequence concerns the weights: when f^h_{t,X′} = f_{t,X′}, each incorrectly classified instance is heavier than its counterpart at round t + 1 when W-CLB is not injected. On the other hand, each new instance added by W-CLB gets a lower weight than its counterpart at round t + 1. Thus, it makes sense to ensure that all incorrectly classified instances get larger weights in the next iteration. Moreover, the total increase of the weights of misclassified instances equals the total decrease of the weights of the newly added instances with larger margins. Increasing negative margins with W-CLB does not guarantee this, i.e., it makes the next regression stump "forget" some of the incorrect predictions by reducing their weights, which is indeed considered faking knowledge. These claims hold exactly when f^h_{t,X′} = f_{t,X′}, and otherwise in a slightly looser sense. We now analyze the worst-case performance of Subbagged Gentle Boost with W-CLB and compare it to Gentle Boost. The Regression Stump algorithm, described in Algorithm 2, has O(d²N²) worst-case performance, where N is the size of the training set and d is the number of features, i.e., the dimensionality of the population space.
Hence, Gentle Boost in T rounds of boosting has O(d²N²T) worst-case performance, based on Algorithm 1.
Analogously, the Subbagged Gentle Boost algorithm has O(d²η²N²TS) worst-case complexity. It is worth noting that this is usually better than Gentle Boost's performance because it involves only portions of the training data. Injecting W-CLB leads to the following worst-case performance of our algorithm: O(d²η²N²TS + S²ηN n_exc + d²n²_exc), since it takes O(S²ηN n_exc) to exchange instances using Algorithm 3 and O(d²n²_exc) to apply Algorithm 4 as a final step. In our experimental evaluation of the proposed method we have observed that the computational overhead imposed by W-CLB is often justified by performance better than that of T rounds of gentle boosting over the whole training set X.
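The complexity comparison above can be made concrete with small helpers (illustrative only; constants are ignored and the function names are ours):

```python
def gentle_boost_cost(d: int, N: int, T: int) -> float:
    """Worst-case cost O(d^2 N^2 T): T rounds of regression stumps on
    the full training set of size N with d features."""
    return d**2 * N**2 * T

def subbagged_cost(d: int, N: int, T: int, S: int, eta: float) -> float:
    """Worst-case cost O(d^2 eta^2 N^2 T S): S boosted ensembles, each
    trained on a subset of size eta*N."""
    return d**2 * (eta * N)**2 * T * S

# With eta = 0.2 and S = 5, the subbagged scheme performs only
# eta^2 * S = 0.2 of the work of boosting over the full training set.
ratio = subbagged_cost(10, 1000, 50, 5, 0.2) / gentle_boost_cost(10, 1000, 50)
assert abs(ratio - 0.2) < 1e-9
```

Whenever η²S < 1, subbagging is asymptotically cheaper than plain Gentle Boost, which is the regime in which the W-CLB overhead terms remain affordable.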

S2.2 Why S-CLB Works
In order to explain how stability theory can be applied to improve the generalization performance of our model, let us assume that the data in X is used to train a model based on subbagging of Gentle Boost ensembles, i.e., a Subbagged Gentle Boost. Moreover, let us assume that our proposed random sampling strategy presented in Section S1.1 is used to divide X into S different data subsets X^(1), X^(2), …, X^(S) of equal size, and that multiple Gentle Boost ensembles are trained such that X^(j) plays the role of a training set for the j-th ensemble, for all j = 1, …, S. Since a Gentle Boost ensemble is itself a classifier produced by an ensemble method, its training can essentially be seen as the usage of a learning algorithm A which outputs a hypothesis based on the knowledge gathered from the training data. Analogously, in our case, the training of each Gentle Boost ensemble is conducted by the Gentle Boost algorithm A, which returns the hypothesis function F_{X^(j)}. As in Section S1.3.1, we adopt F_{X^(j)} ≡ F^(j). It is important to note that, in order to apply stability theory to bound the generalization error of a given classification model, the corresponding learning algorithm must be symmetric with respect to the data in its training set. It is known that boosting algorithms are symmetric, which justifies the application of stability theory in our case. So, if a Gentle Boost ensemble is trained on each subset X^(j) by means of A, whose outcome F^(j) obviously does not depend on the instance order in X^(j), then Theorem S3, initially introduced by Bousquet and Elisseeff as Theorem 12 S5 , can be applied to bound the generalization error of A. But, as stated in the theorem, the suggested bounds hold only if the algorithm used for training each base machine within the subbagging scheme is a real-valued one.
This means that F^(j) must be a real-valued function such that the label of a given instance x is predicted by taking the sign of F^(j)(x), for each j = 1, …, S, as defined in Equation (??). Note that according to this definition, F^(j)(x) does not directly represent the label predicted for x by the j-th ensemble; rather, it represents the confidence that the j-th ensemble has in this prediction. This way of defining F^(j)(x) enables the usage of the classification loss. As to the decision-making, the final label predicted by the j-th ensemble is sign[F^(j)(x)]. As stated in Section 4.2.2 S5 , a good real-valued classification algorithm is one that produces outputs whose absolute values truly represent the confidence it has in a certain prediction. Considering this and the nature of boosting algorithms, we can see that the real-valued output F^(j)(x) is intentionally chosen such that, for any instance x, |F^(j)(x)| is a true representative of the confidence for predicting the instance label sign[F^(j)(x)]. Moreover, choosing F^(j)(x) as in Equation (??) makes the j-th Gentle Boost ensemble eligible for performance evaluation in terms of the classification loss. Consequently, both classification and uniform stability can be used to measure the stability of each ensemble. The theorem that follows provides an upper bound on the generalization error of our model regardless of the stability measure choice.
Supplementary Theorem S9 (Classification-loss-oriented upper generalization error bound of Subbagged Gentle Boost). Let ℓ_T(Φ_X, z) be a T-Lipschitzian classification loss function, for all z ∈ Z, where Φ_X : R^d → R is the outcome of a real-valued Subbagged Gentle Boost model consisting of S base Gentle Boost ensembles, each of which is trained using T > 1 weak learners. Then, for any N ≥ 1 and any δ ∈ (0, 1), the stated bound holds with probability at least 1 − δ over the random draw of a training set X, where β_p is the stability of the base Gentle Boost ensemble with respect to ℓ_T and η = |X^(j)|/|X|.
The following discussion explains the reasons for proposing the S-CLB procedure; in other words, it explains why S-CLB works and how this approach lowers and potentially tightens the upper bound of the generalization error of the whole model. This is done by separately analyzing the rationale of each step within the procedure, resulting in a single fused discussion.
Assuming that the j-th and the k-th ensemble are about to collaborate at the τ-th iteration, they must first satisfy the collaboration criterion defined in Step I. For this purpose, the margins of all instances in X^(j\k,τ) and X^(k\j,τ) are sorted separately. It is known that, given an instance-label pair z = (x, y), the generalization error R of a boosting ensemble trained on a sample whose examples are chosen independently at random according to a distribution D is defined as the probability of x having a negative margin, i.e., R = P_D[yF(x) ≤ 0].
Since the j-th and the k-th ensemble are trained using a boosting algorithm (in this case, Gentle Boost), i.e., they are essentially boosting ensembles, increasing the margins of the instances in the relative complements X^(j\k,τ) and X^(k\j,τ) should, theoretically, reduce the generalization errors of these ensembles. This is the case because X^(j\k,τ) ⊆ X^(j,τ) and X^(k\j,τ) ⊆ X^(k,τ), so by increasing the margins of the instances in the relative complements, the margins of some instances from the training sets are increased as well. However, these datasets may also contain instances whose margins already have sufficiently large values, and any further increase of those values is of low importance and priority: maximizing an already large margin causes a far less significant change in an ensemble's generalization performance than maximizing the minimal margins. Boosting algorithms do tend to produce ensembles with large minimum margins S12 , but even after all margins attain relatively large values, further increasing them is still a positive change towards reducing the ensemble's generalization error, as Breiman states in S4 . Therefore, the concept of margins is used to measure the contribution of a given instance to the reduction of the generalization error of its parent ensemble, and thus of the whole model. In our case, this is achieved by choosing the top n_exc^(τ) instances from the margin sequences sorted with respect to F^(τ)_{X^(j,τ)} and F^(τ)_{X^(k,τ)}, separately, as the ones to be exchanged in Step II.
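The selection of minimal-margin candidates in Step I can be sketched as follows (a minimal sketch; `F` stands for any real-valued ensemble output and the function name is ours):

```python
import numpy as np

def select_min_margin(F, X: np.ndarray, y: np.ndarray, n_exc: int) -> np.ndarray:
    """Return the indices of the n_exc instances with the smallest
    margins y_i * F(x_i), i.e., the candidates offered for exchange
    in Step II."""
    margins = y * F(X)
    return np.argsort(margins)[:n_exc]  # ascending: smallest margins first
```

For instance, with three points whose margins come out as [1, -2, 3], the single worst candidate is the second point, which is exactly the instance a collaborating ensemble would receive.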
After selecting the instances with the minimal margins from both X^(j\k,τ) and X^(k\j,τ), some or all of the top n_exc^(τ) minimal-margin instances from the former are swapped with the corresponding ones from the latter. Swapping the instances with minimal margins is chosen as the method for exchanging training information between the j-th and the k-th Gentle Boost ensemble for two reasons: • The first is the need to keep consistency with the concepts from stability theory used to derive the model's generalization error bound in Theorem S9. More precisely, the bound holds only if η = |X^(s,τ)|/|X| for each s = 1, …, S and τ = 1, …, T. This means that each instance removed from X^(j\k,τ) must be replaced by exactly one instance from X^(k\j,τ), in order to sustain the original cardinality of X^(j,τ) and X^(k,τ) and to keep the equal-subset-size constraint satisfied.
• The second reason is supported by the fact that only instances from the relative complements X^(j\k,τ) and X^(k\j,τ) with minimal margins are eligible for swapping. Considering this, it is clear that each instance removed from X^(j,τ) is replaced by one contained in X^(k,τ) but not in X^(j,τ), and vice versa. This swapping principle guarantees that no instance will be duplicated within a single training subset.
As to the instance swapping itself, it is conducted using the set S^(j,k,τ) of all potential swapping orders of at most n_exc^(τ) swapping pairs. The optimal swapping order may not be the one according to which exactly n_exc^(τ) instances from X^(j\k,τ) are swapped with n_exc^(τ) instances from X^(k\j,τ); it could instead involve swapping fewer than n_exc^(τ) instances from both sets. Hence, S^(j,k,τ) contains all swapping orders of n swapping pairs, for each n = 1, …, n_exc^(τ). An additional motivation for generating S^(j,k,τ) is the combinatorial nature of Step II, as well as the fact that without it the training process would incur vast computational complexity. A worst-case scenario would be one in which X^(j,τ) and X^(k,τ) are disjoint, i.e., X^(j,τ) ∩ X^(k,τ) = ∅, while the maximal number of instances allowed to be exchanged equals the number of instances allocated to each data subset. This would require calculating all possible swapping orders of at most ηN instances and searching for the optimal one by exchanging instances according to each of them. Given the number of possible swapping orders in this scenario, for a large value of ηN, i.e., in the case of massive data subsets, swapping instances between ensembles according to these orders and retraining them afterwards would simply be infeasible. On the other hand, just a few instances swapped between the ensembles can already improve their individual ability to generalize. Hence, the improvement of the overall generalization performance is achieved not by increasing the number of swapping orders examined per iteration, but by employing a larger number of collaborations (iterations) between the base ensembles. However, a compromise regarding the generation of S^(j,k,τ) can still be made in terms of the training algorithm's execution time. Since n_exc^(τ) remains unmodified throughout all τ = 1, …, T, the sets of swapping orders of n pairs, for each n = 1, …, n_exc^(τ), can be generated in advance and later used as "lookup tables" during the whole procedure. By performing this simple yet useful technical trick, a decent complexity reduction can be achieved.
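The precomputed "lookup table" of swapping orders described above can be sketched with itertools (a simplification of Step II; the positional pairing convention and the function name are ours):

```python
from itertools import combinations

def swapping_orders(idx_j, idx_k, n_exc):
    """Enumerate all swapping orders of at most n_exc pairs: choose n
    candidate indices on each side and pair them positionally, for
    n = 1, ..., n_exc. Because n_exc is fixed across iterations, this
    set can be generated once and reused as a lookup table."""
    orders = []
    for n in range(1, n_exc + 1):
        for a in combinations(idx_j, n):
            for b in combinations(idx_k, n):
                orders.append(tuple(zip(a, b)))
    return orders
```

For two candidates on each side and n_exc = 2, this yields four single-pair orders plus one two-pair order, i.e., five orders in total, which is the small search space examined at each collaboration.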
Finally, Step III determines whether a swapping order is optimal. This is done by measuring the distance defined in Section S1.3.2 (Step III) and comparing its value with the one obtained before instances were exchanged according to a certain swapping order. The distance measure itself is defined in a conservative fashion, such that an information exchange is considered successful only if the distance does not worsen after the exchange has been made. The contribution of this collaboration-regulatory principle is shown through the mathematical statements that follow.
Supplementary Theorem S10 (Monotonicity of the empirical error estimate). Let Φ_X : R^d → R be the outcome of a real-valued collaborative Subbagged Gentle Boost model trained on X. If S-CLB is used as the method for collaboration between its constituent Gentle Boost ensembles, then R^T_emp(Φ^(τ)_X, X), as a function of τ, monotonically decreases as τ increases.
Note that the proof of the above theorem states that R^T_emp(Φ^(τ)_X, X) is a monotonically decreasing function of τ; the step by which its value decreases between iterations τ and τ + 1 is determined by the maximal error distance measured at iteration τ, for each τ = 1, …, T − 1.
Supplementary Corollary S2 (Almost-everywhere lower classification-loss-oriented upper bound of Subbagged Gentle Boost). The cumulative S-CLB approach yields an approximately lower upper bound on the generalization error of Φ_X with high probability,

if its base machine is already stable.
The inequality holds if β_p has a constant value, or when the decrease in the empirical error is more significant than the potential increase in the value of the stability measure.
Supplementary Proposition S4. Let F_{T,X} be the outcome of a Gentle Boost algorithm trained on X in T boosting rounds, acting as a base machine of a real-valued Subbagged Gentle Boost. Then, given two positive integers T′ and T′′ such that T′ ≤ T′′, the stated inequality holds for any instance z_i = (x_i, y_i) ∈ X that is correctly classified by both F_{T′,X} and F_{T′′,X}. Similarly to Proposition S3, Proposition S4 also entails a lower expected absolute loss difference; but unlike the former, which refers to the lowest level of the Subbagged Gentle Boost, the latter may contribute to decreasing the lower bound of the pointwise hypothesis stability β_p of the Gentle Boost base machine. The above proposition also suggests that, for a large value of T, even after R^T_emp(Φ^(T)_X, X) reaches 0, the base machine's pointwise hypothesis stability may continue to improve. Moreover, due to the boosting nature of the underlying base machines, for a sufficiently large number of boosting iterations T per machine, the decrease of R^T_emp(Φ^(T)_X, X) becomes more significant than the change in the stability measure's value. In other words, the stronger the Gentle Boost machine is, the stabler it gets.
A brief note on the complexity of S-CLB. The worst-case performance of S-CLB can be derived in a bottom-up fashion. As stated in the complexity analysis of W-CLB at the end of Section S2.1, a regression stump (Algorithm 2) is trained in $O(d^2 N^2)$, meaning that $T$ rounds of gentle boosting a regression stump take $O(d^2 N^2 T)$. Now, subbagging $S$ Gentle Boost ensembles leads to the worst-case complexity in (32), while the collaboration between each pair of ensembles has the complexity in (33). The first term in the collaboration complexity refers to the instance-exchange process between a pair of ensembles, while the second represents the complexity of retraining both ensembles after the instances have been exchanged. By combining (32) and (33), while considering that S-CLB is conducted through $T = (S-1)S/2$ consecutive iterations, we obtain the worst-case complexity of an S-CLB-guided collaborative Subbagged Gentle Boost.
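As a back-of-the-envelope check, the operation counts above can be sketched in code. This is a sketch under stated assumptions only: big-O constants are taken as 1, each subset has size $p = \eta N$, and the collaboration cost per pair is modeled as an exchange term plus the cost of retraining both ensembles. The exchange-cost model `n_exc * p` and the parameter values are illustrative placeholders, not the paper's equations (32)-(33).

```python
def gentle_boost_ops(d, n, T):
    """Worst-case ops for T rounds of gentle boosting a regression stump: O(d^2 n^2 T)."""
    return d**2 * n**2 * T

def s_clb_ops(d, N, T, S, eta, n_exc):
    """Rough worst-case operation count for an S-CLB-guided collaborative
    Subbagged Gentle Boost (constants dropped; assumptions in the lead-in)."""
    p = int(eta * N)                                  # subset size per ensemble
    subbagging = S * gentle_boost_ops(d, p, T)        # train S ensembles independently
    # Per collaborating pair: exchange n_exc instances, then retrain both ensembles.
    pair_cost = n_exc * p + 2 * gentle_boost_ops(d, p, T)
    n_pairs = (S - 1) * S // 2                        # consecutive collaboration iterations
    return subbagging + n_pairs * pair_cost

print(s_clb_ops(d=10, N=1000, T=50, S=5, eta=0.5, n_exc=4))
```

Note how, for growing $S$, the retraining term of the $(S-1)S/2$ collaboration iterations quickly dominates the initial subbagging cost.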

S3 Supplementary Data Description
All nine datasets used throughout the experimental stage of this research encompass real-world tasks. A description of each is provided below.
• The Australian Credit Approval dataset contains data about credit card applications. It was initially provided by a large bank whose name is confidential. Each instance in the dataset represents an application for a credit card, consisting of customer information. The challenge is to classify a customer as (in)eligible for credit card approval. It is worth mentioning that this dataset is also considered noisy.
• The Breast Cancer Wisconsin dataset concerns cancer diagnosis. More precisely, it contains data on patients with breast tumours, used to predict whether a patient's tumour is non-cancerous or cancerous, i.e., benign or malignant, respectively. The data was collected in portions, periodically, by Dr. William H. Wolberg at the University of Wisconsin Hospitals, and was later aggregated into a single dataset.
• The Pima people (American Indians originating from southern Arizona) were examined for the presence of diabetes, and their patient records were assembled into the Diabetes dataset. All patients were females, and all of them were at least 21 years old. The diagnostic, binary-valued variable representing the presence of diabetes is used to forecast the onset of diabetes mellitus in this high-risk population of Pima Indians.
• The Statlog (Heart) is a small dataset containing medical data that can be used to determine the absence or presence of heart disease.
• The Ionosphere dataset was collected by a system in Goose Bay, Labrador, targeting free electrons in the Earth's ionosphere. It consists of radar data used to classify radar returns from the ionosphere as either "Good" or "Bad". Good radar returns are those showing evidence of some type of structure in the ionosphere, while those that do not are considered bad.
• BUPA Medical Research Ltd. used blood test records to construct the Liver Disorders dataset, such that each data instance refers to a record of a single male individual. These blood tests were sensitive to liver disorders caused by excessive alcohol consumption.
• The Lung Cancer data concerns classification between malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) of the lung. There are 181 tissue samples (31 MPM and 150 ADCA), such that each sample is described by 12533 genes.
• The Mammographic Mass dataset incorporates data generated from mammogram screening for breast cancer diagnosis. A BI-RADS (Breast Imaging Reporting and Data System) assessment, the patient's age and three BI-RADS attributes define a single data instance. Each instance is also associated with the ground truth (the severity field). The primary goal is to use this information to predict the severity of a patient's mammographic mass lesion, i.e., to determine whether it is benign or malignant. Moreover, these predictions can also be used to calculate the sensitivities and associated specificities, which indicate how well a CAD system performs compared to radiologists.
• The Congressional Voting Records dataset, as its title implies, contains votes of U.S. House of Representatives Congressmen from the second session of the 98th Congress in 1984. Essentially, the problem comes down to classifying each voting record as "republican" or "democrat" based on the 16 key votes identified by the Congressional Quarterly Almanac.

Tables

Supplementary Table S1. Summary of the nine datasets.

S4.1 Parameter Value Selection
The parameter values shown in Table S2 were chosen using an intuitive trial-and-error approach that was, for the most part, driven by the dataset characteristics. We use three major starting points to choose these values. The first baseline for choosing $\eta$ is the dataset size: we choose larger values for small training sets, and vice versa. For instance, Mammographic Mass is the largest dataset, implying smaller values of $\eta$, while for Lung Cancer, the smallest dataset by size, we use significantly larger values approaching 1. Next, we assess the parameter values by the number of tentative collaborations observed to be successful, again using a trial-and-error approach, chosen along with $n_{exc}$. Lastly, $\eta$ was chosen to improve the robustness in terms of $T$ and $S$ for W-CLB and S-CLB, respectively. In the case of W-CLB, a larger value of $T$ increases the number of collaborations, while for S-CLB this is done by $S$. In other words, $T$ and $S$ define the collaboration timeframe. Sections S2.1 and S2.2 provide a detailed complexity analysis. Regarding the effect the model parameters have on its complexity, an additional analysis across different parameter sets was performed to examine how the computational operation count changes with different values of the most significant parameters; it is recapitulated in Figures S1 and S2 for W-CLB and S-CLB, respectively. We anticipate applying parameter meta-optimization and search strategies (e.g., grid search) in our future work.
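For concreteness, the role of $\eta$ in subset generation can be sketched as follows. This is a minimal illustration of subbagging-style sampling, assuming each of the $S$ subsets has size $\mathrm{round}(\eta N)$ and is drawn without replacement independently per ensemble; it is our sketch, not the paper's implementation:

```python
import random

def make_subsets(n_instances, S, eta, seed=0):
    """Draw S index subsets of {0, ..., n_instances-1}, each of size
    round(eta * n_instances), sampled without replacement within a subset
    (subbagging-style). Overlaps between subsets may occur as eta -> 1."""
    rng = random.Random(seed)
    size = max(1, round(eta * n_instances))
    indices = list(range(n_instances))
    return [sorted(rng.sample(indices, size)) for _ in range(S)]

subsets = make_subsets(n_instances=20, S=3, eta=0.5)
assert all(len(s) == 10 for s in subsets)            # each subset holds eta*N instances
assert all(len(set(s)) == len(s) for s in subsets)   # no duplicates within a subset
```

Note that with $\eta = 1$ every subset degenerates into an identical copy of the full index set, which reproduces the ineffective-ensemble case discussed in the subset size constraint.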

S5 Supplementary Proofs
Proof of Supplementary Theorem S1
Proof. Start by reinterpreting the theorem: the system of equations obtained by setting the partial derivatives of $\varepsilon_t$ with respect to $a$ and $b$ to zero has a unique solution. Let $A$ be the coefficient matrix of the system. By calculating the partial derivatives and rearranging the terms after some algebraic operations, we obtain the coefficient matrix
$$A = \begin{pmatrix} \langle \mathbf{w}, \mathbb{1}_{k,\tau} \rangle & \langle \mathbf{w}, \mathbb{1}_{k,\tau} \rangle \\ \langle \mathbf{w}, \mathbb{1}_{k,\tau} \rangle & \lVert \mathbf{w} \rVert_1 \end{pmatrix}.$$
Now, applying Cramer's Rule yields that the system has exactly one unique solution if and only if $\det(A) \neq 0$. We now have to show that $\det(A) \neq 0$.
1. $\langle \mathbf{w}, \mathbb{1}_{k,\tau} \rangle \neq 0$ for any $k \in [1, d]$, because at least one indicator must be equal to 1;
2. $\langle \mathbf{w}, \mathbb{1}_{k,\tau} \rangle \neq 1$ for any $k \in [1, d]$, because at least one indicator must be equal to 0, which holds even when $\tau$ is the least (first) element of any $\ell_k$ in the set $T$.
It follows from the definition of the set $T$ that $\det(A) \neq 0$.

Proof of Supplementary Theorem S5
Proof. Let $P = [p_1 \; p_2 \; \ldots \; p_N]^T$ and $P' = [p'_1 \; p'_2 \; \ldots \; p'_N]^T$ be two weight distributions over a training set $X$ of size $N$, and let $y \in \{-1, +1\}$. Now, let $a$ be an unweighted cost $c(f, y)$ for the loss $\ell(f, z)$ of $f$ on $y$ with respect to $p$, at an arbitrary round of boosting, where $z = (\mathbf{x}, y) \in \mathcal{Z}$. We now use Inequality (6) in Lemma 5.3 of S8 and adapt it to accommodate Gentle Boost. From the $L_1$-stability $\lambda$ of the weak learner, the stated bound follows, because $0 \le (f_X(\mathbf{x}_i) - y_i)^2 \le 4$ since $-1 \le f_X(\mathbf{x}) \le 1$.
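The final bound in the proof rests on a one-line estimate, which we spell out for completeness:

```latex
% Since -1 \le f_X(\mathbf{x}) \le 1 and y_i \in \{-1, +1\}:
\lvert f_X(\mathbf{x}_i) - y_i \rvert
  \;\le\; \lvert f_X(\mathbf{x}_i) \rvert + \lvert y_i \rvert
  \;\le\; 1 + 1 \;=\; 2
\quad\Longrightarrow\quad
0 \;\le\; \bigl(f_X(\mathbf{x}_i) - y_i\bigr)^2 \;\le\; 4 .
```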

Proof of Supplementary Theorem S6
Proof. Let $\ell$ be a $B$-Lipschitzian loss function with respect to its first variable and let $0 \le \ell(\Phi_X, z) \le M$ for all $z \in \mathcal{Z}$. Therefore, if our subbagging algorithm has pointwise hypothesis stability $\beta_N$ with respect to $\ell$, then applying Theorem S3 yields an upper bound for the generalization error of $\Phi_X$. From Proposition S1, the pointwise hypothesis stability $\beta_N$ at $\ell$ is bounded above by $\beta_N \le B \beta_p \frac{p}{N}$.
The base machine used is the $\beta_p$-stable Gentle Boost, where the value of $\beta_p$ is obtained from Lemma S3. This completes the proof, and the upper bound on $R(\Phi_X)$ follows immediately.
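For reference, one standard form of a pointwise-hypothesis-stability generalization bound is the following (Bousquet and Elisseeff); we assume Theorem S3 instantiates a bound of this shape, possibly with different constants:

```latex
% With pointwise hypothesis stability \beta_N and 0 \le \ell \le M,
% with probability at least 1 - \delta over the random draw of X:
R(\Phi_X) \;\le\; R_{\mathrm{emp}}(\Phi_X, X)
  \;+\; \sqrt{\frac{M^2 + 12\, M N \beta_N}{2 N \delta}},
\qquad \text{where } \beta_N \;\le\; B\,\beta_p\,\frac{p}{N}.
```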

Proof of Supplementary Proposition S3
Proof. We prove Equation (30) by analyzing the first-order partial derivative of the absolute loss difference with respect to $y f(\mathbf{x})$. Let $\varepsilon$ denote the deviation of the margin $y_i f(\mathbf{x}_i)$ when the corresponding instance $z_i \in X$ is replaced by $z \in \mathcal{Z}$. Henceforth, since the exponential function is positive over $\mathbb{R}$, Equation (39) holds regardless of $\varepsilon$, or, more precisely, regardless of whether $\varepsilon$ reduces, increases, or even flips the sign of the margin $y f(\mathbf{x})$.
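The sign argument amounts to the following observation, assuming the exponential loss $\ell(f, z) = e^{-y f(\mathbf{x})}$ that Gentle Boost optimizes:

```latex
\frac{\partial}{\partial\,(y f(\mathbf{x}))}\; e^{-y f(\mathbf{x})}
  \;=\; -\, e^{-y f(\mathbf{x})} \;<\; 0
\qquad \text{for all } y f(\mathbf{x}) \in \mathbb{R},
```

so the loss is strictly decreasing in the margin, whatever the value of the perturbation $\varepsilon$.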

Proof of Supplementary Theorem S9
Proof. Let $\Phi_X$ be the outcome of a real-valued Subbagged Gentle Boost model, trained on a training set $X$ of size $N \ge 1$, that has a uniform (resp. hypothesis and pointwise hypothesis) stability $\beta^u_N$ with respect to a loss function $\ell$ such that $0 \le \ell(\Phi_X, z) \le M$, for all $z \in \mathcal{Z}$. According to Theorem S3, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the random draw of $X$, the generalization error of the overall subbagged model $R(\Phi_X, X)$ is bounded from above. Now, if $\ell$ is a classification loss function, i.e., $\ell(\Phi_X, z) = \ell_\gamma(\Phi_X, z)$, $\forall z \in \mathcal{Z}$, then by plugging Lemma S2 into the previous expression and considering the fact that $\ell_\gamma$ is bounded by $M = 1$, we obtain the corresponding upper bound based on the classification loss, where $R^\gamma$ and $R^\gamma_{\mathrm{emp}}$ represent the adequate error estimates with respect to $\ell_\gamma$, while $\beta^c_N$ denotes the model's classification stability. In addition, from the proof of Theorem 17 in S5, we know that, regardless of the choice of loss measure, the loss-independent generalization error $R(\Phi_X, X)$ is bounded accordingly, $\forall z \in \mathcal{Z}$. In order to provide an upper bound on $R(\Phi_X, X)$ that holds regardless of the model's stability measure type, a simplified and more convenient upper-bound expression is needed. The most straightforward way to achieve this is to choose $\gamma$ such that $\beta^u_N = \beta^c_N = \beta_N$, which can be done by simply choosing $\gamma = 1$. Consequently, the simplified bound follows, where $\beta_N$ is now the stability of $\Phi_X$ with respect to $\ell_1$. The main limitation of the bound presented above is the fact that both $R^1_{\mathrm{emp}}(\Phi_X, X)$ and $\beta_N$ are based on the $\ell_1$ loss measure. Obviously, the $\ell_1$ measure's output is a true representative of the loss of a real-valued algorithm $A$ with respect to an instance-label pair $z = (\mathbf{x}, y)$ when its margin with respect to $A$ falls between 0 and 1. But, since the output of a Gentle Boost ensemble of $T$ weak learners ranges from $-T$ to $T$, the values of $y F_{X^{(j)}}(\mathbf{x})$ will fall in the same range, for each $j = 1, \ldots, S$.
Consequently, the corresponding margins satisfy $y\Phi_X(\mathbf{x}) = \frac{1}{S}\sum_{j=1}^{S} y F_{X^{(j)}}(\mathbf{x}) \in [-T, T]$, $\forall z = (\mathbf{x}, y) \in \mathcal{Z}$. Therefore, a more suitable way to measure the loss of the whole model and its constituent ensembles is to use the $\ell_T$ classification loss function. So, let $\beta_p$ denote the stability of the model's base machine with respect to $\ell_T$. We consider the fact that $\ell_1$ is 1-Lipschitzian w.r.t. its first variable $\Phi_X$, which was shown in the proof of Lemma S2. Thus, by applying Proposition S2, we obtain the following upper bound, where $p$ is the size of each data subset $X^{(j)}$ used to train a single base machine, i.e., $p = |X^{(j)}|$, for each $j = 1, \ldots, S$.
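For reference, the clipped classification loss family used above is commonly defined as follows; we assume the paper's $\ell_\gamma$ matches this standard definition from the stability literature:

```latex
\ell_\gamma(f, z) \;=\;
\begin{cases}
1, & y f(\mathbf{x}) \le 0,\\[2pt]
1 - \dfrac{y f(\mathbf{x})}{\gamma}, & 0 < y f(\mathbf{x}) \le \gamma,\\[2pt]
0, & y f(\mathbf{x}) > \gamma,
\end{cases}
```

so that $\ell_1$ matches margins in $[0, 1]$, while $\ell_T$ rescales the margin axis to the $[-T, T]$ output range of a $T$-learner Gentle Boost ensemble.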

At last, consider the difference between the values of the two classification loss measures. Due to the boosting nature of the underlying ensembles, they must be composed of at least two base learners, i.e., $T > 1$ must be satisfied. Taking this into account, and given the fraction $\eta = p/N$, replacing $p$ in the previous expression yields the resulting upper bound.