RERT: A Novel Regression Tree Approach to Predict Extrauterine Disease in Endometrial Carcinoma Patients

Some aspects of the preoperative work-up of endometrial cancer (EC) are still controversial, and the roles played by lymphadenectomy and radical surgery remain debated. Proper preoperative EC staging can help design a tailored surgical treatment, and this study aims to propose a new algorithm able to predict extrauterine disease diffusion. 293 EC patients were consecutively enrolled, and age, BMI, number of children, menopausal status, contraception, hormone replacement therapy, hypertension, histological grading, clinical stage, and serum HE4 and CA125 values were preoperatively evaluated. In order to identify, before surgery, the most important variables able to classify EC patients based on FIGO stage, we adopted a new statistical approach consisting of two steps: 1) Random Forest with its relative variable importance; 2) a novel algorithm able to select the most representative Regression Tree (RERT) from an ensemble method. RERT, built on the above-mentioned variables, provided a sensitivity, specificity, NPV and PPV of 90%, 76%, 94% and 65%, respectively, in predicting FIGO stage > I. Notably, RERT outperformed the predictive ability of HE4, CA125, Logistic Regression and a single cross-validated Regression Tree. This algorithm has great potential, since it better identifies true early-stage patients, thus providing concrete support in the decision-making process about therapeutic options.


Regression Tree
Regression Tree 1 is a non-parametric method that recursively partitions the predictor space into disjoint regions by means of binary splits. It is important to note that the optimization is local: in greedy methods there is no assurance that successive locally optimal decisions lead to the global optimum 3. Moreover, $L(T) = \sum_{m \in \tilde{T}} L(m)$ is the loss function of the entire tree, where $\tilde{T}$ is the set of its terminal nodes.
Having found the best split $s^{*}$, the data are partitioned into two regions and the splitting process is repeated on each of them. This procedure can be carried on until each leaf contains only one case; in this situation we are in the presence of overfitting, and the tree, denoted $T_{max}$, is not a good predictor. Hence, an important issue is the choice of the tree size.
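The size issue above can be illustrated with a short sketch (not the paper's code): scikit-learn's DecisionTreeRegressor on hypothetical synthetic data, comparing an unrestricted tree (which grows toward $T_{max}$, one case per leaf) with a size-limited one. The data and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))                 # two hypothetical predictors
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)  # noisy response

# Unrestricted tree: keeps splitting until every leaf holds a single case
# (this is Tmax, which overfits the training data)
t_max = DecisionTreeRegressor(random_state=0).fit(X, y)

# Size-limited tree: depth and minimum leaf size control model complexity
t_small = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10,
                                random_state=0).fit(X, y)

print(t_max.get_n_leaves(), t_small.get_n_leaves())
```

In practice the tree size is usually chosen by cost-complexity pruning or cross-validation rather than fixed a priori as in this sketch.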

Random Forest
One of the major complaints about tree-based models is their instability: small changes in the predictor distribution can drastically change the structure of the resulting tree. A consequence of unstable methods is high prediction error.
An approach that mitigates this problem and increases the accuracy of the predictors consists of developing a population of simple models, called base or weak learners (in our case trees), on perturbed training sets and combining them in order to form a composite predictor (see Figure S.M.1 for a graphical representation of how these algorithms work). These methods, known as ensemble learning, include, among others, Bagging 5, Boosting 6 and Random Forest 7. The last one, repeatedly used in this paper, has become increasingly popular in medicine, genetics and neurosciences.

Random Forest Algorithm - Regression

    # Set parameters
    BOOT   # number of bootstrap replications
    nmin   # minimum node size
    g      # number of variables selected by the algorithm at each node of the tree

    For i = 1 to BOOT {
        (a) Draw a bootstrap sample boot_i of size N from the training data.
        (b) Grow a tree T_boot_i on the bootstrapped data, by recursively
            repeating the following steps for each node of the tree, until the
            minimum node size nmin is reached:
            (i)   select g variables at random from the r covariates;
            (ii)  take the best split/variable among the g variables available;
            (iii) split the node into two child nodes.
    }

From the ensemble of trees, the prediction at a new point $x$ is

$$\hat{f}(x) = \frac{1}{BOOT} \sum_{i=1}^{BOOT} T_{boot_i}(x).$$

From a Random Forest it is possible to extract two variable importance measures which identify the covariates with the greatest impact on the prediction of the response variable. In this paper we consider only one of them, the Total Decrease in Node Impurity (also known as Gini Importance). To evaluate the discriminatory power of a variable, this measure accumulates the Gini gain over all splits of the trees grown in the forest 8.
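The bootstrap-and-grow loop above can be sketched in a few lines of Python. This is an illustrative hand-rolled version, not the paper's implementation: each tree handles step (i)-(ii) via scikit-learn's `max_features` option, and the ensemble prediction averages the BOOT tree outputs as in the formula above. Function names and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def random_forest_fit(X, y, BOOT=50, nmin=5, g=1, seed=0):
    """Grow BOOT trees, each on a bootstrap sample of the training data."""
    rng = np.random.default_rng(seed)
    N = len(y)
    trees = []
    for _ in range(BOOT):
        idx = rng.integers(0, N, size=N)  # (a) bootstrap sample boot_i of size N
        t = DecisionTreeRegressor(
            min_samples_leaf=nmin,        # (b) stop splitting at node size nmin
            max_features=g,               # (i)-(ii) best split among g random vars
            random_state=int(rng.integers(1 << 31)),
        )
        trees.append(t.fit(X[idx], y[idx]))
    return trees

def random_forest_predict(trees, x_new):
    # Ensemble prediction: average the BOOT tree predictions at x_new
    return np.mean([t.predict(x_new) for t in trees], axis=0)

# Tiny check on synthetic data: with y = 2x, the prediction at x = 0.5
# should land near 1.0
X = np.linspace(0, 1, 100).reshape(-1, 1)
y = 2 * X.ravel()
forest = random_forest_fit(X, y, BOOT=20)
print(random_forest_predict(forest, np.array([[0.5]])))
```

In applied work one would normally use `sklearn.ensemble.RandomForestRegressor`, which implements this same loop with additional refinements.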
In detail, at each tree of the Random Forest, the heterogeneity reductions due to variable $X_r$ over the set of nonterminal nodes are summed up, and the importance of $X_r$ is computed by averaging the results over all the trees of the ensemble. Formally, let $\Delta_r(m, T_i)$ be the decrease in the heterogeneity index due to $X_r$ at the nonterminal node $m$ of tree $T_i$. The variable importance of the $r$-th variable over all the trees is

$$VI(X_r) = \frac{1}{BOOT} \sum_{i=1}^{BOOT} \sum_{m \in M_i} \Delta_r(m, T_i)\, \mathbb{1}\!\left(v(m) = r\right),$$

where $M_i$ is the set of nonterminal nodes of $T_i$, $v(m)$ denotes the variable used to split node $m$, and $\mathbb{1}(\cdot)$ is the indicator function which equals 1 if the $r$-th variable is used to split node $m$ and 0 otherwise.