Decision making on vestibular schwannoma treatment: predictions based on machine-learning analysis

Decision making on the treatment of vestibular schwannoma (VS) is based mainly on the symptoms, tumor size, patient’s preference, and experience of the medical team. Here we provide objective tools to support the decision process by answering two questions: can a single checkup predict the need for active treatment, and which attributes of VS development are important for the decision on active treatment? Using a machine-learning analysis of the medical records of 93 patients, these objectives were addressed with two classification tasks: a time-independent case-based reasoning (CBR), in which each medical record was treated as independent, and a personalized dynamic analysis (PDA), in which we analyzed the development of each patient’s state over time. Using the CBR method we found that the Koos classification of tumor size, the speech reception threshold, and pure tone audiometry collectively predict the need for active treatment with approximately 90% accuracy; in the PDA task, the increases in Koos classification and VS size alone were sufficient. Our results indicate that the need for VS treatment may be reliably predicted using only a small set of basic parameters, even without knowledge of the individual development, which may help to simplify VS treatment strategies, reduce the number of examinations, and increase cost-effectiveness.

Vestibular schwannoma (VS) is the most common tumor of the temporal bone. It is a benign, mostly solitary and slowly growing tumor arising from the Schwann cells of the vestibular portion of the 8th cranial nerve. VS accounts for approximately 80% of tumors of the cerebellopontine angle and around 8-10% of intracranial tumors 1 . The symptomatology of VS is caused mainly by the compression or destruction of the surrounding structures and an obstruction of the flow of cerebrospinal fluid, and comprises mainly asymmetric hearing loss 2,3 , unilateral tinnitus 4 , or balance disorders and headache 5 .
There are two possible approaches to a patient with a VS: a wait-and-scan (WaS) strategy, during which the patient undergoes regular checkups with no active treatment, and an active treatment of the tumor. Long WaS monitoring might eventually lead to an increased tumor size and a subsequently complicated operation; however, if the VS does not progress, such conservative management is economical and harmless to the patient. The active treatment (surgery or radiotherapy) is more beneficial in smaller tumors 6 . Although there is always a chance that the tumor will not grow and no intervention will be necessary, postponing active intervention (even with relatively small tumor growth) can worsen the results [7][8][9] . Therefore, an untimely decision on active treatment might lead to poorer results and unnecessary costs.
At the initial diagnosis and during the subsequent regular checkups, a number of diagnostic variables are gathered. Based on these variables and their dynamics, a decision on further treatment is made. However, the contributions of the individual variables to the final decision may vary; furthermore, for some variables the static values are important, while for others the dynamic change is the key. Knowledge of these principles is therefore essential for an informed treatment decision.

Figure 1. An outline of the methodological process used in the analysis. After cleaning the data, the problem was solved in two parallel tasks (CBR and PDA). Using several feature selection methods followed by expert evaluation, the most important predictors of active VS treatment were identified. The identified set of predictors was processed by several classification methods to create models capable of predicting the active VS treatment based on the predictor values. The performance of the models was analyzed using various performance metrics.

For the 53 patients who were retained in the wait-and-scan regime, the median duration of the overall investigation period was 51 months, including 5 checkups. Within the actively treated group, the median duration of the wait-and-scan regime was 37 months and included 3 checkups. The raw data obtained by commonly used diagnostic techniques were organized in a table, where each row represented a single diagnostic checkup which either resulted in active treatment or not, and where columns corresponded to diagnostic variables as follows: • Pure tone audiometry [PTA (dB)]-pure-tone hearing thresholds measured separately for each ear at eight frequencies from 0.25 to 8 kHz. Several other functions were examined in the patients (auditory brainstem response (ABR), otoacoustic emissions (OAE), vestibular function); however, they were either not recordable (ABR, OAE) or were not consistently provided over the course of time, and were therefore excluded from the current analysis.
Subjective characteristics of the patients, such as vertigo or tinnitus, were also gathered but were not included in the current analyses. The current study was designed as entirely non-parametric and data-driven; therefore, to avoid any possible subjectivity, we purposely suppressed the influence of non-deterministic factors, including the patients' subjective characteristics. For the same reason, all incomplete records were removed instead of artificially imputing the missing values. Additionally, the phase of data transformation was omitted, as it usually leads to the normalization or equalization of data distributions. Although these restrictions caused the loss of some information, this approach avoids unjustified biases, is fully repeatable and extendable, and as such represents a core baseline model, which can later serve as a reliable benchmark for comparison with alternative ways of processing, including traditional parametric statistical techniques.
Data processing-general. The applied methodology follows the general Knowledge Discovery in Databases process, introduced in 11 or 12 . The data were processed with supervised, internally transparent machine learning methods as follows:
• No a priori assumptions concerning the cumulative characteristics of the data were made, so the presented results are not biased by any artificial modifications, such as imputations or transformations.
• Only complete records were selected for further processing.
• Two complementary approaches were applied to discover knowledge hidden in a multi-dimensional space: (1) a static, anonymized CBR, and (2) a personalized PDA.
• CBR assigns single medical records (rows) to the binary target decisions on the treatment (WaS/active); it considers neither the characteristics of individual patients nor their history of VS progress.
• PDA also performs binary classification, but works with the temporal courses of selected variables taken from the complete WaS checkup history of single patients. Thus, every processed sample summarizes the complete column-wise WaS history of the given patient.
• An interactive reduction of dimensionality (feature selection), preserving the meaning of and relations among the original variables, was performed to exclude the less significant features, simplify the problem, and increase the generalization capabilities of the resulting structures.
• The data for all supervised learning tasks were equally balanced with respect to the target, and randomly divided into training, validation, and test sets in the proportion 50:30:20. The first two partitions were used for learning and optimization of the desired type of discrimination function; the last subset contained unseen data and served for the numeric evaluation of classification performance.
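The class-balanced 50:30:20 partitioning described above can be sketched as follows. This is an illustrative re-implementation, not the SAS Enterprise Miner workflow the authors actually used; the function name `balanced_split` and the per-class shuffling strategy are our own assumptions.

```python
import random

def balanced_split(labels, seed=42):
    """Split sample indices into train/validation/test (50:30:20) per class.

    Shuffling and cutting each class separately keeps the partitions
    balanced with respect to the binary target (WaS/active), as in the paper.
    """
    rng = random.Random(seed)
    train, valid, test = [], [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n = len(idx)
        a, b = int(0.5 * n), int(0.8 * n)   # 50% / 30% / 20% cut points
        train += idx[:a]
        valid += idx[a:b]
        test += idx[b:]
    return train, valid, test
```

With, say, 10 WaS and 10 active records, each class contributes 5, 3, and 2 indices to the training, validation, and test partitions, respectively.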
The supervised elimination of redundant features. An initial reduction of dimensionality was performed in all classification tasks, using five techniques implemented with the StatExplore, HP Random Forest, Gradient Boosting, Variable Selection, and HP Variable Selection nodes of SAS Enterprise Miner: (1) decision or classification tree [13][14][15] with Chi-square split 16 .
In addition to these algorithms, an expert (manual) selection of the most significant features was performed, which is also the main output of the knowledge elicitation phase. At the end of this iterative process, we proposed a minimal set of variables efficiently characterizing the analyzed problem, based on the outputs of the five algorithmic methods above. The primary criterion for selecting a given variable was its occurrence among the best ten candidates, which had to be greater than or equal to 3, or its average ranking, which had to be less than or equal to 5. Promising combinations of such preliminarily selected candidates were interactively analyzed, to eliminate the least significant members and maximize the credibility of the discovered knowledge.
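The preliminary filtering criterion above (occurrence among the top ten of at least 3 methods, or an average rank of at most 5) can be expressed compactly. The function below is a sketch of that rule only; the function name `preselect` and the input format (one ordered top-10 list per selection method, best variable first) are our assumptions.

```python
from collections import defaultdict

def preselect(rankings):
    """Apply the paper's preliminary selection rule to per-method top-10 lists.

    rankings: list of ordered variable lists, one per selection method,
    best-ranked variable first. A variable passes if it occurs in >= 3 lists,
    or if its average rank (1 = best) over the lists where it appears is <= 5.
    """
    counts, ranks = defaultdict(int), defaultdict(list)
    for top10 in rankings:
        for rank, var in enumerate(top10, start=1):
            counts[var] += 1
            ranks[var].append(rank)
    return sorted(v for v in counts
                  if counts[v] >= 3 or sum(ranks[v]) / len(ranks[v]) <= 5)
```

The surviving candidates would then be combined and pruned interactively, as described in the text.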
Supervised learning and classification. In the classification stage we used the following techniques: (1) Decision tree, random forest, gradient boosting and logistic regression, all referred to in the previous section. (2) Support vector machine with radial basis function kernel [30][31][32] .
The optimal classifier was selected as the best performing combination of the six feature selection techniques given in the previous section (logistic regression, decision tree, random forest, gradient boosting, LASSO, and interactive expert selection) with the six types of classifiers given here.

Performance metrics.
To evaluate classification performance, several indicators were used:
• Accuracy (ACC)-the rate of correct classification for the evaluated data set: ACC = (TP + TN) / (P + N), where TP is the number of true positive, TN true negative, FP false positive, and FN false negative cases; P is all real positive cases (P = TP + FN) and N is all real negative cases (N = TN + FP).
• Sensitivity (also recall or true positive rate, TPR)-the ability to correctly classify TP cases: TPR = TP / P.
• Specificity (also selectivity or true negative rate, TNR)-the ability to correctly classify TN cases: TNR = TN / N.
• Precision (also positive predictive value, PPV)-the rate at which a predicted positive is a TP: PPV = TP / (TP + FP).
• Area under the Receiver operating characteristic curve (AUC) 35,36 . Practically applicable classifiers should have AUC > 0.6, while AUC > 0.9 indicates an excellent performance.
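The confusion-matrix indicators above are standard and can be computed directly; the sketch below, with an illustrative (not paper-derived) confusion matrix, shows how they relate.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute ACC, TPR, TNR, and PPV from a binary confusion matrix."""
    p, n = tp + fn, tn + fp          # all real positives / negatives
    return {
        "ACC": (tp + tn) / (p + n),  # rate of correct classification
        "TPR": tp / p,               # sensitivity (recall)
        "TNR": tn / n,               # specificity
        "PPV": tp / (tp + fp),       # precision
    }

# Illustrative counts only, not taken from the study's results:
m = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

AUC, in contrast, is not a single confusion-matrix quantity: it integrates the TPR/FPR trade-off over all decision thresholds.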

Results
The general diagnostic data of the patients included in the analysis are illustrated in Fig. 2. These graphs show the number of subjects having a certain result of ABR and distortion products of OAE (DPOAE) examinations, as well as subjective characteristics such as hypacusis or tinnitus. Figure 3A depicts averaged audiograms recorded from both healthy and VS ears during the initial examination, plus the average audiogram of the diseased ears recorded immediately before the change from wait-and-scan to active treatment. Figure 3B shows the histogram of Koos grades recorded during the initial examination in wait-and-scan patients, patients who were later changed to active treatment, and in the actively treated patients recorded immediately before the change from wait-and-scan to active treatment. The section below summarizes the results of the two interrelated analytic phases, dimensionality reduction including knowledge extraction, and supervised learning for both CBR and PDA experiments.

CBR-dimensionality reduction and knowledge extraction.
The output of this method is a set of the most important diagnostic characteristics (variables) along with their significant values. The method aims to provide a transparent set of rules which, using the values of the selected variables, can be applied directly to support the decision on VS treatment.
Initially, the dimensionality of the full set of CBR variables was reduced with five algorithmic methods (see Table 1). Each of the methods provided 10 variables rated as the most important for the prediction of VS treatment. Using the variables suggested by the algorithmic methods, we manually performed an expert ranking, resulting in an initial version of a reduced set of variables (denoted as CBR EXP INI). By interactively minimizing this initial set, we finally proposed a minimum set of variables (CBR EXP FIN) necessary for the reliable prediction of VS treatment. Table 2 shows the performance for the different sets of variables; it is evident that the removal of unnecessary variables actually improves the prediction accuracy, and furthermore, the output generated by the expertly selected features is comparable with the average performance of the three best automated supervised classifiers and feature selectors, marked as CBR CLASS (see Tables 5 and 6 in the next section). In addition, Table 2 presents the quality of adaptation on known samples (an average of performance on training and validation data).

Based on the aforementioned findings, we can claim that knowledge of the Koos classification, the SRT, and three PTA-derived variables provides sufficient information for a reliable VS surgery decision, even in the case of a single medical checkup. It may therefore be feasible to exclude clinical tests of the less significant features, which can make the daily diagnostic routine faster and cheaper.
Using the individual variable values, it is now possible to decide whether to perform active treatment (Yes decision) or not (No decision). An important question is what the boundary values of the variables are, i.e., at which level each variable switches the decision from No to Yes. The answer, however, is not unique, because the selected features can be assembled into numerous structurally different solutions with comparable performances. One possible solution is given in Fig. 4 and, in detail, in Table 3. By traversing this binary decision tree according to the rules, we finally arrive at the decision in the leaves; the decision accuracy in the leaf nodes is approximately 80%. The ability of the decision tree to also handle missing (N/A) values is yet another advantage of this technique. An example of several CBR records taken from our data, with the corresponding decisions, is shown in Table 4.
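The traversal of such a tree amounts to a short cascade of threshold tests. The sketch below mirrors the general shape of this idea (Koos first, then audiometric variables, with a route for missing values); every threshold in it is a hypothetical placeholder, since the actual split values are those given in Table 3 of the paper.

```python
def cbr_decision(koos, srt_db, pta_diff_db):
    """Traverse a CBR-style decision tree to a Yes/No treatment decision.

    All thresholds below are HYPOTHETICAL placeholders for illustration;
    the real splits are defined in Table 3. Missing (N/A) values are passed
    as None and routed to a default branch, illustrating the tree's ability
    to handle incomplete records.
    """
    if koos is None:
        return "No"                   # placeholder route for missing Koos
    if koos >= 3:                     # hypothetical Koos split
        return "Yes"
    if srt_db is not None and srt_db > 50:          # hypothetical SRT split (dB)
        return "Yes"
    if pta_diff_db is not None and pta_diff_db > 30:  # hypothetical PTA-difference split (dB)
        return "Yes"
    return "No"
```

Each leaf of the real tree would additionally carry its empirical accuracy (around 80% in the reported solution).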
These experimental results confirmed the applicability of the variable set CBR EXP FIN for the reliable predictions of VS surgery. The presented structural representation (i.e., the decision tree in Fig. 4) can help practitioners in a more informed analysis of diagnostic results.
CBR-supervised learning. The previous method gave a transparent set of significant variables and their values that can be used directly for the prediction of, or decision on, VS treatment. However, its result is generally ambiguous; furthermore, our intention to minimize the variable set as much as possible might lead to a certain loss of accuracy. For these reasons, we also decided to create a black-box-like solution based on an automated feature selector followed by a classifier. We identified and parametrized promising combinations of the six feature selectors with the six classifiers. As in the previous method, the CBR data set was split into the training, validation, and test partitions, and batch processed for all 36 combinations of feature selectors and classifiers. The classification accuracies are summarized in Table 5, which shows that the gradient boosting algorithm is on average the best performing algorithm in both data processing phases (i.e., it works best both as a feature selector and as a classifier). The globally best result was generated by its combination with a neural network (89%). The performance of the fixed expert selection of variables in the CBR EXP FIN set is also remarkable, particularly when followed by a gradient boosting classifier.
Full results of the three best performing combinations, including the slightly worse performance on the training and validation partitions, are shown in Table 6; the corresponding Receiver operating characteristic (ROC) curves are depicted in Fig. 5. Table 6 shows that the absolute test accuracies, as well as the biases and variances of the winning combinations, are sufficient for daily use.

Table 1. Predictors extracted from the CBR data, ordered according to their significance for each applied dimensionality reduction method. Irrelevant variables were rejected from the expert selection. Variables and metrics are expressed as follows (see "Methods"): (a) suffix 4 holds for the basic frequency range, suffix 8 for the full frequency range; (b) subscripts H and VS hold for the healthy or diseased ear, respectively, and subscript D for the difference of averaged PTA values between the two ears; (c) tailing abbreviations have the following meaning: AR average row-wise, IR intercept row-wise, SR slope row-wise.
To compare the results obtained from the traditional two-stage processing with those obtained from a complementary one-shot algorithm, we also processed the full CBR dataset with a deep learning algorithm. A set of experiments employing this technique was performed on a fully connected three-layered network. The layers included 38, 76, and 2 neurons with the rectified linear activation function. The network was trained with the gradient-descent back-propagation method. This paradigm resulted in the following best performance: ACC = 82%, PPV = 78%, TPR = 89%, TNR = 75%, AUC = 88%, ASE = 13%, which is slightly worse than the performance of the classifiers with separate feature selection and classification stages. This result was partially determined by the low cardinality of the processed dataset, as the deep learning approach is particularly suited to extensive multidimensional datasets.
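The reported 38-76-2 architecture with ReLU activation can be sketched as a plain forward pass. This is an illustrative NumPy reconstruction with randomly initialized weights, not the trained SAS model; the softmax output layer and the weight scaling are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes as reported in the paper: 38 inputs, 76 hidden ReLU units,
# 2 outputs (wait-and-scan vs. active treatment).
W1, b1 = rng.standard_normal((38, 76)) * 0.1, np.zeros(76)
W2, b2 = rng.standard_normal((76, 2)) * 0.1, np.zeros(2)

def forward(x):
    """One forward pass: ReLU hidden layer, softmax over the two classes."""
    h = np.maximum(x @ W1 + b1, 0.0)                   # rectified linear units
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)           # class probabilities

x = rng.standard_normal((1, 38))   # one synthetic 38-dimensional record
probs = forward(x)
```

In the actual experiments, the weights would be fitted by gradient-descent back-propagation on the training partition.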

PDA-dimensionality reduction and knowledge extraction.
While the CBR data set and the corresponding methods generated their predictions based only on a single medical checkup, the PDA data set takes into account the individual history of checkups for each patient. It is evident that the time-dependent development of the diagnostic variable values may bring important information into the decision process. Therefore, we repeated the same ranking and specification procedures described for the CBR data set on the PDA data set, in order to minimize the number of input variables and to obtain a transparent set of decision rules. The variables suggested by the feature selectors and the structure of the resulting expert set (PDA EXP) are shown in Table 7. Table 8 shows the detailed performance metrics of the gradually optimized variable set. As with the CBR data set, we again see the positive effect of a lower number of inputs on the overall performance, and the primary role of the size-oriented VS metrics. The decision tree constructed from the PDA EXP FIN variable set naturally suppressed both PTA-related indicators, as shown in Fig. 6 and Table 9. The result can be interpreted simply: if there is any change in the Koos classification from the previous checkup, surgery is recommended; if the Koos class remains unchanged, the size growth is checked, and if the trend is positive, surgery is indicated. Both identified variables are so significant that no other diagnostic procedures are necessary (not even the expertly identified PTA); nevertheless, if they are performed, the results can enhance the existing CBR knowledge base.
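The two-step rule just described is simple enough to state as code. The sketch below is a direct transcription of that interpretation; the function name and argument conventions (a positive `koos_change` meaning an increased Koos grade, a positive `size_growth` meaning a growing size trend) are our assumptions.

```python
def pda_decision(koos_change, size_growth):
    """Apply the PDA decision rule from the paper's interpretation.

    koos_change: change in Koos grade since the previous checkup (> 0 = increase).
    size_growth: trend of the 1D tumor size (> 0 = positive growth trend).
    """
    # Any increase in Koos classification recommends surgery.
    if koos_change > 0:
        return "active treatment"
    # Otherwise, a positive size-growth trend also indicates surgery.
    if size_growth > 0:
        return "active treatment"
    return "wait-and-scan"
```

Both branches depend only on the size-oriented metrics, matching the finding that no other diagnostic procedures are strictly necessary.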
PDA-supervised learning. The supervised PDA experiments suffered from the low number of samples and, consequently, the small size of the test set. Although this fact was efficiently compensated by the inherent dominance of both size-related variables, the test classification outputs were discretized into several levels, as is evident from Table 10. The overall weaker performance of the interactively selected set of features PDA EXP FIN was caused by its fixed and relatively wide structure in comparison with the other dimensionality reduction techniques. In this specific situation, the LASSO algorithm demonstrated the best average feature selection capabilities, and its main component, logistic regression, proved to be one of the most powerful classification algorithms overall. These conclusions correspond with the general knowledge concerning the classification of over-determined binary targets 37 , and were also confirmed by the detailed characteristics of the best performing algorithms for the PDA task, presented in Table 11.

Table 7. Predictors extracted from the PDA data set, ordered according to their significance for each dimensionality reduction method. Irrelevant variables were rejected from the expert selection. Variables and metrics are expressed as follows (see "Methods"): (a) suffix 4 holds for the basic frequency range, suffix 8 for the full frequency range; (b) subscripts H and VS hold for the healthy or diseased ear, respectively, and subscript D for the difference of averaged PTA values between the two ears; (c) tailing abbreviations have the following meaning: AR average row-wise, IR intercept row-wise, SR slope row-wise, AC average column-wise, IC intercept column-wise, SC slope column-wise, LD last difference, TD total difference.

Table 8. Performance of the gradually reduced set of variables for the PDA data set.
Accordingly, the PDA data analysis confirmed that the interim growth of the VS itself is the strongest, and a sufficient, predictor of VS surgery.

Discussion
Over recent years, several studies have addressed the possibility of predicting VS growth, or a change from a conservative to an active treatment [38][39][40][41][42][43][44][45][46][47][48] . Their outcomes are, however, ambiguous; some studies are inconclusive or fail to find any significant predictor of VS growth 38,45 . The majority of the previous results state that the tumor size and also the degree of vestibular disorder are the key variables which influence the switch from conservative to active treatment. The above-mentioned studies mostly analyzed the individual progress of symptoms, i.e., they worked in a manner similar to our PDA. Two studies specifically tested the hypothesis that VS growth could be predicted by the available data at diagnosis (i.e., an approach similar to our CBR); the study of Herwadker 49 found no significant predictors, while Wolbers et al. 50 identified a long duration of hearing loss and an intracanalicular localization of the tumor as the main predictors of a non-growing VS.
Here we present a novel approach to this issue, which uses semi-supervised machine-learning techniques to create, parametrize, and evaluate four different models for the prediction of active treatment of vestibular schwannoma:
(1) CBR-prediction from static variables:
a. an automated black-box classifier providing predictions given the input data;
b. a transparent set of rules (a decision tree) to support the decision on VS treatment.
(2) PDA-prediction from dynamic variables:
a. an automated black-box classifier providing predictions given the input data;
b. a transparent set of rules (a decision tree) to support the decision on VS treatment.
The models were trained, validated, and tested using different subsets of the source data, which means that their performances (accuracy etc.) represent realistic values obtained with unknown data. In the applied methods, we concentrated on preserving the original meaning of the individual attributes, so that they remain transparent and interpretable during the entire classification process. This means that we used no multiplicative or other nonlinear transformations, but instead employed only generalized linear models (LASSO, logistic regression, decision tree) and generalized (random) additive models, represented by the gradient boosting and random forest approaches. Although the latter two approaches are internally non-transparent, they still work with the original meaning of the attributes.
The major findings show that, using a simple decision tree, it is possible to predict VS treatment even from the static values of a few basic variables (Koos classification, speech reception threshold, and pure tone audiometry), with approximately 80% accuracy. A higher accuracy (89%) can be achieved using a black-box classifier on the static data. From the dynamic point of view, we found that VS treatment can be predicted using the dynamics of solely size-oriented variables (Koos classification and 1D size), both with a decision tree and with a black-box classifier; the prediction accuracy is slightly higher than that of the CBR approach.
Besides the prediction mechanisms themselves, our analyses also indicate that only the pure-tone hearing thresholds in both ears, the speech reception threshold in the diseased ear, and the Koos classification are necessary at the first checkup (these variables are used in the static predictions), while during the subsequent follow-up, mainly the size-derived metrics and their dynamics play a role in the decision process. These findings might help to make the procedures related to the monitoring and treatment of VS patients more time- and cost-efficient, by eliminating the unnecessary measurements.
Supervised feature selection. The selection of the most important variables is essential in classification tasks where the number of available samples is comparable with the number of input variables, as over-fitted structures are characterized by the poor classification of unknown samples and low generalization ability [51][52][53][54] . Considering that both the CBR and PDA tasks belong to this category, an initial reduction of dimensionality was unavoidable. Employing the outputs of five dimensionality reduction techniques, we manually performed an expert selection of the most significant features. We believe that the final selection, numerically outperforming the initial configuration, optimally characterizes the key diagnostic symptoms, based on which a reliable VS surgery decision can be made at the earliest possible time.
Supervised learning and classification. The supervision in learning lies in the fact that the searched discrimination function is built from samples with a priori known output membership. In contrast to the dimensionality reduction, the internal interpretability of the learned classifier is not required, which results in its black-box-like nature. The previously introduced tree- and regression-based techniques were re-used for the selection of significant variables; however, as opposed to the manual interpretation of their results performed in the feature selection process, this first stage was followed here by a learned classification algorithm.
The main mission of the classification task is the best performing inference, i.e., an accurate assignment of real-world clinical data to the predefined classes (in our case, wait-and-scan versus active treatment). Such black-box-like solutions are widely accepted in practice nowadays, especially in connection with deep learning applications 55 . Moreover, the user can still interact even with the nontransparent classifiers, and analyze their responses using manually adjusted inputs. An optimal classifier was selected as the best performing combination of a feature selection technique with a learned classifier. For the CBR data, it was found to be the combination of gradient boosting and a neural network; in the case of the PDA data set, the optimal performance was achieved using the combinations logistic regression/neural network, decision tree/logistic regression, or gradient boosting/logistic regression.

Table 11. Detailed metrics for the three best performing classifiers on the PDA data set.

Classifier (feature selector)            | ACC | PPV | TPR | TNR | AUC | ASE | ACC | PPV | TPR | TNR | AUC | ASE
Neural network (logistic regression)     |  85 |  82 |  91 |  79 |  88 |  14 |  90 |  88 |  95 |  86 | 100 |  10
Logistic regression (decision tree)      |  85 |  82 |  91 |  79 |  81 |  14 |  90 |  88 |  95 |  86 | 100 |  10
Logistic regression (gradient boosting)  |  85 |  82 |  91 |  79 |  80 |  15 |  90 |  88 |  95 |  86 | 100 |  -
Potential limitations of our study and future directions. We are aware of the potential limitations of our study. Firstly, although we assembled a relatively large amount of data from our participants, the final cleaned set contained a smaller number of records, due to inconsistencies in the examinations over the years (especially for ABR and OAE, as they were often not present at the initial examination) and the unavailability of some variables in some of the records. A lower number of records may decrease the performance of the model, yet it avoids the biases resulting from the use of incomplete or potentially incorrect data. In the current analyses we focused primarily on the audiometric data, although information about potential vestibular pathology could be added to the decision-making process in the future. Secondly, we omitted the patients' subjective input to avoid any subjectivity in the data set; however, our clinical experience shows that a subjective worsening of symptoms (which does not necessarily match the objective measurements) might be a strong factor influencing the decision about further VS treatment. Thirdly, our approach to VS treatment is not based purely on objective measures, but also on the patients' preferences and expectations, and on the surgeons' experience and skill level; the presented model is therefore not expected to replace those inputs, but to support the decision on whether to opt directly for surgery or to wait and scan. Based on our results, the future perspective of our research using the supervised machine learning approach is the inclusion of not only the audiometric but also the vestibular data from our subjects, which would lead to an even more comprehensive prediction model of VS behavior.
The conclusions formulated from supervised learning will be further enhanced with unsupervised analyses, including the linear and nonlinear clustering of data and variables, applied to the full-dimensional data set.

Conclusions
Using semi-supervised machine-learning algorithms complemented with expert (manual) interactive analyses, we developed practical tools to support the decision process related to the treatment of vestibular schwannomas. These tools comprise simple decision rules (decision trees) for both static and dynamic data, offering an accuracy of around 80%, and automated black-box classifiers offering even better performance. Our results indicate that from the initial data obtained at diagnosis (the size of the tumor (Koos classification and 1D size in T2-weighted MRI), speech perception (described by the SRT), and the pure tone average), it is possible to predict the need for active VS treatment. Furthermore, we propose minimum sets of diagnostic variables which are crucial for deciding on VS treatment. Overall, these findings can be used to make the diagnostic and decision-making procedures more time- and cost-efficient, by focusing on the important metrics and eliminating the unnecessary measurements.

Data availability
Data are available from the authors upon request.