Introduction

Most biomedical data are rectangular in shape, including those from ‘omics studies, large cohorts, population studies, and surveys, and these datasets typically contain few missing data. An increasing number of human genomic and survey datasets have been produced in recent years [1]. Rigorous analyses of these large datasets demand considerably more efficient and more accurate algorithms, whose relative performance is poorly understood.

Machine learning (ML) algorithms aim to produce a model that can be used to perform classification, prediction, estimation, or any other similar task [2, 3]. The unknown dependencies/associations are estimated from a given dataset and can later be used to predict the output of a new system or dataset [2,3,4]. Therefore, ML algorithms have been used to analyze large biomedical datasets [5,6,7]. Studies have compared the accuracy of several ML algorithms for classifying microarray or genomic data [8,9,10] and showed a superior performance of random forests (RF). However, only a few studies assessed the accuracy of ML algorithms in classifying multi-category outcomes, which are important for a more in-depth understanding of biological and clinical processes. Only when we can accurately predict or identify the factors linked to multi-category outcomes can we provide patients with more targeted prevention and treatment for the most likely outcome. Hence, we aimed to understand the performance and efficiency of RF, decision tree (DT), artificial neural networks (ANN), and support vector machine (SVM) algorithms in classifying multi-category outcomes of rectangular-shaped biomedical datasets.

As an example, we used a large population-based breast cancer dataset with long-term follow-up data to create a large rectangular dataset. The reasons were: (1) cancer is one of the most common diseases [11], and knowledge gained by studying cancer can be easily generalized to other biomedical fields; (2) breast cancer is the second most common cancer in U.S. women [11, 12] and would provide a sufficiently large sample size (n = 45,085); (3) breast cancers diagnosed in 2004 had 12 years of follow-up on average, and very few missing outcomes/follow-ups were anticipated; (4) breast cancers diagnosed in 2004 would have a moderately good prognosis, and their outcomes would be rather diversified (i.e., many patients might not die of breast cancer). Therefore, this study was designed to systematically compare the performance and efficiency of DT, RF, SVM, and ANN algorithms in classifying multi-category causes of death (COD) in a large biomedical dataset (breast cancers).

Methods

Data analysis

We obtained individual-level data from the Surveillance, Epidemiology, and End Results-18 (SEER-18) (www.seer.cancer.gov) SEER*Stat database with treatment data using SEER*Stat software (Surveillance Research Program, National Cancer Institute SEER*Stat software (seer.cancer.gov/seerstat) version <8.3.6>) as we did before [13,14,15]. SEER-18 is the largest SEER database, including cases from 18 registries and covering nearly 30% of the U.S. population. The SEER data were de-identified and publicly available. Therefore, this was an exempt study (category 4) and did not require an institutional review board (IRB) review. All incident invasive breast cancers in SEER-18 diagnosed in 2004 were included and were followed up to December 2016. Individual deaths were verified via certified death certificates (2019 data release). We chose the diagnosis year of 2004 in consideration of the implementation of the sixth edition of the tumor, node, and metastasis staging manual (TNM6) of the American Joint Committee on Cancer (AJCC) in 2004. Moreover, we only included the primary-cancer cases that had a survival time >1 month, age of 20+ years, and known COD.

The features (i.e., variables) were dichotomized for greater efficiency and slightly better performance [7], while RF models were also tuned and analyzed on the actual values. A total of 54 features were included (Supplementary Table 1). We conducted correlation analyses and produced a correlation matrix (Supplementary Fig. 1) to identify the closely correlated factors. The outcome of the classification models was the patient’s five-category COD. The COD was originally classified using SEER’s recodes of the causes of death, which were collected through death certificates of deceased patients (https://seer.cancer.gov/). We simplified the SEER COD into five categories based on the prevalence of COD [16,17,18]: alive, non-breast cancer, breast cancer, cardiovascular disease (CVS), and other cause.
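As an illustration of the dichotomization step, the following minimal Python sketch turns one categorical variable into binary indicator features. It is not the code used in this study (the analyses were run in MATLAB); the function name `one_hot` and the tumor-grade values are hypothetical and serve only to show the idea.

```python
def one_hot(values, categories=None):
    """Dichotomize a categorical variable into one binary column per category.

    Returns the list of category names (the new feature names) and the
    encoded rows, where each row has a single 1 marking the observed category.
    """
    if categories is None:
        categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical tumor-grade values for four patients (illustration only).
grades = ["G1", "G3", "G2", "G3"]
columns, encoded = one_hot(grades)
for grade, row in zip(grades, encoded):
    print(grade, dict(zip(columns, row)))
```

Each original variable thus expands into as many binary features as it has categories, which is how a modest number of variables can yield a larger set of dichotomized features.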

The most common task of ML techniques in the learning process is classification [19]. Tenfold cross-validation was used to tune all models, a step also termed model optimization (see Supplementary methods), as described before [20, 21]. This approach is an approximation of leave-p-out cross-validation; by repeating the process, it reduces variance and is less sensitive to the partitioning of the data than the holdout method. Briefly, the samples were randomly divided into ten same-size subsamples, of which one served as the validation set and the other nine as the training set. The (cross-)validation was performed ten times, with the validation subsample rotated among the ten subsamples. The mean of the ten cross-validation performances was calculated and used as the performance of the model. Multinomial logistic regression (MLR) and the ML analyses were carried out using MATLAB (version 2018a, MathWorks, Natick, MA).
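The tenfold procedure can be sketched in a few lines of Python. This is a generic illustration rather than the MATLAB code used here: `k_fold_indices`, `cross_validate`, and the `fit_and_score` callback are hypothetical names, and each fold in turn is held out for validation while the remaining folds are used for fitting.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random partition of the samples
    folds = [indices[i::k] for i in range(k)]     # k folds of (nearly) equal size
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def cross_validate(n_samples, fit_and_score, k=10, seed=0):
    """Average the score returned by fit_and_score(training, validation) over k folds."""
    scores = [fit_and_score(training, validation)
              for training, validation in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

Here `fit_and_score` stands for any routine that fits a model on the training indices and returns its accuracy on the validation indices; the mean over the ten folds is then reported as the model performance.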

Model tuning

The detailed model tuning process is described in the Supplementary material. Several DT methods, such as CHAID, CART, and exhaustive CHAID, are available with MATLAB [22]. We used the classification and regression tree (CART) method to predict the categories, using the Gini index as the split criterion and 100 iterations for each run (Supplementary Fig. 2). There are no default RF packages/toolboxes in MATLAB’s own toolbox. We thus used the Randomforest-Matlab open-source toolbox developed by Jaiantilal et al. [23, 24].
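For readers unfamiliar with the Gini index used as the CART split criterion, a minimal Python sketch is given below. It assumes the standard definition of Gini impurity (one minus the sum of squared class proportions) and is not taken from the MATLAB implementation; the function names are illustrative.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left, right):
    """Size-weighted Gini impurity of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
```

A candidate split is preferred when its weighted impurity is lower; a split that separates the classes perfectly has a weighted impurity of zero.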

In tuning RF models, the parameter nTrees sets the number of trees in a random forest and may have an impact on the classification results. We varied nTrees from 1 to 600; the results showed that, provided the value was not too small (greater than ten), the classification accuracy reached 69–70% (Supplementary Fig. 3). Therefore, we set this parameter to 136 in RF-based analyses. The value of Mtry for the best model performance was identified by varying the parameter from 1 to 20 in steps of 1. We found that 5 was the best value, which was indeed consistent with the default value generated using the MATLAB model’s default (i.e., Mtry = floor(sqrt(size(P_train,2)))).

Among the different training algorithms for ANN, we used the Trainscg algorithm because it is the only conjugate gradient method that does not require a line search. The number of input layer nodes is 54 and the number of output layer nodes is 6. To tune the ANN model, we conducted experiments with either a single hidden layer or double hidden layers, with node numbers ranging from 5 to 100. According to the tuning results of accuracy and mean squared error (MSE), we set the model to double hidden layers and chose the number of nodes that gave the highest accuracy and lowest MSE.

We used the multi-class error-correcting output codes (ECOC) model for the SVM modeling, which allows classification into more than two classes, and the MATLAB fitcecoc function, which creates and adjusts the template for SVM [25]. The kernel functions considered in the SVM were linear, radial basis function, Gaussian, and polynomial.
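To make the ECOC idea concrete, the sketch below builds a one-vs-one coding matrix for the five COD categories, matching the one-vs-one approach reported in the Results. It is a Python illustration of the coding design only, not of fitcecoc itself; the function name `one_vs_one_coding` is hypothetical.

```python
from itertools import combinations

CLASSES = ["alive", "non-breast cancer", "breast cancer", "CVS", "other cause"]


def one_vs_one_coding(classes):
    """Build a one-vs-one coding matrix.

    Each column is one binary learner: +1 marks its positive class,
    -1 its negative class, and 0 the classes it ignores.
    """
    pairs = list(combinations(range(len(classes)), 2))
    matrix = [[0] * len(pairs) for _ in classes]
    for column, (positive, negative) in enumerate(pairs):
        matrix[positive][column] = 1
        matrix[negative][column] = -1
    return pairs, matrix


pairs, coding = one_vs_one_coding(CLASSES)
for name, row in zip(CLASSES, coding):
    print(f"{name:>18}: {row}")
```

With K classes, this design trains K(K − 1)/2 binary SVMs, so five categories require ten binary learners, each trained only on the samples of its two classes.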

Performance analysis

We analyzed the performance metrics of each proposed model, including accuracy, recall, precision, F1 score, and specificity [26, 27]. They were defined as follows:

$$\mathrm{Accuracy} = \frac{{\mathrm{TP + TN}}}{{\mathrm{All}}}$$
(1)
$$\mathrm{Precision} = \frac{{\mathrm{TP}}}{{\mathrm{TP + FP}}}$$
(2)
$$\mathrm{Recall} = \frac{{\mathrm{TP}}}{{\mathrm{TP + FN}}}$$
(3)
$$F1 = \frac{{2 \ast \mathrm{Precision} \ast \mathrm{Recall}}}{{\mathrm{Precision + Recall}}}$$
(4)

True positive (TP) and true negative (TN) were defined as the numbers of samples that were classified correctly as belonging and not belonging to a given class, respectively. False positive (FP) and false negative (FN) were defined as the numbers of samples that were misclassified into or out of that class [26, 27]. The specificity or true negative rate (TNR) is defined as the proportion of negative samples that are correctly identified as negative:

$$\mathrm{Specificity} = \mathrm{TNR} = \frac{{\mathrm{TN}}}{{\mathrm{TN + FP}}}$$
(5)
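For a multi-category outcome, these quantities are usually computed per class by treating that class as positive and all other classes as negative. The following Python sketch shows this one-vs-rest bookkeeping on a confusion matrix; it is an illustration under that assumption, with hypothetical function names, and it returns NaN when a denominator is zero (for example, when a class is never predicted).

```python
def per_class_metrics(confusion, k):
    """Metrics for class k, where confusion[i][j] counts true class i predicted as j."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                    # class k predicted as something else
    fp = sum(row[k] for row in confusion) - tp     # other classes predicted as k
    tn = total - tp - fn - fp

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),
    }


def overall_accuracy(confusion):
    """Fraction of all samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total
```

Note that a class with no predicted samples has an undefined precision, and therefore an undefined F1 score, which is why F1 cannot always be reported for small classes.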

The receiver operating characteristic (ROC) curve is a graph in which recall is plotted as a function of 1-specificity. It can more objectively measure the performance of the model itself [25]. The model performance was also evaluated using the area under the ROC curve, which is denoted the area under the curve (AUC). An AUC value close to 1 indicates a high-performance model, while an AUC value close to 0.5 indicates performance no better than chance [28, 29]. AUC is independent of the class prior distribution, class misclassification cost, and classification threshold, and can therefore more stably reflect the model’s ability to rank samples and characterize the overall performance of the classification algorithm. The formula used to determine the AUC can be written as follows [29, 30]:

$$\mathrm{AUC} = \frac{{{\sum} {\mathrm{TP} + {\sum} {\mathrm{TN}} } }}{{P + N}}$$
(6)

where P is the total number of positive-class samples and N is the total number of negative-class samples.
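As a complementary illustration, AUC is also commonly interpreted as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample. The Python sketch below implements that pairwise formulation, which is a different route from Eq. 6 and is not the computation used in this study; the function name `pairwise_auc` is hypothetical, and labels are assumed to be coded as 1 (positive) and 0 (negative).

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in positives for n in negatives)
    return correct / (len(positives) * len(negatives))
```

For a five-category outcome, one such value can be computed per class by scoring that class against all others.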

Dimension reduction based on the information entropy and information gain

Information entropy is an indicator to measure the purity of the sample set. The formula is as follows:

$$H(S) = - \mathop {\sum}\limits_{i = 1}^n {p_i\log } p_i$$
(7)

Information gain measures how much information a feature brings to the classification system. The more information it brings, the more important the feature. The information gain IG(A, S) is the difference in entropy before and after the set S is split on an attribute A, and is defined as follows [31]:

$${\mathrm{IG}}(A,S) = H(S) - \mathop {\sum}\limits_{t \in T} {p(t)H(t)}$$
(8)

where H(S) is the entropy of set S, T is the set of subsets created by splitting set S on attribute A such that \(S = \bigcup_{t \in T} t\), p(t) is the proportion of the number of elements in t to the number of elements in set S, and H(t) is the entropy of subset t.
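Equations 7 and 8 translate directly into code. The Python sketch below is a minimal illustration, assuming a base-2 logarithm (the base is not specified above) and categorical feature values; the function names are illustrative and this is not the MATLAB code used in the study.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i), with p_i the proportion of class i in S."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(A, S) = H(S) - sum_t p(t) * H(t), splitting S on the values of feature A."""
    subsets = defaultdict(list)
    for value, label in zip(feature_values, labels):
        subsets[value].append(label)
    n = len(labels)
    remainder = sum((len(subset) / n) * entropy(subset)
                    for subset in subsets.values())
    return entropy(labels) - remainder
```

A binary feature that separates two equally frequent classes perfectly has a gain equal to the full entropy of the set, whereas a feature unrelated to the outcome has a gain of zero.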

The information gain of a feature indicates how much information it brings to the classification system and can be used as a feature weight. When the model uses more features, the classification time is longer, but arbitrarily removing features will likely reduce classification accuracy. Therefore, we screened the features based on the calculated information gains to achieve a balance between run time and classification accuracy. We then step-wise deleted the features with the least information gain, which were likely the least important. Because DT and RF were both tree-based algorithms and had similar performances, we conducted dimension reduction with RF, ANN, and SVM models and expected similar results with DT models.
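The step-wise deletion can be sketched as a simple backward-elimination loop. The Python code below is only an illustration: the stopping rule (stop when accuracy falls by more than `max_drop` from the full-model baseline), the `evaluate` callback, and the function name are assumptions introduced here, not the procedure documented in the Supplementary methods.

```python
def stepwise_elimination(gains, evaluate, max_drop=0.005):
    """Remove the lowest-gain feature one at a time while accuracy is maintained.

    gains:    dict mapping feature name -> information gain
    evaluate: callable taking a list of feature names and returning accuracy
    max_drop: largest tolerated loss of accuracy relative to the full model
              (an assumed stopping rule, for illustration only)
    """
    kept = sorted(gains, key=gains.get, reverse=True)   # highest gain first
    baseline = evaluate(kept)
    while len(kept) > 1:
        candidate = kept[:-1]                           # drop the lowest-gain feature
        if baseline - evaluate(candidate) > max_drop:
            break
        kept = candidate
    return kept
```

In practice, `evaluate` would wrap a cross-validated fit of the RF, ANN, or SVM model on the selected features, so that both accuracy and run time can be recorded at each step.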

Results

Dataset characteristics and the model tuning

Of the 52,818 samples, we step-wise excluded 5294 (~10%) cases missing tumor-grade data (_grade), 1770 (3.4%) missing survival time, 352 (0.6%) missing laterality data (lat_bi), and 317 (0.6%) missing TNM6 N category data. Overall, we included 45,085 cases of breast cancer that were diagnosed in 2004, recorded in SEER-18, and had no missing data in any feature (Table 1). The variables were dichotomized into 54 features. Likely due to the unique nature of the pathological and socioeconomic factors, the correlation analyses showed that only a few features had a correlation coefficient >0.95 or <−0.95 (Supplementary Fig. 1).

Table 1 Baseline characteristics of the women with breast cancer diagnosed in 2004 that were included in this study.

For DT models, the DT performed best with a minimum leaf size in the range of 20–300, peaking at 93 (Supplementary Fig. 2). Therefore, the minimum number of samples contained in a leaf node was set to 93 for optimization. The overall classification accuracy could reach 67.3%, which was about 4% higher than that of the original DT, and the cross-validation error also decreased. However, due to the uneven distribution of the data samples, some categories such as non-breast cancer and other cause were pruned, causing the loss of some data information.

For RF models, the parameter nTrees sets the number of trees in an RF and may have an impact on the classification results. Mtry is the number of features randomly sampled as candidates at each split and was optimized (Supplementary methods). After tuning the models, we set the nTrees parameter to 136 in RF-based analyses with the best Mtry node value of 6 (Supplementary Fig. 3).

According to the tuning results of accuracy and MSE in ANN models, when the number of nodes was greater than 20, the models’ performance appeared to stabilize (Table 2). Therefore, we set the model to double hidden layers with 40 nodes.

Table 2 Performance of artificial neural network models with various numbers of layers.

For the SVM models of linear, radial basis function, Gaussian or polynomial function, we found the linear kernel function had the highest accuracy (69.06%) and shortest run-time (175.50 s, Table 3), with a one-vs-one approach.

Table 3 Classification accuracies and efficiencies of support vector machine models by the kernel function.

Performance analysis results

Based on the confusion matrices (Fig. 1), the 4 ML models appeared to have similar performance (Table 4), and all were more accurate than the MLR. The best classification accuracy of DT, RF, ANN, and SVM models in this study was 69.21%, 70.23%, 70.16%, and 69.06%, respectively, all higher than the 68.12% of a conventional statistical algorithm (MLR). However, accuracy alone is not enough to evaluate the pros and cons of a model. The values of recall, TNR, and F1 specifically reflect the classification of each category. Further comparison of the metrics among the four models showed that the precision, recall, and F1 in classifying breast cancer by the RF model were better than those of the DT, ANN, and SVM models. This is mainly because the RF model is an ensemble learning algorithm; its voting mechanism can balance out a certain amount of error. Compared with the ML algorithms, MLR had lower overall accuracy, a much lower recall (1.13% vs ~50%), and a lower specificity (Table 4) for predicting death from breast cancer.

Fig. 1: Confusion matrices of the 4 tuned machine learning models.

There were similar performance metrics of decision tree (A), random forest (B), artificial neural networks (C), and support vector machine (D) models.

Table 4 The performance of the decision tree, random forests, artificial neural networks, and support vector machine models.

ROC analysis results

After a comparative analysis of the above performance metrics, we found that the RF model was superior to the DT, ANN, and SVM models. However, in the classification of other causes, the ANN model had a higher recognition rate than the RF model, and the F1 values of non-breast cancer could not be calculated. Therefore, we used the ROC curve to further analyze non-breast cancer and other cause in RF and ANN models.

The AUCs of the 5 COD in the RF model were overall lower than those in the ANN model (Fig. 2). Because the AUC index can measure the performance of the model more objectively, the results show that the overall performance of the ANN model is similar to or better than that of the RF model.

Fig. 2: The receiver operating characteristic curves of the tuned random forests (RF) and artificial neural networks (ANN) models by 5 causes of death.

The areas under curve (AUCs) in the RF model were overall lower than those in the ANN model.

Dimension reduction based on the information entropy and information gain

This dataset had 54 categorical/binary features. Based on the training datasets, the information entropy and information gain of 54 features were calculated (Supplementary Table 3). Using information entropy and information gain, we obtained the following important features: age >65; surgery; TNM6 metastasis subgroup1; AJCC stage 4; TNM6 metastasis subgroup2; surgery–other; TNM6 tumor subgroup1; TNM6 tumor subgroup2; AJCC stage 1; surgery-lumpectomy; TNM6 lymph node subgroup5; and TNM6 lymph node subgroup1. Then we characterized the key features in a step-wise fashion (Supplementary Fig. 4).

We successfully reduced the data dimension based on information gain and shortened the run times in RF, ANN, and SVM models, while maintaining the overall classification accuracy (Table 5 and Supplementary Tables 35). Removal of features with low information gain (0.0000–0.0005) in RF models led to slight increases in the accuracy of the alive class and in the overall accuracy (about 0.5% and 0.3%, respectively), while the accuracy in the CVS and breast cancer classes did not change significantly; the classification accuracy of CVS was always 100%. The non-breast cancer accuracy was slightly reduced, and its classification was unstable. Therefore, the features with low information gain (0.0000–0.0005) may be considered redundant and deleted from the models, while the running times were significantly reduced. We also found similar changes in ANN and SVM models.

Table 5 Information-gain-based dimension reduction in the three machine learning models.

Feature importance in RF models using the data of conventional encoding or one-hot encoding

Our previous work has shown that one-hot encoding (dichotomization of features) led to a slight increase in the prediction accuracy of the RF model on prostate cancer using Stata [7]. Consistent with that finding, our tuned RF model on actual values had a prediction accuracy of 69.70%, which was slightly lower than the accuracy of 70.23% on dichotomized features (Supplementary Table 6). However, the top-five important features were different in the two models, except age 65+ years (vs <65 years, Supplementary Figs. 5 and 6).

Discussion

We here compared the performance and efficiency of DT, RF, ANN, and SVM in classifying five-category outcomes of a large rectangular database (54 features and ~45,000 samples). The accuracy in classifying five-category COD with DT, RF, ANN, and SVM was 69.21%, 70.23%, 70.16%, and 69.06%, respectively. It is noteworthy that classifying five-category outcomes is much more difficult than classifying binary outcomes, since it depends on four sequential classification processes in one-vs-all (one-vs-rest) algorithms (i.e., the final accuracy is the product of the accuracies of the four steps). Moreover, based on the information entropy and information gain of feature values, we could reduce the number of features in a model to 32 and maintain a similar accuracy, while the running time decreased from 55.57 s for 54 features to 25.99 s for 32 features in RF, from 12.92 s to 10.48 s in ANN, and from 175.50 s to 67.81 s in SVM. The DT algorithm was not tested after dimension reduction because of its lower performance than RF and its tree-based theoretical framework, which is similar to that of RF.
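The run-time figures quoted above can be expressed as relative reductions and speed-up factors. The short Python sketch below simply performs this arithmetic on the reported times; the variable and function names are illustrative.

```python
# Run times in seconds before (54 features) and after (32 features)
# dimension reduction, as reported in the text.
RUN_TIMES = {
    "RF": (55.57, 25.99),
    "ANN": (12.92, 10.48),
    "SVM": (175.50, 67.81),
}


def relative_reduction(before, after):
    """Fraction of the original run time saved after dimension reduction."""
    return (before - after) / before


for model, (before, after) in RUN_TIMES.items():
    saved = relative_reduction(before, after)
    print(f"{model}: {saved:.1%} shorter run time, {before / after:.2f}x speed-up")
```

On these figures, the run time fell by roughly half for RF, by about one-fifth for ANN, and by about three-fifths for SVM.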

To our knowledge, few studies have investigated the dimension reduction of multi-category classification. A two-stage approach has been shown to effectively and efficiently select features for balanced datasets [32], but no specific reduction in run time was reported. We here show that dimension reduction and efficiency improvement can be achieved by removing features of low to medium information gain (<0.0005) in RF, ANN, and SVM models, which apparently has little effect on the overall classification performance. Such a strategy may be applied to other ML models in classifying unbalanced large rectangular datasets, while caution should be used when classifying outcomes in a balanced dataset.

Consistent with a previous report [7], one-hot encoding (i.e., dichotomization of the features) produced slightly better prediction accuracy in our data than conventional encoding and also has the advantage of not needing normalization. The top-five important features in the models based on the dichotomized and actual-value features differed considerably except age. This finding further confirms the important role of age in modeling five-category COD. However, five of the top ten important features were shared by these two models, including advanced age, pathologic staging, surgery status, N category, and histology type. These features thus should be carefully examined, recorded, and considered for predicting or preventing five-category COD in breast cancer patients.

The four included ML algorithms each have their own theoretical frameworks. DT is a logic-based ML approach [33]. The structure of the DT is similar to a flowchart. Using top-down recursion, the classification tree produces the category output. Starting from the root node of the tree, property values are tested and compared at each internal node to determine the corresponding branch, until a conclusion is reached in a leaf node of the DT. This process is repeated at each node of the tree by selecting the optimal splitting features until the cut-off value is reached [34]. A leafy tree tends to overtrain, and its test accuracy is often far less than its training accuracy. By contrast, a shallow tree can be more robust and easier to interpret [35].

DT works by learning simple decision rules extracted from the data features. RF, by contrast, is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [36]. When RF is used as a classification algorithm, the deeper the trees are, the more complex the decision rules and the more closely the model fits the data [36, 37]. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Random decision forests overcome the overfitting problem of DTs and are more robust with respect to noise.
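The way a forest combines its trees can be illustrated with a minimal majority-vote sketch in Python. This shows only the aggregation step, with hypothetical per-tree predictions; it is not the Randomforest-Matlab implementation, and ties are resolved here in favor of the label that appears first among the tied votes.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree predictions into one label per sample by majority vote.

    tree_predictions: list with one entry per tree, each a list of predicted
    labels of the same length (one label per sample).
    """
    n_samples = len(tree_predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(tree[i] for tree in tree_predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted


# Hypothetical predictions of three trees for two patients.
trees = [
    ["alive", "CVS"],
    ["alive", "breast cancer"],
    ["breast cancer", "breast cancer"],
]
print(majority_vote(trees))
```

Because individual trees err on different samples, the vote can cancel part of their errors, which is the intuition behind the robustness of RF to noise.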

The ANN technique is an artificial intelligence tool. It is a mathematical model that imitates the behavioral characteristics of animal neural networks [38]. This kind of network carries out distributed parallel information processing by adjusting the connections between a large number of internal nodes. After repeated learning and training, the network parameters corresponding to the minimum error are determined, and the ANN model classifies the output automatically from the dataset.

SVM is another popular ML tool based on statistical learning theory, which was first proposed by Vapnik and his colleagues [39, 40]. Unlike traditional learning methods, SVMs are approximate implementations of structural risk minimization methods. The input vector is mapped to a high-dimensional feature space through some kind of non-linear mapping which was selected in advance. An optimal classification hyperplane is constructed in this feature space, to maximize the separation boundary between the positive and negative examples [39, 40]. Support vectors are the data points closest to the decision plane, and they determine the location of the optimal classification hyperplane.

The 4 ML algorithms had different strengths and weaknesses, while all outperformed the conventional statistical algorithm (MLR). The RF algorithm in our study seems to have the best overall performance because it did not fail to classify any COD and had the best overall accuracy. Despite the similar classification accuracy (~70%), the DT algorithm could not accurately classify the non-breast cancer group. Given the similar theoretical framework, we did not assess its performance after dimension reduction. The ANN algorithm in our study was the most efficient before and after dimension reduction. Surprisingly, we also noticed a small increase in accuracy after dimension reduction, which warrants further investigation. The SVM algorithm in our study appeared very sensitive to the subgroup size (i.e., number of samples) and was not able to classify two of the five COD, although it also had acceptable classification accuracy.

The study’s limitations should be noted when applying our findings to other databases. First, this type of rectangular database is typical in surveys and population studies, but not so in computational biology. The major difference is the large p in ‘omics datasets vs the large n in epidemiological datasets, which refer to the feature number and sample number, respectively. Second, some of the outcomes were not accurately classified, likely owing to the unbalanced outcome distribution. On the other hand, such an undesired situation reflects real-world evidence/experience. Further studies are needed to improve the classification accuracy in the classes with fewer samples. Third, our work was exclusively based on the MATLAB platform and may not be applicable to other platforms such as R or Python. Fourth, the molecular subtype of the cancer was not available due to the lack of human epidermal growth factor receptor 2 (Her2) data. Fifth, the prediction accuracy of the ML algorithms was moderately acceptable (~70%), although ML had much better recall and specificity than MLR. The challenge in predicting multi-category outcomes remains outstanding, despite the use of tuned ML algorithms. Thus, future work is needed to improve the prediction performance. Finally, we should ideally use a large database to validate our models, but it is very difficult to curate another large database that is similar to the SEER database and apply the tuned models to it.

In summary, we here show that RF, DT, ANN, and SVM algorithms had similar accuracy, but outperformed MLR, for classifying multi-category outcomes in this large rectangular dataset. Dimension reduction based on information gain will significantly increase the model’s efficiency while maintaining classification accuracy.