Machine-learning Approach for the Development of a Novel Predictive Model for the Diagnosis of Hepatocellular Carcinoma

Because of its multifactorial nature, predicting the presence of cancer using a single biomarker is difficult. We aimed to establish a novel machine-learning model for predicting hepatocellular carcinoma (HCC) using real-world data obtained during clinical practice. To establish a predictive model, we developed a machine-learning framework which developed optimized classifiers and their respective hyperparameter, depending on the nature of the data, using a grid-search method. We applied the current framework to 539 and 1043 patients with and without HCC to develop a predictive model for the diagnosis of HCC. Using the optimal hyperparameter, gradient boosting provided the highest predictive accuracy for the presence of HCC (87.34%) and produced an area under the curve (AUC) of 0.940. Using cut-offs of 200 ng/mL for AFP, 40 mAu/mL for DCP, and 15% for AFP-L3, the accuracies of AFP, DCP, and AFP-L3 for predicting HCC were 70.67% (AUC, 0.766), 74.91% (AUC, 0.644), and 71.05% (AUC, 0.683), respectively. A novel predictive model using a machine-learning approach reduced the misclassification rate by about half compared with a single tumor marker. The framework used in the current study can be applied to various kinds of data, thus potentially become a translational mechanism between academic research and clinical practice.

www.nature.com/scientificreports www.nature.com/scientificreports/ machine-learning framework to establish the most appropriate model depending on the applied data, and (2) to apply this framework to existing data from HCC patients to develop an appropriate model for HCC prediction.

Materials and Methods
Patients. From all the patients who visited the liver clinic at the University of Tokyo Hospital between January 1997 and May 2016, we extracted 4242 patients (1311 HCC patients and 2931 non-HCC patients) for whom information on the presence (or absence) of HCC was available and who had undergone laboratory testing on at least one occasion. All the patients in the HCC-positive group had been diagnosed as having HCC at the time of their first visit and had received initial treatment at our institution. Patients who subsequently developed HCC during the follow-up period for chronic liver disease were included in the HCC-negative group in the current study. Patients for whom information on AFP, AFP-L3, DCP, AST, ALT, platelet count, alkaline phosphatase (ALP), gamma-glutamyl transferase (GGT), albumin, TB, age, sex, height, body weight, hepatitis B surface (HBs) antigen, and hepatitis C virus (HCV) antibody status were available were selected. Finally, we included 539 HCC patients and 1043 non-HCC patients with the required information in the current analysis.
The current study was performed in accordance with the ethical guidelines of the Declaration of Helsinki. This research project was approved by the ethics committee of the University of Tokyo (approval number, 11474). Informed consent was obtained in the form of an opt-out on the website. Patients who rejected participation in our study were excluded. The study design was also included in a comprehensive protocol for retrospective studies and was approved by the ethics committee of the University of Tokyo (approval number, 2058).

Diagnosis of HCC.
Hepatocellular carcinoma was diagnosed using dynamic computed tomography (CT) imaging, with hyper-attenuation during the arterial phase and washout during the late phase regarded as a definite sign of HCC 22 . When a definite diagnosis of HCC could not be made using CT, an ultrasound-guided tumor biopsy was performed and the pathological diagnosis was based on the Edmondson-Steiner criteria 23 .

Development of graphical user interface machine-learning framework.
To establish a predictive model, we developed a graphical user interface machine-learning framework using R version 3.4.3 (http://www.r-project. org) and the Shiny and Caret packages. The model had two main components. The first component consisted of the establishment of an algorithm. Comma-separated values (CSV) dataset files with a labeled variable were dragged and dropped onto a dashboard, and the framework automatically implemented supervised learning and developed optimized classifiers and their respective hyperparameters, depending on the nature of the data, using a grid-search method (Fig. 1). We used a linear logistic regression model for the linear classification. The Akaike information criterion was used for variable selection in this model. Algorithms including support vector machines using an RBF kernel, gradient boosting, random forests, neural networks, and deep learning were also used for a non-linear classification model. The classifiers and their respective hyperparameters are shown in Table 1. For deep learning model, we defined two dense layers using ReLU activation function with drop-out ratio of 0.5, and then added output layer with the sigmoid activation function. We compiled the model using binary cross entropy as the loss function. An RMS prop optimizer was used as a hyperparameter for the optimization of deep neural network. The framework automatically selected the best classifier and its respective hyperparameter for the prediction model based on a grid search. The detailed process of searching for the optimal hyperparameters was shown in Supplementary Table 1. Algorithm optimization (e.g., a heatmap of predictive accuracy in a support vector machine [SVM]) or materials to compare the accuracies among the classifiers (confusion matrix or receiver operating characteristic curve) were automatically created.
The second component consisted of the application of the developed model to a new dataset of interest. The CSV dataset of interest was dragged and dropped onto a dashboard, and the software applied the optimized classifiers and hyperparameters developed in the first component and outputted the probabilities of the respective labels. www.nature.com/scientificreports www.nature.com/scientificreports/ Statistical analysis. Continuous variables were expressed as the medians with the first and third quartiles, while categorical variables were expressed as frequencies (%). Comparisons were performed using the Wilcoxon rank-sum or chi-square test for quantitative and categorical variables, respectively. We adopted the approaches used in the developed framework described above to predict the presence of HCC. To evaluate the accuracy of the model, we randomly split a total of 1582 patients into three parts: (i) the training set (80%), which was used to build the model, (ii) the development set, which was used for tuning the model parameters, and (iii) the test set, which was used to evaluate the performance of each classifier and assessed the predictive accuracy of the developed model. We then used a receiver-operation characteristics (ROC) curve analysis to assess the predictive accuracy of our classifier. The area under the curve (AUC) was evaluated as the ability to predict the presence of HCC. The variable importance for class discrimination in the predictive model was assessed using the mean decrease in the Gini impurity 24 .

Results
Patient characteristics. Finally, we extracted 1582 patients from our database (539 HCC and 1043 non-HCC patients). The dataset did not contain any missing data. The patient characteristics are shown in Table 2. The proportions of patients with a male sex, HCV antibody-positivity, and HBs antigen-negativity were significantly higher among the HCC patients, compared with the non-HCC patients. The serum levels of AFP, AFP-L3, DCP, AST, ALP, GGT, and TB, and the patient age were also significantly higher among the HCC patients, whereas the serum ALT level, platelet count, and albumin level were lower.
Predictive accuracy for HCC of each classifier. Table 3 shows the predictive accuracy for HCC presence for each classifier using the optimum hyperparameter that provided the highest predictive value in each procedure. We assessed the predictive accuracy of the developed model in the test set. The predictive accuracy for HCC presence provided by gradient boosting was 87.34%, which was the highest among all the classifiers in our framework. The optimal hyperparameters of this classifier for the data used in the present study were eta = 0.08, gamma = 0.02, max depth = 1, min_child_ weight = 1.5, nround = 300, subsample = 0.5, and colsam-ple_bytree = 0.9. An ROC analysis showed that the AUC, sensitivity, and specificity for this optimal classifier were 0.940, 93.27%, and 75.93%, respectively (Fig. 2). Deep learning was not an optimal classifier for the current data.
Assessment of variable importance for class discrimination of the predictive model. We then investigated the variable importance of the optimal predictive model using the gradient boosting developed in the current study. Figure 3 shows the mean decrease in the Gini impurity of this model. Patient age followed by three tumor markers and albumin level were the most important variables for HCC prediction.
Predictive accuracy for HCC of single tumor markers. We also investigated the diagnostic accuracy of models using a single tumor marker. Using cut-offs of 200 ng/mL for AFP, 40 mAu/mL for DCP, and 15% for AFP-L3 25 , the accuracies of AFP, DCP, and AFP-L3 for HCC presence were 70.67%, 74.91%, and 71.05%, respectively. We also plotted the ROC curves for the prediction of HCC for three tumor markers (Supplementary Fig. 1). The AUCs for the prediction of HCC for AFP, DCP, and AFP-L3 were 0.766, 0.644, and 0.683, respectively.  Table 1. Classifiers and their respective hyperparameters and R packages used. * Scalar value, specifying the relative importance of the regularization function. † An option to specify one or more values for the probability of a type-I error. ‡ A parameter for the soft margin cost function, which specifies the allowance of a misclassification penalty for stability. § A parameter to specify the complexity of the separation margin. || A learning rate or step size shrinkage used in an update to prevent overfitting. ¶ Minimal loss reduction required to make a further partition on a leaf node of the tree. ** Maximum depth of tree to control over-fitting; increasing this value makes the model more complex. † † Minimum sum of instance weight needed in a child node. ‡ ‡ Maximum delta step allowed for each tree's estimation. § § Subsample ratio of training instance. |||| Subsample ratio of columns when constructing each tree. ¶ ¶ Total number of trees included in the forest model. *** Number of features used in the construction of each tree. † † † Number of units in hidden layer (number of nodes in each hidden layer was set as 1). ‡ ‡ ‡ A regularization parameter to avoid over-fitting. |||||| Fully connected neural network with 4 layers of neurons (16-64-64-2). ¶ ¶ ¶ A single training iteration over the entire training data. **** Number of training samples processed at an iteration. † † † † A device to adjust the deep learning model for optimal execution. www.nature.com/scientificreports www.nature.com/scientificreports/

Discussion
In addition to tumor marker levels, biomarkers of liver inflammation, liver fibrosis, liver function, and the hepatitis virus status are commonly measured in daily clinical practice. These biomarkers can be used to predict the presence of HCC. Ideally, all clinically available information should be used for such predictions. In the current study, we developed a graphical user interface framework to establish the most appropriate model automatically depending on the applied data using a machine-learning approach and then assessed the accuracy of the model.
Model fitting is important for a successful predictive method. If the data is linearly separable, a linear model will fit the data 26,27 . However, if the data is linearly inseparable, a non-linear model will fit the data better. Therefore, classifiers should be selected depending on the nature of the data. Also, the learning parameters of each classifier should be tuned properly using a grid search method 28,29 to obtain the ideal hyperparameters providing the highest predictive values. Using the optimal hyperparameter, gradient boosting (non-linear model) provided the highest accuracy (87.34%) for the data used in the current study. This model reduced the misclassification rate by about half, compared with a single tumor marker.
Personalization is one of the ultimate goals of modern medicine 30 . Predictive models provide a personalized assessment of the probability of a clinical event using patient-specific characteristics and have increasingly been incorporated into practice in the field of cancer medicine [31][32][33] . The framework developed in the current study can be used to identify optimal classifiers easily and can be applied to new datasets of interest containing various kinds of data, thus potentially becoming a translational mechanism between academic research and clinical practice.   www.nature.com/scientificreports www.nature.com/scientificreports/ Deep learning has enabled major breakthroughs in the processing of images, video, speech, and audio 34 . However, deep learning was not the optimal classifier in the current study. Deep learning requires a large polynomial sample size in terms of the dimensions of the input and an exponential sample in terms of the depth of the network to obtain ideal convergence boundaries 35 , which may be unrealistic requirements in clinical settings for ethical or methodologic reasons. Instead, identifying optimal classifiers and hyperparameters depending on the available data is important. The framework developed in the current study may help to provide optimal models.
The previous studies which compared the predictive performance of tabular data also showed the highest predictive performance of gradient boosting in the medical fields (e.g, urinary tract infections 36 , hip fractures 37 , sepsis 38 , or bioactive molecules 39 ). Notably, Chiew et al. showed outstanding performance of gradient boosting compared to other machine learning algorithms for the risk prediction of suspected sepsis patients in the emergency department using relatively small number of sample 38 . Gradient boosting may be the best algorithm for the analysis of tabular data especially in the medical field where it is difficult to collect a large amount of data for ethical or methodologic reasons.
In the future, predictive models using machine learning approach may be implemented in electronic medical record system and may offer decision support to improve patient outcomes and reduce clinical diagnosis error in daily medical practice. The accuracy of diagnostic algorithm based on machine learning approach depends on the number of samples for training 40 . Larger quantities of multidimensional medical data will be stored in the future, potentially improve the accuracy of machine learning based classifier. Discrepancy of disease distribution between train and test samples is also an important factor for the performance of each classifier. The predictive model developed in the current study is based on the data of tertiary referral center requires. Therefore, further  www.nature.com/scientificreports www.nature.com/scientificreports/ study with external validation in a community-and clinic based population is needed to assess the practical performance of the current model.
In conclusion, the framework developed in the current study provided a novel predictive model of HCC, producing an area under the curve of 0.943. This model reduced the misclassification rate by about half, compared with that for a single tumor marker. The current framework can be applied to various kinds of data, and thus could potentially become a translational mechanism between academic research and clinical practice.

Data Availability
The datasets generated during the current study are available from the corresponding author on reasonable request.