A machine learning approach to predict healthcare cost of breast cancer patients

This paper presents a novel machine learning approach to perform an early prediction of the healthcare cost of breast cancer patients. The learning phase of our prediction method considers the following two steps: (1) in the first step, the patients are clustered taking into account the sequences of actions undergoing similar clinical activities and ensuring similar healthcare costs, and (2) a Markov chain is then learned for each group to describe the action-sequences of the patients in the cluster. A two step procedure is undertaken in the prediction phase: (1) first, the healthcare cost of a new patient’s treatment is estimated based on the average healthcare cost of its k-nearest neighbors in each group, and (2) finally, an aggregate measure of the healthcare cost estimated by each group is used as the final predicted cost. Experiments undertaken reveal a mean absolute percentage error as small as 6%, even when half of the clinical records of a patient is available, substantiating the early prediction capability of the proposed method. Comparative analysis substantiates the superiority of the proposed algorithm over the state-of-the-art techniques.

An electronic health record (EHR) is an electronic version of a patient's clinical history over time. It comprises all administrative clinical data of a patient in a healthcare organization, including his/her demographics, diagnosis, medications, laboratory data, and associated costs, and so on. The plethora of longitudinal patients' data of an EHR can be utilized for developing patient-centered personalized healthcare solutions, including cost. It is however worth mentioning that the healthcare costs, ranging from clinician's fees to the cost of hospital stays and medicines, are escalating at a rapid rate around the world 1, 2 . It has motivated the researchers to take keen interest in controlling this upsurge in the healthcare costs. The crucial step to control the healthcare cost is to enable the healthcare organizations to predict the possible future cost of individual patients. It in turn helps to identify the individuals at the highest risk of enduring the significant costs in future. It thus helps to prioritize the allocation of scarce resources among the patients in a healthcare organization for efficient care management.
Moreover, a report from The Commonwealth Fund (2012) emphasizes the need to identify high-cost patients as the first step towards achieving "rapid improvements in the value of services provided" 3 . A proactive approach to address this problem is to identify patients who are at risk of becoming high-cost patients accurately before substantial unnecessary costs have been incurred and health condition has deteriorated further. Eventually, this calls for prediction of possible total healthcare cost of a patient as early as possible when a limited volume of clinical records of the given patient is provided. In other words, another important aspect in the context of healthcare cost prediction is to devise a model using a training set of complete clinical records of some patients to predict the total healthcare cost of a new patient as accurately and also as early as possible, preferably before the availability of the patient's full-length clinical record. Such early prediction of future healthcare cost can be used to judiciously identify high-risk high-cost patients and prevent crises in healthcare organizations. It is obvious that the earliness of the prediction may affect the accuracy. It has motivated the researchers to build a model to predict healthcare cost as early as possible while maintaining an appropriate level of accuracy.
Nevertheless, healthcare cost prediction based on individual patient's characteristics is a challenging issue from the data mining perspective due to the non-Gaussian skewed distribution of the cost data of the patients 4 . Studies in 5,6 reveal dubious efficacy of the statistical methods to predict the healthcare cost. Furthermore, the traces of linear regression and rule-based approaches are also found in literature 2,6 for the cost prediction. But the requirement of a lot of domain knowledge has restricted their applications for most of the real world economic data of the patients 7 . Now-a-days, machine learning algorithms, including clustering and classification techniques, have emerged as an alternative effective tool for this purpose 8, 9 . OPEN 1 Basque Center for Applied Mathematics, Bilbao, Spain. 2  www.nature.com/scientificreports/ This paper proposes a machine learning based novel approach for healthcare cost prediction of individual patient's treatments based on their clinical actions, jointly including the clinical activities and the respective cost over time. The activity here represents diagnosis, medication, pharmacy and the like. A two-step procedure is employed in the learning phase: (1) in the first step, the ordered sequences of clinical actions of the patients' treatments are clustered using the hierarchical DBSCAN 10 with an aim to identify the group of patients undertaking similar clinical activities and incurring similar healthcare costs, and (2) each group is then modelled by means of a Markov chain 11 delineating the probability distributions of transitions between different clinical actions. A new distance measure is also proposed to measure the similarity of the treatment patterns of the patients during clustering.
The prediction phase, concerned with prediction of the healthcare cost of the sequence of clinical actions of a new patient's treatment, also encompasses two steps: (1) first, for each group, we compute a tentative cost of the new sequence by averaging the cost of its k-nearest neighbor 12 sequences in the group, (2) the final cost is obtained as a weighted sum of the cost estimated by each of the groups. The weights for each group are the likelihood of the new sequence to the respective group as assigned by the corresponding Markov chain.
The performance of the proposed healthcare cost prediction algorithm is evaluated with the economic information together with information of the clinical activities of the breast cancer patients obtained from the health administrative department of the public health care system of the Basque Country, Spain. A 10-fold cross validation is employed with the training dataset resulting the optimal value of k of k-NN as three in the present application with respect to the mean absolute percentage error (MAPE) 2 . Moreover, the proposed method results in an MAPE measure of less than 6% when half of the clinical records of a new patient is available, irrespective of the value of k. It substantiates the capability of the proposed stratagem for early prediction of healthcare cost. Experiments undertaken also reveal that the proposed algorithm outperforms its state-of-the-art contenders with respect to MAPE metric. The comparative analyses verify the significance of jointly considering the clinical activity and the associated cost data to effectively capture the clinical records of patients for accurate healthcare cost prediction as early as possible.
The paper is divided into following sections. Second section delineates the proposed method of healthcare cost prediction. Experiments undertaken and the results are reported in third section. Fourth section concludes the paper.

Method
Data transformation. This section refers to transforming the database of individual patient's treatments into a series of actions, sorted by time. Here, we provide some definitions which will be used throughout the paper to develop a solution to the healthcare cost prediction problem.
Definition-1: Action. Let X be the set of all clinical activities, including diagnosis, procedure, medicine and the like, Y ∈ R be the set of all possible incurred healthcare cost as recorded in the database and T be the set of visiting times of the patients to the hospital. An action, say a, is then expressed as a three-tuple, given by Definition-2: Patient's treatment. A patient's treatment is defined by a sequence of its corresponding actions, sorted by the visiting time. Symbolically, a patient's treatment P is represented by where a i = (x i , y i , t i ) represents the action encompassing the clinical activity x i ∈ X and its respective healthcare cost y i ∈ Y incurred during visiting time t i ∈ T of the specific patient. For sake of simplicity of readers, we drop the notion of visiting time and hence a i now can be simplified as The clinical actions of P in (2) are chronologically ordered. Evidently, if i < j , a i occurs before a j . A sequence of actions of a patient's treatment is used to jointly track the progression of its activity-outcome and the corresponding healthcare cost over time. The length of the sequence varies across patients because of the diversity in their treatments over time.
Definition-3: Modified cost. Intuitively, the number of possible actions for all patients in the database is huge due to infinite number of healthcare cost elements in Y. For the sake of simplicity, Y is reduced to a finite set in a two step procedure described below.
(1) Discretization: First, the entire range of Y is discretized into n s segments defined by the n s -quantiles of Y. In other words, we set the lower and the upper limit of the i-th segment respectively to the (i − 1)-th quantile and the i-th quantile of the healthcare cost elements for all possible clinical activities, recorded in the database. (2) Quantization: Then a real healthcare cost element, lying in the i-th segment is replaced by the mean value of all cost elements of the i-th segment.
(2) P = (a 1 , a 2 , . . . , a n ) Clustering patients' action-sequences. It is noteworthy that patients undergoing various clinical activities reveal considerable diversity of their corresponding cost information. Hence, prior to predict cost of a new action-sequence, we cluster the action-sequences of the existing patients into groups. We then consult the cost information of the specific group of patients providing the maximum similarity with the action-sequence of the new patient to predict the respective possible future cost.

Scientific Reports
Two significant issues to categorize the patients based on their action sequences include: (1) design of an appropriate distance measure to capture the similarity between action-sequences of varying length, and (2) selection of an efficient clustering algorithm to ensure that action-sequences within a group are similar to each other than those in other groups.
Design of distance measure. There exists plethora of literature on using edit distance 13 to measure the dissimilarity of two strings of characters (or words). Given two strings S 1 and S 2 over a finite alphabet, an edit distance ED(S 1 , S 2 ) between S 1 and S 2 can be defined as the minimum cost of transforming S 1 to S 2 through a sequence of weighted edit operations. These operations primarily include insertion, deletion, and substitution of one symbol by another. Usually, the edit operations are assigned with equal weights of unity. Nevertheless, the string in this paper denotes the action-sequences.
However, there is a major limitation of using the conventional ED directly in the present context. The conventional ED compares two strings of characters (or words) only. In the present work, the components of the string (or action-sequence) is not only representing character (symbolizing a clinical activity) but an activitycost pair. Hence, application of the conventional ED in the present scenario captures the difference between two action-sequences based on their respective clinical activities only, ignoring the corresponding healthcare cost information. It thus loses the cost information and the temporal relationship of the activity-cost pairs over time.
Consequently, the clusters of patients based on the conventional ED measures identify patients ensuring similar clinical activities only. Evidently, the accuracy of the healthcare cost prediction based on the clusters, thus formed, is reduced to great extent. It has motivated us to design an appropriate distance measure to jointly capture the dissimilarity of two clinical activities (of two different action sequences) and their respective healthcare costs.
The proposed distance measure, referred to as treatment pattern difference (TPD) is an extended version of the conventional ED. In case of the conventional ED, all possible edit operations are associated with equal cost of unity. In TPD, the edit costs are modified as follows to consider the healthcare cost components of two action-sequences.
Let P 1 and P 2 be two different action-sequences. The cost of insertion of a clinical activity x i (or a character) to convert P 2 to P 1 is given by where y i denotes the healthcare cost of the clinical activity x i at the visiting time t i in the action-sequence P 1 . Similarly, the cost of deleting an action x j from P 1 to covert it to P 2 is given by where the symbols carry their usual meanings. If the clinical activity x i of P 1 is substituted with a different clinical activity x j of P 2 , the corresponding edit cost is given by Here ǫ is a small positive constant. It is used to ensure that even when y i = y j for x i = x j , at least C 3 = ǫ is used as the edit cost for substitution of x i by x j . It is noteworthy that if x i = x j , the conventional ED gives a zero penalty. However, there are instances of different healthcare costs for the same clinical activity of two different patients. To capture this, TPD uses an additional edit cost, given by Hence, the total edit cost to convert an action-sequence P 1 to another action-sequence P 2 is given by Here, w 1 and w 2 denote the weight for the edit operations respectively for different and similar activities. Intuitively, w 2 < w 1 as it corresponds to the penalty corresponding to similar activities with different healthcare cost. After a wide experimentation, we set w 1 = 0.7 and w 2 = 0.3 . An example of evaluating the dissimilarity of two action-sequences based on the TPD measure is presented in Fig. 2.
Selection of clustering algorithm. The TPD measures of each pair of patients' treatments in the given record are used to cluster the similar sequences in the same subgroups. The hierarchical density-based spatial clustering of applications with noise (hierarchical DBSCAN) algorithm 10 is employed to identify the groups of patients' treatments. The selection of DBSCAN in the present context is justified because of its merit of clustering similar data points (here, the action-sequences of patients) into same groups based on the density (number of nearby neighbors) without prior setting of the number of clusters. Moreover, unlike the traditional partitioning algorithms, DBSCAN can be applied for clusters of arbitrary shape, even when the data may be contaminated with noise 14 .
It is however worth mentioning that the huge economic database includes clusters of records of patients characterized at different density levels. The traditional DBSCAN algorithm with a single global density threshold often fails to effectively identify such clusters. This impasse is overcome here by using the hierarchical DBSCAN, proposed in 10 , which discovers all DBSCAN-identified clusters for an infinite range of density thresholds. Finally, it identifies a simplified hierarchical structure of significant clusters only.
Markov chain representation of a cluster. This step is concerned with representing each cluster of patients' action-sequences by a Markov chain 11 . The crux of such representation is founded on the underlying premise that the medical practitioners take their decision based on the previous clinical activities. Again, our cost prediction algorithm greatly relies on the recorded action-sequence of a patient. www.nature.com/scientificreports/ A first order Markov chain exhibits memoryless property where the current state only depends on the previous state. Let N be the possible number of actions (activity-cost pairs) in the database. The Markov chain model of a group of patients, say G l , is then demonstrated by a state-transition probability distribution, which is denoted as: Here q i,j,l and p l (x t+1 = a j |x t = a i ) respectively denote the number of cases and the probability of transition from the current action x t = a i to the immediate next action x t+1 = a j in the specific group G l of action-sequences. Evidently, it satisfies In addition to M l , we also evaluate the initial probability p l (a i ) of action a i considering all the action-sequences in the group G l for i = 1, 2, . . . , N as follows.
Here s i,l denotes the number of action-sequences initiated with the action a i in G l for i = 1, 2, . . . , N . This entire process is repeated for all groups identified by the hierarchical DBSCAN.
Cost prediction of a patient's treatment from action sequence. The aim of this step is to predict the possible total cost of a patient from the respective action-sequence. The action-sequence of the patient is formed following the principle given in "Data transformation" section. Let the ordered sequence of actions of the new patient's treatment be denoted by P = (a 1 , a 2 , . . . , a n ) where the action a i represents the activity-cost pair at the visiting time instant t i . The prediction of future cost based on P is undertaken in three phases.
Phase-1: cost estimation of P based on a specific group. We employ k-nearest neighbor (k-NN) to identify k action-sequences from a group, say G l , that offer maximum similarity with P based on TPD measure as given in (8). First, we compute the TPD values between P and each member sequence of the group G l . The member sequences are then sorted in ascending order of their TPD measures thus evaluated. The first k members are selected as the k nearest neighbors of P. Next, each of the k members is assigned a weight w j,l , inversely proportional to its TPD measure from P for j = 1, 2, . . . , k . Consequently, the total cost ĉ l (P) of the new actionsequence P estimated by the group G l is given by Here c j,l denotes the total cost incurred by the j-th nearest neighbor of P in G l for j = 1, 2, . . . , k . ĉ l (P) is computed for all clusters of patients identified by the hierarchical DBSCAN.
Phase-2: Evaluation of the likelihood of P to patients' groups. This step is concerned with evaluating the likelihood of P to each subgroup of patients based on the respective Markov chain model. The likelihood of the ordered sequence of actions P = (a 1 , a 2 , . . . , a n ) to a specific group G l is given by Here a 1 denotes the initial action of P and a i represents the action of P occurred at visiting time t i for i = 1, 2, . . . , n . Evidently, p l (a 1 ) and p l (a i+1 |a i ) respectively symbolize the initial probability of action a 1 and the probability of transition from the current action a i to the immediate next action a i+1 of P as described by the group G l . Expression (14) is evaluated using the Markov chain model M l representing the group G l .
After evaluating l (P) for all groups, the normalized likelihood of P to each subgroup is computed using Phase-3: Cost prediction based on all groups. After evaluating the estimated cost and the normalized likelihood of P to all groups, the total cost of P is finally predicted following  To validate the proposed method of cost prediction, the present work considers the pool of breast cancer patients only. The selection of breast cancer patients from the database conforms the International Statistical Classification of Diseases and Related Health Problems (10-th revision) 15 , stating that every code starting by C50 corresponds to breast cancer diagnosis. A few filtering steps are then carried out following 16 to judiciously select the pool of patients of interest. The filtering process affirms that the selected patients have their complete treatment in the above-mentioned time period of two years. Following the medical guideline, a final set of 972 patients is identified. 70% of the entire database is ultimately used as the training dataset, while the remaining as the test data. A 10-fold cross validation is employed on the training dataset for judicious selection of the value of k for k-NN.The proposed method is implemented using MATLAB 2019a.

Identification and representation of patients' action-sequences. The final record of the 464
patients consists of 23 unique clinical activities as described in Table 1. The healthcare cost is next discretized into n s segments. In Fig. 3, we present a plot of normalized quantization error values for different settings of the number of quantiles n s , varied from 2 to 12 to check a significant improvement in performance. The normalized quantization error (NQE) is given by (17).
Here c(i) and c m (i) respectively denote the true and the modified i-th healthcare cost (after discretization) of the database with N c cost elements for i = 1, 2, . . . , N c . Figure 3 reveals that the quantization error is reduced with an increase in the number of segments n s . However, it is also observed that there is no significant change in the error for n s ≥ 8 . We have thus fixed n s = 8 . It is worth mentioning that the setting of n s here is biased to the healthcare cost values of the present database. The quantization of the healthcare cost range of the present database using 8-quantiles ensures a balanced number of healthcare cost elements in each of the eight cost-segments.
Next, the healthcare cost of all clinical activities of 464 patients is discretized in eight segments based on 8-quantiles of the healthcare cost range, as demonstrated in Fig. 1. Let the segments (sorted in ascending order)   Table 2.
The hierarchical DBSCAN algorithm is then employed on the training dataset to cluster the sequences using TPD values. The algorithm results in eight clusters. The clusters thus identified are pictorially represented in Fig. 4. The descriptions of the actions of the sequences, shown in different colors, are tabulated in Table 2. Each cluster is then described by a Markov chain following "Markov chain representation of a cluster" section.

Performance evaluation of proposed healthcare cost prediction method. Performance met-
ric. The performance of the proposed cost prediction algorithm is evaluated with respect to mean absolute percentage error (MAPE) with a lower error indicating a better performance.
Here c(P i ) and c(P i ) (evaluated using (16)) respectively represent the true and the predicted cost of the i-th patient's treatment P i in the validation or the test dataset with N t records for i = 1, 2, . . . , N t .

Validation of earliness prediction and selection of k of k-NN.
The capability of the proposed algorithm to predict the possible total healthcare cost of patients is verified by varying the length of sequence of the recorded treatments of the patients from 20 to 100%. The appropriate selection of k (of k-NN) for the optimal performance is undertaken using 10-fold cross validation on the training dataset. The MAPE values (averaged over 10 folds of the training data) for different settings of k and percentage of length of sequence of the recorded treatments of the patients are tabulated in Table 3. Table 3 reveals that the longer the length of the sequence, the better is the prediction accuracy with smaller MAPE measures, irrespective of the setting of k. The optimal performance of the method is obtained for k = 3 with the entire sequence information. It is also noted that an MAPE smaller than 6% is obtained even when 50% of a visit sequence is utilized. It proves the effectiveness of the proposed method for an early prediction of the healthcare cost.
Next to check the variability of the MAPE measures obtained by the proposed method with k = 3 , 20 experimental runs are undertaken with 10-fold cross validation of the training data. The samples of each of the 10-folds of 20 runs are randomized. The results are summarized in Fig. 5. Figure 5 reveals detection of outliers when the length of the action sequences (i.e., squence of recorded treatments) is considerably small. The mean and standard deviation of the MAPE values obtained by the proposed method over 20 experimental runs, each with 10 folds, are reported in Table 4.
Comparative performance analysis. The next experiment aims at comparative performance analysis of our proposed algorithm. Three state-of-the-art techniques are considered in the comparative framework, including gradient boosting (GB) 17 , artificial neural net (ANN) 18 and elastic net (EN) 19 . These existing methods have utilized the healthcare cost data only to predict the future cost 2 . The hyperparameters of the competitive algorithms are tuned using the grid search method and the 10-fold cross validation of the training data. The tuned hyperparameters are reported in Table 5.
The comparative analysis of performance of the competitors is undertaken next. To ensure fair comparison of all contenders, each algorithm is evaluated on the same 10-fold cross validation split of the data and the same random number seed is used to split the data in each case. The MAPE measures (averaged over 10 folds) www.nature.com/scientificreports/ obtained by all four contender algorithms using training data are tabulated in Table 6. Table 6 reveals that the proposed algorithm outperforms its contenders by achieving the minimum MAPE measure in most of the cases. GB outperforms the proposed method in two cases, where the lengths of action sequence are 20% and 30%. The results of Table 6 are further used to carry out the hypothesis test to verify the statistical significance of the difference in performance of the proposed algorithm and each of its three contenders. Assuming no specific distribution of the population of MAPE values (obtained after 10-fold cross validation for each algorithm), the Friedman non-parametric test 20 is undertaken on the mean values of MAPE metric obtained by the contender algorithms (over 10-fold cross validation) with a level of significance α = 0.05. The Friedman ranks obtained by the contender algorithms based on the results given in Table 6 are reported in Table 7. The results reported in Table 7 also designate the proposed method as the best algorithm. The test considers the null hypothesis that there is no significant difference between the performances (based on the mean MAPE measures) of the competitive algorithms. www.nature.com/scientificreports/ respective critical value of 7.815 following χ 2 F distribution with 3 degrees of freedom at α = 0.05. It substantiates statistically significant difference between the MAPE measures obtained by the proposed algorithm and each of its contenders.
The results in Table 7 highlights the proposed method as the best algorithm, so the Bonferroni-Dunn posthoc analysis 20, 21 is performed with the proposed method as the control method. For the Bonferroni-Dunn test, a critical difference value is calculated which for these data (represented by Table 7) comes as 1.457. A significant         Table 8. Statistical values obtained by Friedman test based on Friedman ranks given in Table 7. www.nature.com/scientificreports/ difference between the performances of the control algorithm and its contender is inferred if their corresponding Friedman ranks differ at least by a critical difference. Pictorially, it is shown in Fig. 6. It is evident that the performances of ANN and EN in the present context are significantly inferior to the proposed method. Finally, following the inferences from the non-parametric statistical test, the Hochberg multiple comparison test is further undertaken 21 with the proposed method (achieving the best Friedman rank) as the control algorithm. The adjusted p-values are reported in Table 9. It is evident from Table 9 that the test infers that there is no statistically significant difference between the performances of the proposed method and GB with the respective adjusted p-value exceeding α = 0.05 22 . However, the null hypothesis is rejected for the remaining cases of comparing the proposed method with its competitor algorithms with an adjusted p-value smaller than α = 0.05. Table 10 reports the MAPE measures for the same competitors for the test data. The reported results are pictorially presented in Fig. 7. The reported results substantiate that our proposed method overcomes its contenders with GB acquiring the second rank. It in turn validates the efficiency of jointly considering the clinical activity and the associated cost data for the healthcare cost prediction.

Conclusion
The paper presents a novel method to predict healthcare cost of breast cancer patients as early and accurately as possible. The early prediction capability of the proposed method is used for identifying patients at risk of becoming high-cost healthcare users, before incurring substantial avoidable costs. The merit of the paper lies in the following counts. First, it considers the clinical activity and the associated healthcare cost data jointly to model the treatment of a patient. Second, it recommends a novel distance measure to capture the dissimilarity of two treatment patterns, encompassing both clinical activities and healthcare cost information. Third, it employs the hierarchical DBSCAN to categorize patients into different clusters with an aim to effectively identify the high-need and/or high-cost patients. Fourth, each cluster of patients is depicted by a Markov chain model to   www.nature.com/scientificreports/ mathematically represent the treatment patterns. Finally, the Markov chain models of all the clusters are used to predict the possible future (total) cost of a patient's treatment. The performance of the proposed algorithm is compared for different length of sequence of the recorded treatments of patients. The experimental results reveal that the method achieves an MAPE value, as small as 6% even with half of the clinical records of a patient. Experiments undertaken also substantiate the superiority of the proposed algorithm to three state-of-the-art techniques which utilize only the healthcare cost data of the patients for prediction. As a continuation of the present work, we first plan to test our method on different databases from different healthcare organizations for patients suffering from different diseases. More experiments on different databases could help to take a deeper dive into the data and explore ways to obtain more solid evidence on the performance of the proposed method, irrespective of databases. Second, we may consider the socio-demographic information of the patients along with the clinical actions with an aim to be utilize their joint explanatory power to understand the root causes of patient' costs. Third, we have not exploited time feature in the present work. Intuitively, inclusion of time feature may effectively capture the differences of treatment patterns of patients and thus may enhance the prediction performance of the proposed method. Finally, appropriate stratagem needs to be developed to effectively balance the trade-off between the accuracy and earliness of the healthcare cost prediction.