A deep learning approach for facility patient attendance prediction based on medical booking data

Nowadays, data-driven methodologies based on the clinical history of patients represent a promising research field in which personalized and intelligent healthcare systems can be designed and developed. In this perspective, Machine Learning (ML) algorithms can be adopted to deploy smart services that enhance the overall quality of healthcare systems. In this work, starting from an in-depth analysis of a data set composed of millions of medical booking records collected from the public healthcare organization in the region of Campania, Italy, we have developed a predictive model to extract useful knowledge on patients, medical staff, and related healthcare structures. In more detail, the main contribution is a Deep Learning (DL) methodology able to predict the access of a patient to one or more medical facilities of a fixed set in the immediate future, namely the subsequent 2 months. A structured Temporal Convolutional Neural Network (TCNN) is designed to extract temporal patterns from the administrative medical history of a patient. The experiments show that the designed methodology outperforms the compared methods on all the considered metrics. Finally, this work represents a novel application of a TCNN model to a multi-label classification problem not linked to text categorization or image recognition.

www.nature.com/scientificreports/

Since 2012, access to medical services within the public healthcare system in Italy has been provided through a booking system in dedicated centers administered by the local health authority and controlled by the regional government. To access the booking center, a referral written by a practitioner is required; each referral includes a prescription for different provisions, such as specialized medical examinations, medical therapy sessions, laboratory analyses (e.g. venous blood sampling), and diagnostic examinations. The data analyzed in this study come from a distributed database serving various local health departments of Campania, Italy. In more detail, we have analyzed five years of medical prescriptions and booking appointments, from 1st January 2014 to 31st December 2018, including cancellations and reschedulings, which in total amount to more than 13 million entries. This paper aims to exploit temporal administrative records to predict the possible medical examinations of a patient in the following two months, and in particular the facility at which the appointment will occur. Thus, the model is linked to the prediction of patient distribution through the regional healthcare system. Our problem is a multi-label problem, because an appointment booking at one facility does not exclude the possibility of another appointment at a different facility. In this work, the number of facilities we focused on is fixed to 10. Our ML approach consists of a deep learning methodology based on a Temporal Convolutional Neural Network (TCNN), adapted to the more complex case of patient information developing over time. The output of the proposed method is, for each facility under consideration, the likelihood that the patient will have an appointment at that facility, expressed as a percentage.
The performance of our model has been verified by considering several metrics in a comparison with other methodologies suited to the solution of multi-label problems. Moreover, in the latter part of the work, an additional service is presented: a lower bound on the patient attendance of each facility in the subsequent 2 months.

Data preprocessing
The main table of the database has more than 13 million unique rows, referring to booking appointments and medical prescriptions covering the period 2014-2018. Each record stores information about the patient (gender, age at the time of prescription, etc.), the practitioner, the appointment (date, medical facility, health service (HS) provision, etc.), and the referral (prescription date, number of prescriptions, etc.). Moreover, code correspondence tables contain details about each medical facility (e.g. location) or HS provision (e.g. a list of medical branches that can be associated together).
Data cleaning. Due to the presence of outliers in the data set (e.g. test referrals with more than one patient and/or multiple practitioners), an operation of data cleaning has been carried out: (1) all records containing 'unknown' referrals, prescriptions, locations and medical facilities (due to unrecoverable errors in the encoding of the identification string) were removed; (2) only appointments labeled 'Valid' were maintained; (3) only patients with at least five appointments throughout the reference period were maintained; (4) all invalid records containing zero appointments were removed; and (5) all records having negative days of waiting between the date of registration and the date of appointment were removed. This operation reduced the number of records from around 13 million to around 8.4 million and the number of patients from around 1.6 million to around 500,000.
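The five cleaning rules above can be sketched as a simple filter. This is only an illustrative sketch: the field names (`referral`, `status`, `n_appointments`, `waiting_days`, and so on) are hypothetical and not taken from the original database, and applying rule (3) after the other rules is an assumption about their order.

```python
from collections import Counter

def clean_records(records, min_appointments=5):
    """Apply the five cleaning rules to a list of booking records (dicts).

    All field names are hypothetical placeholders for the real schema.
    """
    id_fields = ("referral", "prescription", "location", "facility")
    kept = [
        r for r in records
        if all(r[f] != "unknown" for f in id_fields)  # rule 1: no unknown codes
        and r["status"] == "Valid"                     # rule 2: valid appointments only
        and r["n_appointments"] > 0                    # rule 4: no empty records
        and r["waiting_days"] >= 0                     # rule 5: no negative waiting time
    ]
    # rule 3 (assumed to be applied last): patients with enough appointments
    per_patient = Counter(r["patient_id"] for r in kept)
    return [r for r in kept if per_patient[r["patient_id"]] >= min_appointments]
```

With this ordering, a patient who has five records of which one is invalid is dropped entirely, since only four usable appointments remain.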
Data analysis. Firstly, we analyzed the number of unique occurrences of each entity of interest.
As can be observed from Table 1, the data set is characterized by a high heterogeneity. Therefore, we investigated the frequencies to obtain a better insight into the entity distribution. Regarding the appointment dates, the histogram in Fig. 1 shows that there is a significant number of appointments in every considered year and month, a fact which led us not to concentrate only on the data of the previous years.
Concerning the HS provisions, Table 2a records the first six, ordered by frequency. From Table 2b it can be observed that more than 70% of the HS provisions (1,815 out of 2,521) have an overall frequency lower than 1%. However, it is also clear that considering only one hundred HS provisions would result in a significant information loss. Thus, our data set is also characterized by sparsity.

Data manipulation.
After the data analysis, the following features were extracted for each patient: gender, birth year (estimated by subtracting the patient age from the year of the prescription date for each record and taking the minimum value), and an appointment list in terms of HS provision and medical facility for each date. To incorporate temporal information, inspired by the work of Cheng et al. 8 , we chose a matrix representation with regular time intervals; since more than 97% of the patients never had more than two distinct appointments in the same week, we selected weeks as time steps (number of weeks: 262). For each patient, two matrices in sparse form, named $P$ and $F$, were generated, with the time step as the row index and with the HS provision id and the medical facility id, respectively, as the column index. Therefore, for each patient, we built two matrices $P \in \mathbb{N}_0^{262 \times 2382}$ and $F \in \mathbb{N}_0^{262 \times 369}$, where, for example, $P_{88,54}$ is the number of appointments in the 88th week of the period 2014-2018 relating to the 54th HS provision, and $F_{133,207}$ is the total number of appointments in the 133rd week booked in the 207th medical facility.

Table 1. The table records the totals of unique occurrences of each entity of interest, constructed in order to obtain a first insight into the data set. For example, there are more than 2,000 different HS provisions and nearly 400 distinct medical facilities.
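The construction of the two sparse matrices can be sketched as follows. This is a minimal sketch under stated assumptions: the week index is taken here as the number of complete 7-day steps since 1st January 2014, which need not coincide with the indexing that yields 262 weeks in the text, and the dictionary-of-counts layout is only one possible sparse representation.

```python
from collections import defaultdict
from datetime import date

START = date(2014, 1, 1)  # first day of the reference period

def week_index(day):
    """1-based index of the 7-day step containing `day` (assumed convention)."""
    return (day - START).days // 7 + 1

def build_matrices(appointments):
    """Build sparse P and F for one patient.

    appointments: iterable of (date, provision_id, facility_id).
    Returns two dicts mapping (week, column id) -> number of appointments.
    """
    P = defaultdict(int)
    F = defaultdict(int)
    for day, provision_id, facility_id in appointments:
        w = week_index(day)
        P[(w, provision_id)] += 1
        F[(w, facility_id)] += 1
    return dict(P), dict(F)
```

For instance, two appointments for provision 54 on 1st and 3rd January 2014 both fall in week 1, so the entry `P[(1, 54)]` equals 2.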
Since our interest was to build a model which considers only the previous twelve months of the patient's medical history, we adopted the following approach to exploit the entire data set as effectively as possible: for each patient, every possible 52-week temporal window of the medical history (1st-52nd week, 2nd-53rd week, etc.) was extracted; the window was kept, and the sub-matrices of $P$ and $F$ relating to the same period stored, only if there were appointments in at least four distinct weeks of this period and at least one appointment in the following 2 months (9 weeks). Keeping the same notation, for each window we have $P \in \mathbb{N}_0^{52 \times 2382}$, $F \in \mathbb{N}_0^{52 \times 369}$ and a further facility matrix in $\mathbb{N}_0^{9 \times 369}$ (relating to the following 2 months, used to create the labels). Due to the high dimensionality and sparsity of the data, we made the following considerations. At first, we tried to gather the HS provisions into a few groups according to their corresponding medical branches. However, we found that an HS provision can be associated with different branches, and therefore this kind of grouping does not result in a partition of the set of HS provisions. Therefore, we constructed some statistical features of the HS provisions and used them to group the provisions into eight clusters (the number of clusters was determined by using the NbClust routine 9 ; the clustering was performed by using K-means). Regarding the medical facilities, we sorted them by the number of appointments collected in the $F$ matrices; due to sparsity, this step aims to classify samples whose labels are not extremely unbalanced. We considered the top 10 medical facilities as the targets for the multi-label classification problem; Table 3 illustrates them. We also used these ten facilities to reclassify the remaining 359 according to their location.
In detail, the top 10 facilities are located in eight distinct districts; therefore, the others were rearranged into nine groups based on their geographical position (eight groups which gather the medical facilities of the same district and a "rest-of-the-universe" group). Thus, the newly generated data set had about 7.1 million records, each one consisting of seven features: (1) patient id, (2) window progressive id, (3) gender, (4) age at the first week of the window, (5) matrix $P \in \mathbb{N}_0^{52 \times 8}$ (HS provision cluster id as column index), (6) matrix $F \in \mathbb{N}_0^{52 \times 19}$ (an integer from 1 to 10 + 8 + 1 as column index, representing the top 10 ranked facilities, the groups of facilities for each of the eight districts and the "rest-of-the-universe" group), (7) the facility matrix of the following nine weeks, in $\mathbb{N}_0^{9 \times 19}$. For example, $P_{63,4}$ is the number of appointments in the 63rd week of the period 2014-2018 relating to the HS provisions of the fourth cluster, $F_{126,7}$ is the total number of appointments in the 126th week booked in the seventh medical facility, and $F_{174,15}$ is the number of appointments in the 174th week booked in a facility of the fifth district.
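The window-selection rule described above (52 weeks of history, at least four distinct weeks with appointments, and at least one appointment in the following 9 weeks) can be sketched as below. The function name and the representation of a patient as a set of active week indices are illustrative choices; skipping windows whose 9-week horizon would run past the last week is an assumption.

```python
def extract_windows(active_weeks, n_weeks=262, history=52, horizon=9, min_active=4):
    """Return the first week of every window that satisfies the selection rule.

    active_weeks: set of 1-based week indices with at least one appointment.
    Windows whose label horizon exceeds n_weeks are skipped (assumption).
    """
    kept = []
    last_start = n_weeks - history - horizon + 1
    for start in range(1, last_start + 1):
        hist = range(start, start + history)
        future = range(start + history, start + history + horizon)
        enough_history = sum(w in active_weeks for w in hist) >= min_active
        has_future = any(w in active_weeks for w in future)
        if enough_history and has_future:
            kept.append(start)
    return kept
```

A patient with appointments in weeks 10, 20, 30, 40 and 55 yields three windows (starting at weeks 1, 2 and 3): only for these starts does week 55 fall in the 9-week horizon while the four earlier weeks lie in the history.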
Next, since the features have different ranges, the data had to be scaled into the interval $[0, 1]$: the gender feature ($0$ = male, $1$ = female) needed no further mapping; age was scaled with a min-max normalization ($\min = 0$, $\max = 120$); and for the elements of $P$ and $F$ we introduced a non-linear parametric scaling $\phi$ to emphasize the differences between low numbers rather than between high ones. Fig. 2 shows how the graph of $\phi$ depends on its parameter, while Table 4 illustrates the mapped values of the first integers for the chosen parameter value.
Data labeling. Based on the facility matrix of the following nine weeks, denoted here again by $F$, each window was labeled with a vector $y \in \{0, 1\}^{11}$ as follows: for $j \in \{1, \dots, 10\}$, $y_j = 1$ if $\sum_{i=1}^{9} F_{i,j} > 0$ and $y_j = 0$ otherwise, while the 11th label is defined by a condition involving the columns $\{\dots, 19\}$, with $y_{11} = 1$ otherwise. According to this definition, for $j \in \{1, \dots, 10\}$ the $j$-th label is equal to 1 if the patient had at least one appointment in the $j$-th facility in the following nine-week period; otherwise, the label is 0. The 11th label was added only because the proposed method needs to handle non-zero label vectors.
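A minimal sketch of the labeling step is given below. The first ten labels follow the definition in the text; the rule used for the 11th label (set to 1 when at least one appointment falls in the non-target columns 11 to 19) is an assumption made for illustration, since the text only states that this label exists to guarantee non-zero label vectors.

```python
def make_labels(future_F, n_targets=10):
    """Label vector y for one window.

    future_F: 9 rows (weeks) x 19 columns (facility groups) of appointment counts.
    """
    n_cols = len(future_F[0])
    col_totals = [sum(row[j] for row in future_F) for j in range(n_cols)]
    # labels 1..10: at least one appointment in the j-th target facility
    y = [1 if col_totals[j] > 0 else 0 for j in range(n_targets)]
    # ASSUMPTION: label 11 marks appointments in any non-target facility group
    y.append(1 if any(total > 0 for total in col_totals[n_targets:]) else 0)
    return y
```

Since every kept window has at least one appointment in the following nine weeks, this choice always produces a non-zero label vector.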
In conclusion, the final data set handled by the classification methods is summarized in Tables 5 and 6.

Methods and metrics
This study aims to provide predictions on the possible appointments of a patient at any considered facility in the subsequent 2 months, starting from the patient's clinical history from the previous year. In particular, to validate the proposed predictive model, we compared it with some traditional methods, with particular attention to the multi-label nature of the problem. Wherever a Neural Network is deployed, the adopted back-propagation strategy is based on the BP-MLL method 10 [...] due to the known fact that their internal memory mechanisms work well only on sequential data without large "holes", i.e. without large portions of the input filled with zeroes. Furthermore, we did not use other traditional Machine Learning methodologies such as Support Vector Machines (SVM) and K-Nearest Neighbors (KNN), because they are not naturally suited to multi-label classification problems unless strategies like one-vs-one or one-vs-all are used.
Table 6. The output columns of the provided final data set for multi-label classification.
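For reference, the pairwise ranking error on which BP-MLL is based can be sketched for a single sample as follows. This is a sketch of the published BP-MLL error function, not of the authors' implementation; the function name is illustrative. The error is undefined when a sample has no positive or no negative labels, which is consistent with the need for non-zero label vectors mentioned in the data labeling step.

```python
import math

def bp_mll_loss(outputs, labels):
    """BP-MLL pairwise ranking error for one sample.

    outputs: real-valued network outputs, one per label.
    labels:  0/1 vector of the same length.
    """
    pos = [c for c, y in zip(outputs, labels) if y == 1]
    neg = [c for c, y in zip(outputs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("BP-MLL needs at least one positive and one negative label")
    # penalize every (positive, negative) pair ranked in the wrong order
    total = sum(math.exp(-(ck - cl)) for ck in pos for cl in neg)
    return total / (len(pos) * len(neg))
```

The error decreases as the outputs of the labels that are present exceed those of the labels that are absent by a larger margin.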
Proposed method. Our method is designed to explore and extract information from the temporal evolution of the medical booking history of the patients. In a previous work, Cheng et al. proposed Temporal Fusion Convolutional Neural Networks (TFCNN) 8 , which study the clinical history as a matrix. These Neural Networks encode the input matrices into smaller but more meaningful features with a concatenation of temporal discrete convolution and pooling operations; then, such reduced features are passed to a fully connected Neural Network, which combines them non-linearly. Such networks have been applied with good results to chronic obstructive pulmonary disease (COPD) risk prediction 18 , where, similarly to this case, the input is composed of the clinical history and other features that are characteristic of the patient. Due to the particular characteristics of our data set, Temporal Convolutional Neural Networks have to be used. The difference between the aforementioned work and this work is that our data set does not contain any clinical history, but only bookings to provisions; therefore, the model must be carefully adjusted. The result of this operation is a feedforward neural network model composed of a structured temporal convolutional layer and a multilayer perceptron. In detail, the Temporal CNN (TCNN) block takes as input a $t \times d$ matrix ($t = 52$ and $d = 8$ or $19$ in our application) and a set $S$ of $n_S$ sizes as parameters ($S = \{6, 9, \dots, 51\}$ in the experimental tests). For each $s \in S$, 64 filters with an $s \times d$ kernel operate on the input matrix to discover different patterns, obtaining 64 arrays of size $t \times 1$, to each of which a $2 \times 1$ pooling is applied. At this point, the resulting arrays of size $\frac{t}{2} \times 1$ are added to produce a single vector. The $n_S$ vectors obtained by varying $s \in S$ are then concatenated. Hence, the output of the TCNN block is a column vector of $n_S \cdot \frac{t}{2}$ components. Figure 3 summarizes the above process.
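The data flow of a single TCNN block can be illustrated with the following forward-pass sketch in plain Python. Several points are assumptions not stated in the text: the weights are random placeholders rather than trained values, zero padding in time is used so that each filter returns $t$ values, and no bias or activation function is applied. The sketch is meant only to make the shapes explicit.

```python
import random

def tcnn_block(X, sizes, n_filters=64, pool="max", seed=0):
    """Forward pass of one TCNN block on a t x d matrix X (list of rows).

    Returns a flat list of len(sizes) * t // 2 values.
    """
    rng = random.Random(seed)
    t, d = len(X), len(X[0])
    out = []
    for s in sizes:
        left = (s - 1) // 2
        # ASSUMPTION: zero-pad in time so that each filter yields t outputs
        padded = [[0.0] * d] * left + X + [[0.0] * d] * (s - 1 - left)
        summed = [0.0] * (t // 2)
        for _ in range(n_filters):
            # placeholder s x d kernel (random, untrained)
            kernel = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(s)]
            conv = [
                sum(kernel[a][b] * padded[i + a][b] for a in range(s) for b in range(d))
                for i in range(t)
            ]
            # 2 x 1 pooling over consecutive time steps
            pairs = [(conv[2 * i], conv[2 * i + 1]) for i in range(t // 2)]
            pooled = [max(p) if pool == "max" else sum(p) / 2 for p in pairs]
            # the pooled arrays of all filters are added into a single vector
            summed = [a + b for a, b in zip(summed, pooled)]
        out.extend(summed)  # concatenate the vectors obtained for each size s
    return out
```

With $t = 52$ and `sizes = range(6, 52, 3)` (the 16 sizes 6, 9, ..., 51), the block returns $16 \cdot 26 = 416$ values; a small `n_filters` keeps the pure-Python sketch fast.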
The convolutional layer of the proposed method is composed of four TCNN blocks, which apply a maximum or an average pooling to each temporal matrix. Next, by concatenating the output column vectors with the gender and age features, a vector of $4 \cdot n_S \cdot \frac{t}{2} + 2$ components is obtained, which becomes the input of a feedforward fully connected neural network with two hidden layers (of 128 and 32 perceptrons, respectively) and an output layer composed of 11 units, which generates the final output of the method. Figure 4 shows the pipeline described.
Evaluation metrics. Given that this study focuses on a multi-label classification problem, the models are evaluated in two ways: by using a cumulative metric for all the labels, or by applying a single metric to each label. In the former case, three metrics were used for the comparison: the Hamming Distance, the Exact Accuracy and the Top 10 Facilities Exact Accuracy, while in the latter case several metrics (accuracy, precision, recall, F1-score, and AUC) were estimated for each label. The Hamming distance $d_H$ represents the average number of missed labels for each window, i.e.
$$d_H = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=1}^{11} \left| y^{(n)}_j - \hat{y}^{(n)}_j \right|,$$
where $N$ is the number of windows and $y^{(n)}$, $\hat{y}^{(n)}$ are the true and predicted label vectors of the $n$-th window.

Figure 3. The proposed network model. The boxes in green represent the blocks of information in the input, i.e. $P$, $F$, gender, and age, where $P$ is a temporal matrix referring to the HS provisions and $F$ is a temporal matrix referring to the facilities. Gender and age are placed away from $P$ and $F$ to highlight the need to first extract hidden patterns from $P$ and $F$. In detail, from the left side: $P$ and $F$ are each passed through two parallel TCNN blocks, where each block is characterized by its pooling layer. The outputs of the four blocks are concatenated together with gender and age and passed through a fully connected neural network with two hidden layers. Each circle represents a dense neuron and the links provide the feedforward propagation from the previous neuron to the next neuron.
The Exact Accuracy is the fraction of windows for which an exact prediction occurs, where an exact prediction occurs if the vector of predicted labels $\hat{y}$ is equal to the vector of true labels $y$. Finally, the Top 10 Facilities Exact Accuracy is defined as the Exact Accuracy of the predicted output when the last label is ignored, since that label is just a workaround for the implementation of the BP-MLL algorithm in our proposed model.
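For a fixed label, the following quantities are computed from the confusion matrix:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}},$$
where $TP$ is the number of true positives, $FP$ is the number of false positives and $FN$ is the number of false negatives. The AUC score is defined as the area under the ROC curve; it measures how well the model can distinguish between two classes and is particularly informative when the labels are unbalanced.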
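The cumulative and per-label metrics named in this section can be sketched as follows, using the standard confusion-matrix definitions of precision, recall and F1-score. The function names are illustrative, and returning 0.0 when a denominator is zero is a convention assumed here, not one stated in the text.

```python
def hamming_distance(Y_true, Y_pred):
    """Average number of wrongly predicted labels per window."""
    wrong = sum(a != b for yt, yp in zip(Y_true, Y_pred) for a, b in zip(yt, yp))
    return wrong / len(Y_true)

def exact_accuracy(Y_true, Y_pred, n_labels=None):
    """Fraction of windows whose first n_labels labels all match (all if None)."""
    hits = sum(list(yt[:n_labels]) == list(yp[:n_labels])
               for yt, yp in zip(Y_true, Y_pred))
    return hits / len(Y_true)

def precision_recall_f1(Y_true, Y_pred, j):
    """Precision, recall and F1-score of label j (0-based index)."""
    tp = sum(yt[j] == 1 and yp[j] == 1 for yt, yp in zip(Y_true, Y_pred))
    fp = sum(yt[j] == 0 and yp[j] == 1 for yt, yp in zip(Y_true, Y_pred))
    fn = sum(yt[j] == 1 and yp[j] == 0 for yt, yp in zip(Y_true, Y_pred))
    # ASSUMPTION: undefined ratios are reported as 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

With 11-component label vectors, the Top 10 Facilities Exact Accuracy corresponds to `exact_accuracy(Y_true, Y_pred, n_labels=10)`, which ignores the last label.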

Results
In this section, our experimental results are shown. Firstly, the whole data set was considered, but only 37.5% of windows had at least one of the first ten labels equal to 1, i.e. at least one appointment in any of the facilities which we focused on. Therefore, before each run, we randomly reduced the remaining 62.5% of windows to a seventh of their number to increase the percentage of the positive cases for each label. Table 7 shows how the frequencies changed.
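The random reduction described above can be sketched as follows; the function name and the fixed seed are illustrative, and the real pipeline may differ in how the windows are stored.

```python
import random

def undersample(windows, labels, keep_fraction=1 / 7, n_targets=10, seed=0):
    """Keep every window with a positive target label and a random
    fraction of the windows whose first n_targets labels are all zero."""
    rng = random.Random(seed)
    positive = [i for i, y in enumerate(labels) if any(y[:n_targets])]
    negative = [i for i, y in enumerate(labels) if not any(y[:n_targets])]
    kept_negative = rng.sample(negative, round(len(negative) * keep_fraction))
    keep = sorted(positive + kept_negative)
    return [windows[i] for i in keep], [labels[i] for i in keep]
```

By simple arithmetic on the percentages reported above, keeping one seventh of the 62.5% of windows without target labels raises the share of windows with at least one of the first ten labels from 37.5% to about $37.5 / (37.5 + 62.5/7) \approx 80.8\%$.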
In each run, this new data set was split into a training set (90%) and a test set (10%) for RF, while it was split into a training set (72%), a validation set (18%), and a test set (10%) for MLP and our proposed method. Regarding this different splitting rule, we clarify that the validation set is only used in the early stopping criterion related to the Adam parameter optimization algorithm 19 . Due to long execution times, we did not apply any kind of semi-automatic search algorithm; instead, in a preliminary testing phase, the hyperparameters of each model were tuned by analyzing the results of several test runs. After the testing phase, we evaluated the considered methods by performing 15 different runs for each of them, which differ only in the random shuffling of the samples in the data set before each split. Tables 8 and 9 show that the proposed method outperforms the traditional methods according to all the considered metrics. Each listed measurement is in the form mean ± standard deviation, calculated from the results of all the runs. RF and MLP miss on average one label in every 14, while the proposed method misses on average less than one label in every 26 ($11/0.42087 \approx 26.14$).
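The splitting rule and the repeated runs can be sketched as follows. The helper name is illustrative, and using the run number as the shuffling seed is an assumption made only to keep the sketch reproducible.

```python
import random

def split_indices(n, seed, with_validation=True):
    """Shuffle n sample indices and split them into train / validation / test.

    With validation: 72% / 18% / 10%; without: 90% / 0% / 10%.
    """
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(0.10 * n)
    test, rest = idx[:n_test], idx[n_test:]
    if not with_validation:
        return rest, [], test
    n_val = round(0.18 * n)
    return rest[n_val:], rest[:n_val], test

# 15 runs that differ only in the shuffling of the samples
splits = [split_indices(1000, seed=run) for run in range(15)]
```

The 72% and 18% shares correspond to splitting the 90% of non-test samples in an 80:20 ratio.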
In particular, the Overall Recall of our method (Table 9) reveals that, even though the label frequencies are very low, the proposed model is capable of recognizing the positives more accurately than the other tested methods. Further investigations were carried out for each facility. Table 10 shows the accuracy of our method for each facility; it can be observed that the model produces a very high accuracy for every label. Once we had gathered evidence about the quality of the proposed model, we investigated the cases of patients who had not had any appointment in a particular facility $j$ in the 52 weeks of the window but had an appointment in the same facility $j$ in the following 9 weeks. The aim was to understand the capability of the proposed model to predict the first access of patients to a facility. Table 11 shows the accuracy for each facility; the mean accuracy on this specific set of patients is over 64% (±0.0156%).
As described above, starting from the previous year's clinical history, the proposed method attempts to predict if and in which facility a patient will have at least one appointment. Thanks to this feature, the method may be useful to the health authority, because it can provide a non-trivial lower bound on the patient attendance of each facility in terms of patient accesses. To assess this bound, we ran the proposed model 15 times in the same way as in the comparison with the other models, and then compared the total number of predicted positives with the real number of accesses to each facility in the test set. This was done to keep the statistical analysis coherent. Figure 5 illustrates a single run of the comparison; considering the average and the standard deviation over all the runs, the difference between the lower bound and the real amount is 16 ± 4% of the latter.

Table 10. Accuracy of the proposed model for each facility. The mean accuracy on the test set is over 97%, a further measure of the validity of the model. Even if it is outside our focus, we also verified the accuracy on the 11th label, and this proved to be similarly high (over 85%).
Accuracy 0.973 ± 0.001 0.972 ± 0.001 0.973 ± 0.002 0.981 ± 0.001 0.971 ± 0.002

Table 11. The table shows how the proposed method predicts the number of accesses of patients to a facility where they have never previously been; the overall accuracy is over 64%.

Conclusions
Healthcare technologies are not necessarily confined to patient management and care but can also have great significance for prevention. The greater the emphasis placed on preventive medicine rather than the mere treatment of symptoms, the more effective the healthcare system will be for everyone, both for the healthcare management and for the wider community. In this paper we have presented a Temporal Fusion CNN adapted for the multi-label problem described.
Starting from only administrative information about the previous year's clinical history of a patient, the goal of the model has been to predict if and in which facility the patient will book an appointment. Even though the administrative data do not store information about medical examination outcomes or reports, the experimental results support the quality of the model constructed. As future research, with additional information about patients and HS provisions, it would be interesting to evaluate the model predictions from the point of view of patient health, and not only from that of administration. In the latter part of the work, we have proposed a patient attendance lower bound for each facility as a service to the health authority. As a further future research project, supported by supplementary financial information, it may also be possible to provide statistical data to optimize the funding distribution of the regional health system.