Warfarin maintenance dose prediction for Chinese after heart valve replacement by a feedforward neural network with equal stratified sampling

Patients requiring low-dose warfarin are more likely to suffer bleeding due to overdose. The goal of this work is to improve the precision of a feedforward neural network model in predicting low maintenance doses for Chinese patients by changing how the training data are constructed. We built the model from a resampled dataset created by equal stratified sampling (maintaining the same number of samples in each of three dose groups, 3639 in total) and performed internal and external validations. Compared with the model trained on the raw dataset of 19,060 eligible cases, we improved the low-dose group's ideal prediction percentage from 0.7% to 9.6% and maintained the overall performance (76.4% vs. 75.6%) in external validation. We further built neural network models on single-dose subsets to investigate whether the subset samples were sufficient and whether the selected factors were appropriate. The training set sizes were 1340 and 1478 for the low-dose and high-dose subsets; the corresponding ideal prediction percentages were 70.2% and 75.1%. The training set size for the intermediate-dose subset was varied (1553, 6214, and 12,429); the corresponding ideal prediction percentages were 95.6%, 95.1%, and 95.3%. We conclude that equal stratified sampling can be a viable alternative approach to training data construction when building drug dosing models in the clinic.

www.nature.com/scientificreports/
nonlinear relationship between warfarin dosage and INR response is required, which could also be non-trivial for physicians but can be possibly formulated as computational models. Machine learning and artificial intelligence methods, in particular, artificial neural networks, have been making a significant impact on cardiovascular medicine 9,10. Artificial neural networks are data-driven, i.e., network parameters are "learned" by a "training" process from a dataset. When a neural network has several layers, it can produce nonlinear representations of the underlying data structure. If sufficient retrospective records of patients' warfarin therapy history are available, it is possible to set up a computational model of neural networks to represent the relationship between the dose requirement and patients' characteristics.
Attempts have been made in the community to develop computer-aided warfarin dosing strategies and have shown promising results [11][12][13][14]. Existing approaches based on neural networks are reliable in predicting warfarin maintenance doses for patients with medium-dose requirements but perform poorly for patients requiring a high dose and even worse for patients requiring a low dose [15][16][17][18].
Neural network parameters are learned from data, and therefore the performance can be affected by the training dataset. In a typical clinical setting, the population of patients that requires a medium warfarin dose is often the largest, and the population that requires a low dose is the smallest. Such a real-world dataset, if used directly, would likely lead to neural network models that perform more accurately in the medium-dose range and less accurately in the low-dose and high-dose ranges. Chinese patients are more sensitive to warfarin 19,20. The precision of warfarin maintenance dose prediction is of great clinical significance, especially for patients whose requirements potentially lie in the low-dose range.
In this work, we assess the possibility of improving the low-dose warfarin prediction accuracy of a feedforward neural network by "designing" its training dataset. We build the training set via stratified random sampling; that is, we alter the dosage distribution in the training dataset such that the low, intermediate, and high daily doses are in equal proportions. Then we train a feedforward neural network by gradient descent using this resampled dataset. We detail our approach in "Methods". Section "Results" is devoted to experimental results on a large multicentre database. A brief discussion and concluding remarks are given in "Discussion".

Methods
This study's protocol was approved by the Ethics Committee on Biomedical Research of West China Hospital of Sichuan University with the number 2020 (556). The study protocol was performed in accordance with the relevant guidelines. The informed consent was waived by the Ethics Committee on Biomedical Research of West China Hospital of Sichuan University with the number 2020 (556), given that this was a retrospective study.
Participants. The raw data were from a multicentre (45 hospitals) database, "Chinese Low-Intensity Anticoagulant Therapy after Heart Valve Replacement" (CLIATHVR), which consists of demographic information and regular anticoagulation monitoring records of 28,239 patients who underwent heart valve replacement and received warfarin anticoagulation therapy. The database was constructed and continuously updated from January 1st, 2011 to June 24th, 2016 (approved by the Ethics Committee of West China Hospital of Sichuan University with the number ChiECRCT-2011006).
We applied the following inclusion and exclusion criteria and used the resulting 19,060 patients' records. Inclusion criteria: (1) age over 18 years; (2) after heart valve replacement (including bioprosthetic valve and mechanical valve); (3) receiving oral warfarin only for anticoagulant treatment, with daily INR monitoring (INR measured once a day); (4) having reached a maintenance dose (the fluctuation range of the INR value was less than 0.2 units three times in succession, and the therapeutic range was 1.5-2.5) 21.
Exclusion criteria: (1) severe liver (the value of ALT or AST > 320 IU/L for male; ALT or AST > 224 IU/L for female) and kidney (the value of creatinine > 442 μmol/L or the value of urea nitrogen > 17.9 mmol/L) dysfunction before or after operation; (2) receiving aspirin, non-steroidal anti-inflammatory drugs, or other drugs affecting coagulation function; (3) suffering anticoagulation complications (e.g., thrombus, embolism, hemorrhage, and death) during anticoagulation therapy.
Variable selection. We want to model the nonlinear relationship between warfarin maintenance dose and its influential factors. We applied the general linear model (GLM) Univariate procedure and used P values and η² effect sizes as indicators to select potentially influential factors as input variables of the neural network model. The output variable was the warfarin maintenance dose, whose value was taken at the point where the fluctuation range of the INR value was less than 0.2 units three times in succession.
Training datasets and validation dataset. In this study, we prepared four subsets: Set A, Set B, Set C, and Set D. Set A and Set D were used for neural network training. Set B and Set C were used for internal and external validation, respectively. We first divided the 19,060 eligible cases into three datasets: Set A, Set B, and Set C. Set C was a holdout dataset for validation and consisted of the latest 1906 eligible cases recorded 22. The remaining 17,154 eligible cases were randomly divided into Set A and Set B at a ratio of 8:1, i.e., Set A of 15,248 cases and Set B of 1906 cases, achieved in R studio (R Pack Version 3.6.3, R Studio, R Core Team, 2014, Boston, MA, USA). Set C was selected directly by latest enrolment time, without randomization, to create a set whose patients differ in characteristics from those in Sets A and B.
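The split described above can be illustrated with a minimal Python sketch. The study itself reports using R; the function name `split_cases`, the `enrol_order` array, and the random seed are hypothetical, and the enrolment times here are synthetic placeholders rather than study data.

```python
import numpy as np

def split_cases(enrol_order, n_holdout=1906, ratio=(8, 1), seed=0):
    """Return index arrays (set_a, set_b, set_c) for one array of enrolment times.

    set_c: the latest `n_holdout` cases, taken chronologically (not randomized).
    set_a, set_b: a random split of the remaining cases at `ratio`.
    """
    order = np.argsort(enrol_order, kind="stable")
    set_c = order[-n_holdout:]
    rest = order[:-n_holdout]
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(rest)
    n_b = len(rest) * ratio[1] // sum(ratio)
    return shuffled[n_b:], shuffled[:n_b], set_c

# Synthetic enrolment times for 19,060 eligible cases (assumption for the demo).
enrol = np.arange(19060)
set_a, set_b, set_c = split_cases(enrol)
print(len(set_a), len(set_b), len(set_c))
```

With 17,154 cases remaining after the holdout, an 8:1 split leaves one ninth (1906) for Set B and the rest for Set A.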
We then resampled Set A to construct the training Set D. In China, the general dose of warfarin is 2.5 mg/day. Following the suggestions of clinicians and published research, the dose cut-off values were defined as 2.5 mg/day ± 0.25 × 2.5 mg/day (1.875 mg/day and 3.125 mg/day) 15.
Model construction. In this study, neural networks were built by the backpropagation algorithm, and all have three layers, i.e., one input layer of n nodes, one hidden layer of m nodes, and one output layer of l = 1 node. We denote the neural networks trained on Set A as plain neural networks (PNNs) and those trained on Set D as stratified sampling trained neural networks (SSNNs). The number n is equal to the number of input variables. The number m is set empirically, where α ∈ (0, 10), a natural number, is a tuning parameter. The optimal m and α are those that give the best prediction accuracy.
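The resampling of Set A into a dose-balanced training set can be sketched as follows, using the stated cut-offs of 1.875 and 3.125 mg/day. This is an illustration under assumptions, not the authors' procedure: the function names are hypothetical, the treatment of doses exactly at a cut-off is a guess (the text does not say which group they fall in), and the doses are synthetic.

```python
import numpy as np

LOW_CUT, HIGH_CUT = 1.875, 3.125  # mg/day, cut-offs given in the text

def dose_group(dose):
    """Map doses to 0 (low), 1 (intermediate), 2 (high).

    Assumption: a dose equal to a cut-off goes to the higher group.
    """
    return np.digitize(np.asarray(dose, dtype=float), [LOW_CUT, HIGH_CUT])

def equal_stratified_sample(dose, n_per_group=None, seed=0):
    """Draw the same number of cases, without replacement, from each dose group."""
    groups = dose_group(dose)
    rng = np.random.default_rng(seed)
    if n_per_group is None:
        # Assumption: default to the size of the smallest group.
        n_per_group = int(np.bincount(groups, minlength=3).min())
    picked = [
        rng.choice(np.flatnonzero(groups == g), size=n_per_group, replace=False)
        for g in range(3)
    ]
    return np.sort(np.concatenate(picked))

# Synthetic, imbalanced doses: few low, many intermediate, some high.
doses = np.concatenate([np.full(50, 1.25), np.full(800, 2.5), np.full(150, 3.75)])
idx = equal_stratified_sample(doses)
print(np.bincount(dose_group(doses[idx]), minlength=3))
```

Passing `n_per_group` explicitly gives control over the resampled size; the abstract's total of 3639 cases corresponds to 1213 per group if the three groups are equal.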

Model validation.
We compared the predicted warfarin maintenance dose with clinical data to assess model performance. The comparison metrics were the mean absolute error (MAE), the mean square error (MSE), and the ideal prediction percentage (i.e., the percentage of cases in which the absolute error between the predicted dose and the actual dose was within 20% of the actual dose).
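The three metrics defined above can be written compactly. The sketch below is illustrative only: the function name `dosing_metrics` is hypothetical, the example doses are made up, and counting an error of exactly 20% as "within 20%" is an assumption.

```python
import numpy as np

def dosing_metrics(actual, predicted, tolerance=0.20):
    """Return MAE, MSE, and the ideal prediction percentage.

    A prediction counts as ideal when |predicted - actual| <= tolerance * actual.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    abs_err = np.abs(predicted - actual)
    return {
        "mae": float(np.mean(abs_err)),
        "mse": float(np.mean(abs_err ** 2)),
        "ideal_pct": float(np.mean(abs_err <= tolerance * actual) * 100.0),
    }

# Made-up doses in mg/day, for illustration only.
actual = [1.25, 2.5, 3.75, 5.0]
predicted = [2.0, 2.75, 3.25, 3.5]
print(dosing_metrics(actual, predicted))
```

Because the tolerance scales with the actual dose, the same absolute error is judged more strictly for a low-dose patient than for a high-dose one, which is one reason the low-dose group is the hardest to predict ideally.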

Statistical analysis.
The t test was used to compare continuous variables in patient characteristics among different datasets, and the Wilcoxon rank-sum test was used when a variable did not meet the conditions for the t test. The χ² test was used to compare categorical variables. The Monte Carlo method was adopted if the conditions for the χ² test were not met (i.e., when more than 20% of the expected frequencies were less than five, or an expected frequency was less than one). All statistical tests were two-sided, and P values less than 0.05 were considered statistically significant.

Results
See Table S1 for details. The output variable was warfarin maintenance dosage. There were no statistical differences among the input variables in Sets A and B (P > 0.05), which is expected as the cases in these two sets were all sampled randomly.
The differences in all the selected input variables, except APPT, were statistically significant when comparing Set C with Set A (P ≤ 0.005). This is expected, as all 1906 cases in Set C are in chronological order and were set aside from the eligible cases. We therefore assume that Set C is representative and would be sufficient for model validation. (More statistical details on the data are listed in Table 1.)
Model construction. The learning rate, expected error, and number of training iterations were tuned to 0.1, 0.001, and 1000, respectively. The final model structure (for both PNNs and SSNNs) is shown in Fig. 1. All neural network models built in our study had eight inputs (n = 8) and one output (l = 1); the optimal value of m, the number of nodes in the hidden layer, varied across models (see Fig. 1).
Model validation. The differences between the internal validation dataset B and the external validation dataset C were not significant in either MAE or MSE. On dataset B, the difference in ideal prediction percentage between PNN and SSNN was not significant, nor was the difference in MSE or MAE, and all of them were less than 0.5 mg/day. Similar results were also observed on dataset C. The resampling process reduced the size of the training dataset but did not greatly affect the overall prediction precision (see Table 2).
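As a sketch of the three-layer architecture and the settings reported under Model construction above (eight inputs, one output, learning rate 0.1, expected error 0.001, 1000 training iterations), the NumPy code below trains a small network by full-batch gradient descent with backpropagation. Several parts are assumptions not confirmed by the text: the sigmoid hidden activation and linear output, reading "expected error" as a stopping threshold on the MSE, the weight initialisation, and the hidden-layer rule of thumb m = sqrt(n + l) + α used in `hidden_nodes`. The data are synthetic.

```python
import math
import numpy as np

def hidden_nodes(n_in, n_out, alpha):
    # Assumed common rule of thumb: m = sqrt(n + l) + alpha.
    return round(math.sqrt(n_in + n_out)) + alpha

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_three_layer(X, y, m, lr=0.1, target_mse=0.001, max_epochs=1000, seed=0):
    """Full-batch gradient descent on MSE; returns (params, mse_history)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, size=(X.shape[1], m))
    b1 = np.zeros(m)
    W2 = rng.normal(0.0, 0.5, size=(m, 1))
    b2 = np.zeros(1)
    y = y.reshape(-1, 1)
    history = []
    for _ in range(max_epochs):
        H = _sigmoid(X @ W1 + b1)        # hidden layer activations
        err = H @ W2 + b2 - y            # linear output minus target
        mse = float(np.mean(err ** 2))
        history.append(mse)
        if mse <= target_mse:            # assumed meaning of "expected error"
            break
        g_out = 2.0 * err / len(X)       # gradient of MSE w.r.t. the output
        g_z = (g_out @ W2.T) * H * (1.0 - H)  # backpropagate through the sigmoid
        W2 -= lr * (H.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_z)
        b1 -= lr * g_z.sum(axis=0)
    return (W1, b1, W2, b2), history

def predict(params, X):
    W1, b1, W2, b2 = params
    return (_sigmoid(X @ W1 + b1) @ W2 + b2).ravel()

# Synthetic standardized inputs and doses (mg/day), for illustration only.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 8))
y_demo = 2.5 + 0.5 * X_demo[:, 0] - 0.3 * X_demo[:, 1]
m_demo = hidden_nodes(8, 1, alpha=2)
params, history = train_three_layer(X_demo, y_demo, m_demo)
pred = predict(params, X_demo)
print(m_demo, history[0], history[-1])
```

Under the assumed rule, n = 8 and l = 1 give sqrt(9) = 3, so α alone determines how many hidden nodes are added; training a PNN or an SSNN would then differ only in which rows are passed as `X` and `y`.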
A closer look at these results on Sets B and C revealed that: (1) when cases lay in the medium-dose range, both PNN and SSNN predicted best (ideal prediction percentages were around 90%); (2) when cases lay in the low-dose range, the dose predicted by PNN was overestimated (the prediction accuracy was 0.0% on Set B and 0.7% on Set C), and SSNN performed better than PNN (the prediction accuracy was 8.7% on Set B and 9.6% on Set C); (3) when cases lay in the high-dose range, prediction by SSNN was also superior, with an improvement from 24.3% to 27.6% on Set B and from 19.8% to 24.4% on Set C. See Table 3 and Fig. 2 for details.
We further trained more PNNs individually on single-dose subsets of the eligible cases. The training set of each of these models was drawn from a single-dose subset and thus contained patients from one dose group only. Specifically, for the low-dose subset and the high-dose subset, the training sample numbers were 1340 and 1478, and the ideal prediction percentages were around 70% (66.7% and 73.5% in internal validation; see Tables 4 and 5). These findings suggested that the training samples were adequate and that the factors selected for building the models in this study were appropriate.
Table 1. Statistics on the factors in the training, internal validation, and external validation datasets (i.e., Sets A, B, and C). APPT, activated partial thromboplastin time; SD, standard deviation; Saturated dose, dose ranging from 5 to 10 mg/day; General dose, dose ranging from 2.5 to 5 mg/day. a Starting time of anticoagulation (X days after surgery) is shown as the median because of its right-skewed distribution. Continuous variables were analyzed using the independent-sample t test; categorical data were analyzed using the chi-square test. The starting time of anticoagulation was analyzed using the Wilcoxon rank-sum test; the type of disease was analyzed using the Monte Carlo method.

Discussion
In this work, we have assessed the predictive performance of a neural network built from a stratified training dataset for low-dose warfarin therapy. We have shown that the dose distribution of the training cases can affect the low-dose predictive performance of a feedforward neural network. When the training dataset was built by random stratified sampling so that the three subgroups (low-dose, intermediate-dose, and high-dose) contained the same number of cases, the low-dose ideal prediction percentage of the neural network increased from 0.0% to 8.7% on internal validation and from 0.7% to 9.6% on external validation. This suggests that a feedforward neural network can predict low-dose warfarin requirements more accurately when its training dataset is resampled. Several works have demonstrated the possibility that algorithms based on machine learning can predict warfarin dose requirements more precisely 18,[23][24][25]. However, given that their sample sizes are relatively small, the patients might not represent all warfarin users 15,24,25. These works also usually focus on overall predictive performance, with limited further discussion of subgroups, and they did not perform external validation 24,25. The studies by Qian et al. 23 and Tao et al. 18 indicate that predictive accuracy can differ across subgroups, but no specific solution was mentioned.
We carried out equal random stratified sampling on the dataset to obtain the training data 26. We observed an improvement in low-dose predictive performance (there was also an increase in the high-dose subgroup), while the alteration of the data distribution also affected the predictive precision of the intermediate-dose subgroup, with a loss of 3.0%. This drop was possibly caused by the reduction in the training set. However, additional experiments carried out separately on each subgroup demonstrated that a training set of over 1000 cases was sufficient to obtain high predictive precision (e.g., 95.6% with 1553 cases; see Tables 4 and 5 for more details), which also means that the eight factors characterize patients well for warfarin maintenance dose prediction, especially for cases in the intermediate-dose range.
We further compared the three models trained on the three subgroups with the same relatively small number of cases. We noticed that all three models predict more precisely, and the predictive performance of the model trained on the intermediate-dose cases was better than that of the other two. We expect that a model built via multi-task learning would perform better in terms of overall prediction accuracy and subgroup dosing precision.
It is reported that the genes CYP2C9 and VKORC1 can account for about 40% of the variation in warfarin dose requirements 27. The International Warfarin Pharmacogenetics Consortium (IWPC) dosing algorithm also indicated that genetic data could improve a model's predictive performance 28. We assume that including genotype factors may further improve our prediction precision for the low-dose and high-dose cases, since genotype difference could be a primary contributor to the nonlinearity of the INR dose-response curve. At the time of this study, we did not have sufficient cases with genotype information; further work is needed to investigate whether genotypes have an equal effect on these three dose ranges.
Besides genotypes, we did not consider dietary factors either, because of the difficulty of data collection and the vast dietary differences across the regions where the 45 hospitals are located. The database lacked these factors. However, this kind of data reflects typical patients in China, where cost-effectiveness is also an important issue to consider 29,30. Some factors such as age, gender, and weight are important in clinical medicine. We initially considered and added age, gender, height, weight, ALT, and AST to the model, but found that they did not noticeably improve predictive performance; we provide more experimental details on this issue in Table S2. For the sake of model simplicity, we did not include these factors in the model. Our data-driven model could supplement clinical experience for clinicians in medication decision making. As for limitations in the modeling, we restricted the neural networks to three layers for simplicity; we have not fully explored multi-layer structures, multi-task learning 31, or other neural network models. Drug interaction is a critical factor influencing INR change.
To explore warfarin anticoagulation therapy suitable for Chinese patients and to eliminate the influence of other drugs on the judgement of efficacy, the database excluded patients who received drugs affecting coagulation function and collected little information on drug interaction, so we did not include this factor in the model. We will collect drug interaction data in our future studies.
Our study indicated that the warfarin dose distribution in the training data affected a neural network's predictive performance on the low-dose group. We expect that a genotype-guided version of our approach would predict more precisely. Further work is also needed to explore whether performance can be improved by an additional increase in the low-dose and high-dose population, e.g., through a repeated random stratified sampling process. Recent work on generative adversarial networks also indicates another possibility for data augmentation 32.