A machine learning approach to personalized dose adjustment of lamotrigine using noninvasive clinical parameters

The pharmacokinetic variability of lamotrigine (LTG) plays a significant role in its dosing requirements. Our goal here was to use noninvasive clinical parameters to predict the dose-adjusted concentrations (C/D ratio) of LTG based on machine learning (ML) algorithms. A total of 1141 therapeutic drug-monitoring measurements were used, 80% of which were randomly selected as the "derivation cohort" to develop the prediction algorithm, and the remaining 20% constituted the "validation cohort" to test the finally selected model. Fifteen ML models were optimized and evaluated by tenfold cross-validation on the "derivation cohort,” and were filtered by the mean absolute error (MAE). On the whole, the nonlinear models outperformed the linear models. The extra-trees’ regression algorithm delivered good performance, and was chosen to establish the predictive model. The important features were then analyzed and parameters of the model adjusted to develop the best prediction model, which accurately described the C/D ratio of LTG, especially in the intermediate-to-high range (≥ 22.1 μg mL−1 g−1 day), as illustrated by a minimal bias (mean relative error (%) =  + 3%), good precision (MAE = 8.7 μg mL−1 g−1 day), and a high percentage of predictions within ± 20% of the empirical values (60.47%). This is the first study, to the best of our knowledge, to use ML algorithms to predict the C/D ratio of LTG. The results here can help clinicians adjust doses of LTG administered to patients to minimize adverse reactions.

Lamotrigine (LTG) is a second-generation antiepileptic drug used commonly for monotherapy or adjunctive therapy for focal seizures, absence seizures, and generalized tonic-clonic seizures 1 . The drug has also been approved as a mood stabilizer for bipolar disorder to prevent mood relapses 2 . It is frequently associated with cutaneous adverse drug reactions ranging from mild maculopapular eruption to severe Stevens-Johnson syndrome and toxic epidermal necrolysis 3 . The incidence of toxicity increases in definite relation to rising concentrations of LTG 4 .
Pharmacokinetic variability, which can be influenced by age, pregnancy, drug-drug interactions, and concurrent diseases, plays a significant role in the dosing requirements for LTG 5 . Therapeutic drug monitoring (TDM), as an essential part of personalized medicine, is valuable for adjusting doses of LTG, and is especially recommended when other co-administered drugs that induce or inhibit the metabolism of LTG are prescribed or discontinued in the treatment regimen 6 . Thus, patients characterized by a combination of environmental and biological factors may benefit from TDM to adjust doses of LTG to minimize adverse reactions.
Dose-adjusted concentrations (C/D ratio), namely the ratio of drug concentration to dose under steady-state and trough conditions, is a parameter that can be easily obtained from TDM data. It can be calculated by dividing the trough steady-state concentration of a drug by the dose that the patient is taking 7 . A high C/D ratio indicates slow drug clearance while a low C/D ratio implies the reverse 7 . The C/D ratio has proven to be a valuable tool to facilitate dosing adjustment because it can be used to identify possible associations between adverse drug reactions and pharmacokinetic parameters 8 , analyze pharmacokinetic variability 9 , measure non-adherence to medication 10 , detect drug-drug interactions 11,12 , distinguish pharmacogenetic phenotypes (e.g., poor or ultrarapid metabolizers) 12,13 , and estimate the dose required to achieve the desired concentration of the drug 13 .
Machine learning (ML) is defined as a field of study that enables computers to learn without being explicitly programmed 14 . It is a type of artificial intelligence that gives systems the ability to analyze a vast range of data collected from electronic health records (EHRs) and automatically learn from them using advanced statistical and probabilistic techniques to make more accurate predictions by constructing intelligent and effective predictive models 15 . Research in ML in the context of clinical medicine has revealed exciting advances in recent years, for example, the classification of medical images into diagnostic categories, detecting whether there are metastases on histological sections, and the automated segmentation of radiological images into known anatomical correlates to reduce diagnostics-related workload 16 . The application of ML to clinical drug therapies has garnered considerable research interest in recent years, and is playing an increasingly important role in the development of personalized dosing, especially in drug dose selection 17 . A few studies have been published on the application of ML to predict either drug doses or blood concentrations [18][19][20][21][22][23] . Jovanović et al. 18 explored the application of ML as an alternative to pharmacokinetics analysis. A summary of the ML algorithms and drugs used, the purposes of prediction, sample sizes, and results are shown in Table 1.
The traditional pharmacokinetic analysis is based on mathematically simple techniques, for example, calculations of the area under the concentration-time curve (AUC). However, if the data are insufficient or cannot support a pharmacokinetic modeling approach, the model is inaccurate 24 . Recent years have seen growing interest in new statistical approaches, such as population pharmacokinetic (popPK) analysis, which can be used to extract useful information from sparse data 25 . A variety of programs are available for population modeling, of which nonlinear mixed-effects modeling (NONMEM) is the most widely used to analyze pharmacokinetic data 26 . However, building a popPK model usually requires understanding and choosing various mathematical models (e.g., a pharmacokinetic structure model related to dose, a population model for inter-subject variability, and a variance model for random residual variation), where this process is time consuming 27 . Furthermore, adding or removing a parameter may also be complicated owing to the explicit analytical model used 24 . In contrast, ML is known for its self-organizational and learning capabilities, which enables computers to learn from "experience" without being explicitly programmed 27 . Many previous studies have reported that ML algorithms, such as artificial neural networks (ANNs), have accuracies equivalent to or even higher than the NONMEM method 27,28 . Poynton et al. 29 built an ensemble model by combining ANNs with NONMEM that was more accurate than either method.
ML algorithms fall into three categories: (1) supervised, (2) unsupervised, and (3) reinforcement learning. Supervised learning is used to predict classes or labels on unlabeled input data based on previously labeled data 30 . To date, the application of ML algorithms to predict the C/D ratio of LTG has not yet been reported. Therefore, in this study, we compare the performance of 15 supervised ML algorithms to identify the best one Table 1. Summary of several relevant studies on models to predict either drug dosage or blood concentration. CPANNs counter-propagation artificial neural networks, RT regression tree, BART Bayesian additive regression trees, MARS multivariate adaptive regression splines, SVR support vector regression, RFR random forest regression, CRT classification and regression tree, RE relative error, RMSE root mean squared relative error, MAE mean absolute error. a Therapeutic drug monitoring (TDM) measurements, with the number of patients in the parentheses. b The ideal rate was defined as the percentage of patients for whom the predicted doses were within 20% of the actual stable therapeutic doses of tacrolimus. c The percentage within 20% was defined as the percentage of patients for whom the predicted doses were within 20% of the actual stable therapeutic doses of warfarin. d Data was obtained from the International Warfarin Pharmacogenetics Consortium (IWPC) open access dataset. e The correct value (%) was defined as the percentage of patients for whom the predicted dose adjustment correctly fell within the smallest increment in doses of levothyroxine (12.5 mg). in terms of predicting the C/D ratio of LTG in Chinese patients based on noninvasive clinical parameters. We used clinical data to improve therapeutic efficacy while minimizing adverse effects. We also investigated clinical factors associated with the C/D ratio of LTG, and designed an easy-to-use web application as a real-time assisting clinical decision support tool for personalized adjustment to doses based on the proposed noninvasive predictive model. This is in the context of the costly implementation of the TDM because of the dedicated staff and equipment required 31 , which makes it unaffordable as a routine procedure in most hospitals in developing countries like China.

Results
Basic characteristics of the entire dataset. In the dataset used, the mean values of the serum concentration and daily dose of LTG were 5.52 (range: 0.50-18.55) µg/mL and 171 (range: 12.5-500) mg/day, respectively. The mean value of the calculated C/D ratio parameter was 35 (range: 4.0-147) μg·mL −1 ·g −1 ·day. The histogram and quantile-quantile (Q-Q) plot of the C/D ratio are shown in Supplementary Fig. S1a,c, indicating that it was normally distributed. The distribution of features in the entire dataset is shown in Table 2. None of the variables had a missing rate (i.e., the percentage of missing values) above 50% in our dataset. Note that the daily dosage of a drug was assumed to be zero if it had not been taken. A heatmap of Pearson's correlations between the C/D ratio and the variables was shown in Fig. 1.

Imputation evaluation.
A total of 184 missing data points were imputed. An overall comparison of models using the entire dataset and the omitted dataset is shown in Table 3. When all features and default parameters of the models were used, the nonlinear models outperformed the linear models on the whole. The imputation generated a bias in both linear and nonlinear models. However, when the results of t-tests of the training set were omitted, the values of the mean absolute error (MAE) in the test set showed no statistical significance between the entire dataset and the omitted dataset in most nonlinear models, indicating that the imputation had a smaller impact on the nonlinear models. Thus, the entire dataset was considered in the modeling study.
Comparison of performance of models. An overall comparison of the optimized predictive regression models in the derivation cohort of the entire dataset is shown in Table 4. The grid search-based parameter optimization for each ML algorithm is listed in Supplementary Tables S1-S15. Overall, the nonlinear models were superior to the linear models. All nonlinear models yielded lower MAE values than the multiple linear regression (MLR) model (statistically significant). The t-tests for the MAE in the test set showed that linear models delivered the same performance at a 95% confidence interval (CI). Among the best nonlinear models, the extra-trees' regression (ETR), k-nearest neighbor regression (KNR), bagging regression (BR), random forest regression (RFR), XGBoost regression (XGBR), gradient-boosted regression (GBR), and decision tree regression (DTR) models yielded similar values of the MAE in the test set, where the differences among them were not statistically significant (range: 9.4-10.1 μg·mL −1 ·g −1 ·day). The ETR algorithm, a tree-based ensemble algorithm, was chosen to establish the predictive model.
Optimizing ETR model. The ETR algorithm was employed to select important features. A tree-based method and forward feature selection were applied to compute feature importance and discard irrelevant features. Figure     To determine the optimal feature set, we identified the point at which there was no considerable change in the decline in MAE in the test set when a feature was added to the model. The results of the forward feature selection method showed that the MAE decreased gradually at first, as expected, and then tended to minimum values when the following features were used: daily dosage of VPA, age, BW, IND, and sex (male) (see Fig. 2b).
We also adjusted the key parameter n_estimators, which can be considered to be the number of trees, as it had the most significant influence on the performance of the ETR algorithm but did not affect the complexity of any model. Based on the method used to choose the forward parameter using tenfold cross-validation, the maximum mean cross-validation score (0.5093) was achieved on the test set when the parameter n_estimators (default value = 10) was increased to 31 (see Fig. 2d). Adjustments to other key parameters yielded higher scores when n_estimators was set to 31 by using tenfold cross-validation-based grid search (see Supplementary  Table S1). Below are the main parameters that were optimized for our final ETR model: (1) 'n_estimators': 31; (2) 'max_depth': 20; (3) 'min_samples_leaf ': 1; (4) 'min_samples_split': 6; and (5) 'max_features': 'auto' .
We then compared the predictive performance of the ETR model before and after feature selection and parameter adjustment on the derivation cohort. The optimal ETR model enhanced generalization by reducing the overfitting yielded a lower MAE on the test set (not statistically significant, P = 0.106), but obtained a higher MAE on the training set (statistically significant, P < 0.001) than the ETR model before optimization.
Validation and assessment of ETR model. Finally, to determine the overall predictive performance of the chosen prediction model, three indices, namely the MAE, mean relative error (MRE) (%), and percentage within 20%, were applied to the validating cohort. As is shown in Table 5, the predictive model was able to accurately describe the C/D ratio of LTG, especially in the intermediate-to-high range (≥ 22.1 μg mL −1 g −1 day), as illustrated by a minimal bias (MRE (%) = + 3%) and good precision (MAE = 8.7 μg mL −1 g −1 day). About 60.47% of all relative errors versus the observed C/D ratio ≥ 22.1 μg mL −1 g −1 day were in the ± 20% range. Overall, the model provided more accurate predictions in the intermediate and high ranges of the C/D ratio than its low range (see Fig. 3).
We compared the effect of the predictive performance of the ETR models after optimization on the derivation cohort in the context of modeling the C/D ratio and its log 10 -transformed parameter [log 10 (C/D ratio)] when Table 4. The mean absolute error (MAE) at confidence intervals (CI) of 95% for the prediction of doseadjusted concentrations of lamotrigine in the derivation cohort in the entire dataset for optimized regression models. MLR multiple linear regression, RidgeR ridge regression, LassoR lasso regression, LSVR linear-support vector regression, ETR extra-trees' regression, KNR k-nearest neighbor regression, BR bagging regression, RFR random forest regression, XGBR XGBoost regression, GBR gradient boosted regression, DTR decision tree regression, SVR support vector regression, NuSVR nu-support vector regression, MLPR multi-layer perceptron regression, ABR AdaBoost regression. *P values for the training set (the left side of the semicolon) and the test set (the right side of the semicolon), respectively.  Supplementary Fig. S1b,d). The grid search-based parameter optimization for the ETR model in the context of modeling log 10 (C/D ratio) is listed in Supplementary Table S16. Overall, the performance of the ETR model in terms of predicting the C/D ratio was superior to that in terms of predicting log 10 (C/D ratio) in the intermediate and high ranges of the C/D ratio (see Table 5).

Discussion
In addition to the popPK method 32 , other data-driven techniques have been used to predict the dose and/or concentration of LTG. For instance, Nakamura et al. 33 reported significant linear relationships between the LTG concentrations in week 2 and those at week 8 for Japanese patients of depression, some of whom were being administered VPA while the others were not. They estimated an optimal dose of LTG by building regression equations. Yamamoto et al. 34 studied factors influencing LTG concentrations in Japanese patients of epilepsy   www.nature.com/scientificreports/ by stepwise multiple regression analysis that enabled them to estimate the LTG concentration by applying the coefficients of these factors to a multiple regression model. However, to the best of our knowledge, scarcely any study has used ML methods to predict the C/D ratio, where this is a key parameter for analyzing pharmacokinetic abnormalities 7 . Our study, therefore, fills this gap in the literature. Our present work here has improved on these abovementioned studies, as reflected in the following: First, even though datasets related to the LTG have been explored in these studies, the TDM measurements were not adequate. Moreover, these datasets did not include the trough concentrations used to calculate the C/D ratio or the co-administered drugs that induce the hepatic drug-metabolizing enzymes. However, our dataset was based on Chinese population, and had a larger number of samples of trough LTG concentrations and more comprehensive variables (including age, gender, BW, and IND). Second, unlike past work that has focused on linear regression models, our modeling study considered both linear and nonlinear regression models. ML can help detect the nonlinear and complex interactions between variables by minimizing the error between the predicted and the measured values 35 . Third, our models were trained and tested on the derivation cohort, and validated on another validation cohort randomly split from the entire dataset. Such external validation verifies the generalizability of our models to new data 36 . Finally, the superiority of ML in terms of developing predictive models becomes clearer on larger populations with greater numbers of predictors. The relevant factors may not be recognized owing to missing values. However, the ML method involves imputing these missing values 35 .
No one ML algorithm is the most accurate in all cases, and thus comparisons of ML algorithms in different fields of research, and on different datasets, may yield different results 37 . In this study, a comparison of the regression models showed that other linear models performed similarly on the test set to the MLR model, a frequently used and simple statistical model that makes it easy to interpret the variables 38 . A possible explanation for this is that these linear models had the same optimal feature set (see Supplementary Tables S12-S15), and no multi-collinear relationship was obtained between the features (see Fig. 1). Our results also indicate that the nonlinear models were statistically significantly stronger than the MLR model, which can be in part attributed to the weak linear correlations between consecutive parameters (see Fig. 1) and the high signal-to-noise ratio 35 . Compared with linear models, the nonlinear models did not require linear relationships between the parameters. Moreover, many factors affect the outputs, particularly in nonlinear, dynamic disease states. Therefore, the dataset may inevitably contain noise that can affect linear regression models more than nonlinear regression models 38 .
Better features mean flexible, simple models that yield good predictive results. A total of five patient variables were identified by the variable selection process of the extra-trees' algorithm to develop the noninvasive predictive model, daily dosage of VPA, age, BW, IND, and sex, where the daily dosage of VPA and IND were identified as the most important positive and negative predictors, respectively. Uridine glucuronosyltransferases (UGTs) play an essential role in the metabolism of LTG. LTG dosage in adjunctive therapy is usually dependent on the interactions of LTG with co-administered drugs. An enzyme inhibitor is a type of drug that binds to the enzyme and reduces its metabolic activity. As a broad-spectrum enzyme inhibitor, the VPA can inhibit the metabolic activities of many hepatic drug-metabolizing enzymes to different extents 39 . An IND is a type of drug that binds to the enzyme, activates it, or increases its gene coding expression, and then increases its metabolic activity. Previous studies have reported that IND induces the metabolism of LTG, increases its clearance, and lowers its serum levels by 34-52%, whereas VPA inhibits the metabolism of LTG, decreases its apparent clearance by 38.5%, and raises its serum levels twofold [40][41][42] . A previous study revealed that the clearance of LTG increases to a maximum at 36 years and then gradually decreases in adult patients 43 . In general, in agreement with the earlier study, we found that the data points with a high C/D ratio were more likely to be distributed among younger or older patients (see Supplementary Fig. S2a), and the kernel density analysis showed that the data points with a low C/D ratio were concentrated in patients 20-30 years of age (see Supplementary Fig. S2b). Due to the difference in the gene coding expression or metabolic activity of the UGTs, sex may influence the pharmacokinetics of LTG 44 . VPA is known to have endocrinal side effects, and is likely to affect fertility 45 . Thus, young women likely tend to use other drugs instead. In the context of the co-administration of the VPA, the TDM measurements for men were 2.3 times those for women in our study (see Supplementary Fig. S2c). This can in part explain the weakly positive correlations between the C/D ratio of LTG, and age and sex (male) in our study. Brzaković et al. 44 reported that BW may influence the interactions of LTG, and overweight patients may be less susceptible to these interactions. BW was identified as another important predictor for our model but showed an insignificant correlation with the C/D ratio via Pearson's correlation analysis. This might be due to their nonlinear relationship (see Supplementary Fig. S2d), as Pearson's correlation coefficient is sensitive to only linear relationships.
The prediction model also showed a much higher variance at low C/D ratios, possibly due to the overfitting problem in the training set, as well as a greater influence of uncontrolled factors like genetic effects and TDM measurement errors. Even so, our prediction model in general performed better in the intermediate and high ranges of the C/D ratio (lower MAE and MRE (%), and a higher percentage within 20%). Moreover, patients in these ranges were more likely to benefit more from our model because a high C/D ratio indicated slow drug clearance 7 ; hence, such patients may suffer from overdose and toxicity more easily. The analysis revealed that multiple factors (age, gender, BW, and co-administered drugs) had an impact on the C/D ratios of LTG. However, not all these factors were considered in the recommended dosing regimens according to the latest package inserts of LTG approved by the National Medical Products Administration (NMPA). For instance, the recommended usual maintenance dose for LTG monotherapy in patients of epilepsy older than 12 years is 100-200 mg/day, while a strict rate of dose escalation, starting at low doses, is recommended owing to the possibility of life-threatening rashes. To promote the clinical applications of our model, we designed an easy-to-use web application for personalized dose adjustments by allowing prescribers to input the characteristics of patients to estimate their LTG dosing needs. For example, according to the latest Arbeitsgemeinschaft für Neuropsychopharmakologie und Pharmakopsychiatrie (AGNP) guideline for TDM in neuropsychopharmacology 7 , the recommended therapeutic reference ranges for LTG are 3-15 μg/mL as an anticonvulsant drug and 1-6 μg/mL as a mood-stabilizing drug. www.nature.com/scientificreports/ Given that the C/D ratio was predicted to be 60 μg mL −1 g −1 day, an estimated maintenance dose of 50 mg/day (= 3/60 × 1 000 mg/day) was required for adult patients of epilepsy to reach 3 μg/mL, and 250 mg/day (= 15/60 × 1 000 mg/day) to reach 15 μg/mL (assuming good therapeutic responses) 7 . If the predicted C/D ratio was assumed to be 80 μg mL −1 g −1 day due to changes in the patients' BW, ignoring problems of adherence and drug-drug interactions, estimated maintenance doses of 37.5 mg/day (= 3/80 × 1000 mg/day) and 187.5 mg/day (= 15/80 × 1 000 mg/day) were required to reach the target concentrations of 3 μg/mL and 15 μg/mL, respectively. Thus, unlike the dose recommended for most patients in the package inserts, this application can tailor doses for specific patients. Further, our model can be implemented within electronic health systems. Thus, once the TDM has been implemented, it automatically crawls the information on the characteristics of patients using their EHRs as well as the TDM measurements. Such implementation would make it easier to increase the size and variety of the dataset, in which case the ML algorithm has enough data to learn to generate robust predictions. More importantly, these designs, particularly the ML-based algorithm, can be applied to other drugs with the use of TDM highly recommended as a clinical routine for dose optimization. A snapshot of this application, and the model's self-learning and optimizing process are depicted in Fig. 4. Despite the promising results, there is room to improve the predictive models overall. Several key limitations of this study should be noted. First, the sample size used in our model was small, and a much larger number of samples is needed for further development and evaluation. In future work, using the abovementioned web application, we should be able to collect and store data more efficiently, and use big data to optimize the prediction model. Second, owing to the use of retrospective data instead of prospective in our study, certain uncontrollable factors were unavoidable. For example, the C/D ratios might have decreased due to nonadherence 46 . The inaccurate timing of the blood collection might also have led to the fluctuation in LTG concentration, especially in the case of the co-administration of IND 47 . The influence of inevitable experimental errors on the C/D ratios, especially its low range, could not be accurately evaluated either. Finally, the variables used in our model, though readily available, are all non-genetic factors. The effects of genetic polymorphisms on LTG pharmacokinetics have been widely explored in previous studies, but have yielded some conflicting findings 48,49 . Thus, more variables (e.g., genetic factors) needed to be included in the model. Despite these limitations, by using only noninvasive clinical parameters, the proposed model has the potential to help adjust doses of LTG and minimize adverse reactions. However, note that it is not an all-powerful tool in clinical practice because complex interactions (e.g., gene-environment) play a major role in many disorders, such as neurological and psychiatric disorders 50 . Hence, even though our model has achieved promising results, the best way of personalizing LTG doses is to monitor the serum concentration of LTG and attend to its clinical response.

Conclusion
The results here proved the feasibility of ML algorithms, especially the nonlinear model, for predicting the C/D ratio of the LTG and identifying important predictors. By using the extra-trees' regression algorithm (one of the best nonlinear models) and noninvasive clinical parameters (including the daily dosage of the VPA, age, BW, IND, and sex), the proposed predictive model delivered good performance, especially in the intermediate-to-high range of the C/D ratio of LTG (≥ 22.1 μg mL −1 g −1 day). Furthermore, an easy-to-use web application based on the proposed predictive model was designed as a real-time assisting clinical decision support tool to help with the personalized dose adjustment of LTG, thus improving its therapeutic efficacy while minimizing its adverse effects. Finally, the application of the proposed predictive model requires further improvement overall.  The requirement for informed consent was waived by the Ethics Committee of the Affiliated Brain Hospital of Guangzhou Medical University that approved the study, owing to the retrospective nature of the analyses. This study was carried out in compliance with the guidelines of the Helsinki Declaration. TDM measurements of LTG were considered for enrollment. If the patient was in a steady state (usually therapy with a stable dose for at least four to six half-lives 7 , trough sampling was performed before the next dosage, and the concentrations were not below the lower limit of quantification (LLQ). Each of the 1141 C/D ratios calculated, along with its corresponding covariates (including demographic and clinical information), was treated as a new input (i.e., feature) and output (i.e., label) data pair. The categorical covariates included sex (male or female) and the co-administration of IND (yes or no). The continuous covariates included age, BW, and daily dosages of co-administered drugs (including VPA, CBZ, OXC, PB, and PHT) when the concentration of LTG was determined.
Bioanalysis. The serum concentration of LTG was determined by a validated, analytical, high-performance liquid chromatography (HPLC) method that included an HPLC system (Agilent 1260; Agilent Technologies, Inc., Santa Clara, USA). The serum samples were extracted by methanol as protein precipitators. Pirfenidone was taken as the internal standard. Eclipse Plus C 18 column (150 mm × 4.6 mm, 5 μm) was used with the mobile phase of methanol: water (5 mmol/L ammonium formate) = 52: 48 (v/v) at a flow rate of 0.8 mL/min, a column temperature of 40 °C, detection wavelength of 310 nm, and injected volume of 10 μL. The standard curve was linear (r 2 = 0.9933), and ranged from 0.5 μg/mL to 20 μg/mL. The LLQ was validated at 0.5 μg/mL. The mean extraction recoveries of the LTG from plasma at three quality control (QC) levels were found to be over 91.24%, whereas the intra-and inter-day precisions were both less than 11.54%. ML strategies. We selected clinical factors related to the C/D ratio of LTG, and built a prediction model to adjust its doses and minimize adverse reactions. Our ML process can be divided into five steps: (1) data preprocessing, (2) imputation evaluation, (3) model optimization, (4) model selection, and (5) model validation. The overall modeling process is illustrated in Fig. 5.
Data preprocessing. The process of data cleaning and transformation is intended to solve problems of missing values and data normalization. Before analysis, summary and descriptive statistics were obtained, with a review of the missing values. If the rate of absence of a variable in the original dataset was above 50%, it was removed 51 .
The missing values needed to be computed before the data were input into the model. We filled in the missing values by imputation using the k-nearest neighbor (KNN) method with k = 3 51 . Data normalization was also performed to obtain data of higher quality. Categorical variables were coded using "one-hot" encoding, and continuous variables were rescaled using min-max rescaling 52 .
Imputation evaluation. The imputation evaluation was based on two datasets: the entire dataset (N = 1141) with missing data points imputed, and the omitted dataset (N = 957) derived from the entire dataset but with the missing data points omitted. For each dataset, the raw data were randomly split into two batches. Eighty percent of the raw data were randomly selected as the "derivation cohort", and the remaining were assigned to the "validation cohort." To explore whether imputation generated a bias, we employed tenfold cross-validation to compare the performance of models using the two datasets above 53 : the entire derivation cohort was divided into 10 subsets, and the holdout method was repeated 10 times. Each of the subsets was used in turn as the test set and the other nine subsets were combined to form a training set (see Fig. 5). The average MAE across all 10 tests was then computed. All features in the dataset were considered, and the hyperparameters for each algorithm used in these comparisons were set to their default values specified in the package 54 . They are listed in Supplementary Table S17. R 2 , MAE, mean square error (MSE), and root mean square error (RMSE) are commonly used evaluation metrics for regression models in ML. Measures of dispersion, such as the MAE and RMSE, are preferred over R 2 because the value of a model generally lies in its overall accuracy and precision, but not on how successfully it explains the dependent data variance 55 . To explore the impact of imputation on prediction, that is, the differences in MAE in the test set before and after omitting the missing data points, paired sample t-tests were performed. Defined as the average of the absolute values of the predicted C/D ratio ( y i ) minus the observed C/D ratio ( ŷ i ), the MAE was calculated according to Eq. (1): www.nature.com/scientificreports/ Model optimization. Model optimization involved feature selection and parameter adjustment. Before this step, raw data from the entire dataset were randomly split into two batches. Eighty percent (912 measurements) of the TDM measurements were randomly selected as the "derivation cohort" to develop an optimal regression model. The remaining 20% (229 measurements) were assigned to the "validation cohort," which was used to test the performance of the final model in terms of handling unseen samples. Only the derivation cohort was used for feature selection and parameter adjustment. Feature selection, which is crucial for developing a better and simpler model, is the process of selecting a subset of relevant variables (features) to be used for regression by removing redundant and/or irrelevant ones 56 . Here, an embedded approach to feature selection was chosen during training, together with a suitable regression algorithm, such as the algorithm evaluated here or the random forest regression algorithm, if the evaluated algorithm had no attribute with "feature_importance_. " Features  www.nature.com/scientificreports/ that best contributed to the model were learned while the model was being created 56 . They were then ranked according to measured importance. However, an optimal feature set should have had the fewest variables but the best predictive performance. Therefore, a forward feature selection strategy was used to find the shortest list of features. Briefly, forward feature selection began at the top of a ranked list of features, and iteratively generated increasingly longer lists, along with the corresponding models and their predictive performances on the test set using tenfold cross-validation by adding one feature at a time 57 . The C/D ratio and the corresponding feature covariates selected above were evaluated via Pearson's correlation visualization and heatmap analysis.
Once the optimal number of features had been determined, parameter adjustment was needed to optimize the learning algorithm. In general, we did this using tenfold cross-validation by choosing important hyperparameters first. The optimal parameters were filtered by mean cross-validation scores on the test set. As in the forward feature selection strategy, the parameter adjustment strategy involved iteratively generating increasingly larger parameter values and their corresponding scores. Finally, tenfold cross-validation was used in a grid search of the other key tuning parameters.
Model selection. As in feature selection and parameter adjustment, only the derivation cohort in the entire dataset was used for model selection. A total of 15 supervised ML algorithms were applied and evaluated: Ada-Boost regression (ABR), bagging regression (BR), decision tree regression (DTR), extra-trees' regression (ETR), gradient-boosted regression (GBR), k-nearest neighbor regression (KNR), lasso regression (LassoR), multiple linear regression (MLR), linear support vector regression (LSVR), multi-layer perceptron regression (MLPR), nu-support vector regression (NuSVR), random forest regression (RFR), ridge regression (RidgeR), support vector regression (SVR), and XGBoost regression (XGBR). The hyperparameters of each algorithm were optimized and the corresponding optimal feature set was determined according to the methods of feature selection and parameter adjustment considered. Considering the small size of our dataset, and to avoid bias, we employed tenfold cross-validation to compare the performance of the models 53 . They were then filtered by the MAE in the test data. To test the differences in the predictive performance of the models, unpaired student's t-tests were performed. One of the better models was then selected for validation, and its performance before and after optimization was compared using a paired sample t-test.
Model validation. The shortest list of features and the most suitable parameters were determined for the chosen algorithm to yield a better predictive model. Model validation was performed using three indices: MAE, MRE (%), and the percentage of predictions within 20% of the observed C/D ratios (percentage within 20%) in the validating cohort. The MRE (%), defined as the average of the ratio of an error in the predicted C/D ratio ( y i ) to the magnitude of the observed C/D ratio ( ŷ i ), was calculated according to Eq. (2): We selected the percentage within 20% because this range has been widely accepted and applied to assess models for predicting the dosages of such drugs as warfarin and tacrolimus (see Table 1) [19][20][21][22] . Moreover, ignoring problems of adherence and pharmacokinetic alteration, the intra-individual variation in the C/D ratio is generally considered to not be above 20% 7 . The above indices were calculated overall for the validation cohort, and in terms of the range of the C/D ratio based on the 25% quartile of the unseen data points, where this was divided into two categories: low C/D ratio range (< 22.1 μg mL −1 g −1 day) and intermediate-to-high C/D ratio range (≥ 22.1 μg mL −1 g −1 day). Given the wide range of values of the C/D ratio, we also considered modeling log 10 (C/D ratio) when treated as a label by using the proposed algorithm. In the context of modeling the C/D ratio and log 10 (C/D ratio), the respective performances of the optimized models on the validation cohort based on different ranges of the C/D ratio were compared.