Physicochemical properties of dietary phytochemicals can predict their passive absorption in the human small intestine

A diet high in phytochemical-rich plant foods is associated with a reduced risk of chronic diseases such as cardiovascular and neurodegenerative diseases, obesity, diabetes and cancer. Oxidative stress and inflammation (OSI) is the common component underlying these chronic diseases. Whilst phytochemicals and their metabolites have been demonstrated to regulate OSI, the timing and absorption for best effect are not well understood. We developed a model to predict the time to achieve maximal plasma concentration (Tmax) of phytochemicals in fruits and vegetables. We used a training dataset containing 67 dietary phytochemicals from 31 clinical studies to develop the model and validated the model using three independent datasets comprising a total of 108 dietary phytochemicals and 98 pharmaceutical compounds. The developed model, based on dietary intake forms and the physicochemical properties lipophilicity and molecular mass, accurately predicts Tmax of dietary phytochemicals and pharmaceutical compounds over a broad range of chemical classes. This is the first direct model to predict Tmax of dietary phytochemicals in the human body. The model informs the clinical dosing frequency for optimising uptake and sustained presence of dietary phytochemicals in circulation, to maximise their bio-efficacy for positively affecting human health and managing OSI in chronic diseases.

post-ingestion and completely cleared over the next few days 19 . Additionally, dietary intake forms of phytochemicals may also have an impact on their T max in the body 20 . Ellagic acid from a pomegranate extract was reported to have a T max of 0.5-1 h when ingested in liquid form, but 2-3 h when ingested in solid form 21 . It is possible that previous studies have underestimated the OSI-reducing effects of dietary phytochemicals if blood sampling was performed outside the timespan of T max in the body. For example, no effects of vitamin C supplementation (1 g/d) on plasma biomarkers of OSI were reported after either 1-day or 2-week treatment durations 22 . However, a bolus dose of vitamin C given 2 h before exercise prevented exercise-induced OSI 23 . The inconsistency in findings on the bio-efficacy of vitamin C could be due to blood sampling times that did not match the short T max of vitamin C (~3 h 24 ). The timing of dietary phytochemical consumption relative to OSI challenges (e.g., meal or exercise) could be an important factor in understanding and optimising the health benefits of phytochemicals.
Oral bioavailability of phytochemicals can be informed by the application of in silico modelling widely used in pharmaceutical sciences 25 and drug discovery 26 . These models correlate in vitro and/or in vivo passive absorption of drugs with their chemical structures described by physicochemical properties to predict the absorption of similar compounds 27 . Physicochemical properties of importance in drug absorption include molecular mass (M r ), lipophilicity (expressed as the logarithm of the partition coefficient between water and 1-octanol, log P), number of hydrogen (H) donors and acceptors 28 , polar surface area (PSA), number of freely-rotatable bonds 29 and molecular volume 30 . Multiple models have been developed to predict absorption kinetics and bioavailability of pharmaceutical compounds 27 . However, there is currently no such model for predicting T max of dietary phytochemicals from physicochemical properties.
The aim of this study was to determine if T max of dietary phytochemicals in healthy individuals could be predicted from standard physicochemical properties and dietary intake forms. To develop the predictive model, we used a training dataset that modelled the T max of 67 dietary phytochemicals collected from 31 clinical studies of healthy volunteers 18,19,21,24, to their calculated physicochemical properties. To validate the predictive model for dietary phytochemicals, we used an independent phytochemical validation dataset (PCv) containing 108 dietary phytochemicals collected from a further 34 clinical studies  . We validated the predictive model using pharmaceutical compounds and evaluated the effects of food on the prediction accuracy of the model by using two datasets containing 60 pharmaceutical compounds ingested without food (PHv-fasted) 92-148 and 38 pharmaceutical compounds ingested with food (PHv-fed) 92-95, 97, 98, 102-104, 106-111, 113, 116, 117, 121, 122, 126, 128, 130-133, 136, 138, 140, 143-146, 148-151 . This study demonstrates that physicochemical properties and dietary intake forms can be used to predict T max of dietary phytochemicals and pharmaceutical compounds when ingested without food.

Results
Correlation analysis of the training dataset. The model training dataset contained 11 variables: T max , 7 physicochemical properties and 3 categories of dietary intake forms (Supplementary Table S1). The included physicochemical properties were M r , log P, PSA, number of freely rotatable bonds, number of H donors, number of H acceptors and molecular volume. When predictor variables are highly correlated, multicollinearity affects the estimation of the coefficients and inflates the standard errors (SE). Therefore, to investigate the relationships between the physicochemical properties in the training dataset, Pearson correlation analyses were performed. Table 1 provides these Pearson's correlation coefficients (r) with their associated P-values. Significantly high correlations (|r| > 0.75, P < 0.05) were observed between M r and number of freely rotatable bonds (r = 0.772, P < 0.001), M r and molecular volume (r = 0.949, P < 0.001), log P and number of H acceptors (r = −0.755, P < 0.001), number of freely rotatable bonds and molecular volume (r = 0.901, P < 0.001), number of H acceptors and H donors (r = 0.949, P < 0.001), number of H acceptors and PSA (r = 0.998, P < 0.001), and number of H donors and PSA (r = 0.955, P < 0.001). Within each group of correlated variables, only one variable was retained for the predictive model; the retained variables were M r , PSA and log P.
To test the effects of dietary intake forms, Pearson correlation analyses between T max , M r , PSA and log P were performed with the inclusion of dietary intake forms (liquid, semi-solid and solid). High correlations were observed between PSA and log P in the liquid intake form (r = −0.82, P < 0.001) and in the semi-solid intake form (r = −0.93, P < 0.001). Therefore, the predictive model of T max was developed as 2 separate models: the 'log P model' containing log P and M r and the 'PSA model' containing PSA and M r .
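The screening step described above can be sketched in a few lines of Python. This is only an illustration of flagging variable pairs with |r| > 0.75; the study itself used R, and the function names (`pearson_r`, `collinear_pairs`) and the toy values are hypothetical, not data from the training dataset.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def collinear_pairs(columns, threshold=0.75):
    """Return (name_a, name_b, r) for every pair of columns with |r| > threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson_r(col_a, col_b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical toy values, NOT taken from Supplementary Table S1.
toy = {
    "Mr": [150.0, 300.0, 450.0, 600.0, 750.0],
    "mol_volume": [130.0, 260.0, 400.0, 520.0, 660.0],
    "logP": [2.0, -1.0, 3.0, 0.5, 1.0],
}

for name_a, name_b, r in collinear_pairs(toy):
    print(f"{name_a} vs {name_b}: r = {r:.3f}")
```

In this toy example only the M r and molecular volume pair is flagged, so one of the two would be dropped before fitting, mirroring the variable selection reported in the text.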

Development of the predictive model.
To develop the predictive model of T max for phytochemicals, we used regression modelling with a natural logarithm transformation of T max (ln (T max )) and the standard error (SE) of T max as weights to account for the uncertainty of each data point. We used the training dataset containing 67 phytochemicals collected from 31 clinical studies with a total of 384 healthy participants (Table 3). The predictive model comprised 2 mathematical models, the log P model and the PSA model, which fitted the data approximately equally well, with coefficients depending on dietary intake forms (Fig. 1). All models had statistical power of >0.999. The log P model estimated T max based on log P and M r (Fig. 1a-c). When phytochemicals were administered in liquid form, ln (T max ) was positively associated with log P and M r (Fig. 1a). When phytochemicals were administered in semi-solid (Fig. 1b) or solid (Fig. 1c) forms, ln (T max ) was independent of M r and followed a quadratic relationship with log P. The PSA model estimated T max based on PSA and M r (Fig. 1d-f). In the PSA model, ln (T max ) was positively associated with M r and negatively associated with PSA. Overall, the predictive model covered a M r range of 122-1270, a log P range of −4.7 to 9.8 and a PSA range of 0-465 Å 2 , corresponding to a T max range of 0.3-32.6 h (Table 3). Distribution patterns of log P, M r and PSA in the training dataset are shown in Fig. 2. Log P was relatively evenly distributed across the ranges −4.7 to 3 and 8.7 to 10 (Fig. 2a). Therefore, the log P model had to interpolate values between 3 and 8.5 because they were not represented in the training dataset. M r and PSA of the training dataset were evenly distributed (Fig. 2b and c).

Table 3. Summary of datasets for development and validation of the predictive model. a T max of phytochemicals were collected from clinical studies in the literature. b Physicochemical properties of phytochemicals including M r , log P and PSA were calculated using the Molinspiration Chemoinformatics calculator.

The prediction accuracy of the log P model and the PSA model in the training dataset was assessed by the root normalised mean square weighted error (RNMSWE) and the percentage relative error (%RE) of predictions (Table 4). The measured versus predicted values of ln (T max ) are plotted in Fig. 3a-c. The RNMSWE of prediction is an estimate of the standard deviation of the prediction normalised by the weights. As T max required a natural logarithm transformation, the RNMSWE in ln (hours) was transformed to the %RE of prediction, which is approximately the average % error of T max (in hours) relative to the mean of T max (in hours). The %RE of prediction of the log P model was 18.27%, 19.13% and 47.08% for the liquid, semi-solid and solid intakes, respectively. The %RE of prediction of the PSA model was 37.46%, 25.43% and 45.8% for the liquid, semi-solid and solid intakes, respectively (Table 4). Overall, for the training dataset, despite the similar R 2 , the log P model had lower %RE of prediction for the liquid and semi-solid intakes and a comparable %RE for the solid intake, and thus higher overall prediction accuracy.

Validation of the predictive model. To validate the predictive model, we used three independent datasets: the PCv, PHv-fasted and PHv-fed datasets. In comparison with the training dataset, all three validation datasets covered smaller ranges of log P, M r and PSA (Table 3, Fig. 2). The PCv dataset contained phytochemicals of similar chemical classes to the training dataset whilst the PHv-fasted and the PHv-fed datasets contained pharmaceutical compounds. The PCv dataset contained 108 phytochemicals including anthocyanins, flavanols, flavonols, hydroxybenzoic acids, hydroxycinnamic acids, stilbenes, carotenoids and vitamins (Supplementary Table S2).
Compared with the training dataset, the PCv dataset covered a similar log P range of −4.7 to 10 and a measured T max range of 0.5-37 h (Table 3), with sparsely distributed log P values (Fig. 2a). Log P values of the PCv dataset were more concentrated in the ranges −2.8 to −2.5 and 1.2 to 2.3. Similar to the training dataset, the PCv dataset lacked log P values from 5.6 to 8.4 (Fig. 2a). The PCv dataset covered a M r range of 138-758 and a PSA range of 0-330 Å 2 (Table 3, Fig. 2b and c). In comparison with the training dataset, M r and PSA of the PCv dataset were less evenly distributed (Fig. 2b and c).
To evaluate the prediction accuracy of the predictive model on the PCv dataset, we compared the measured versus predicted values of ln (T max ) in Fig. 3d-f and calculated the %RE in Table 4. The %RE of prediction of the log P model was 55.84%, 57.07% and 76.7% for the liquid, semi-solid and solid intakes, respectively. The %RE of prediction of the PSA model was 66.07%, 92.95% and 89.4% for the liquid, semi-solid and solid intakes, respectively (Table 4). Overall, for the PCv dataset, the log P model had lower %RE of prediction than the PSA model across all three intakes and thus higher prediction accuracy. Compared with the training dataset, the PCv dataset had higher %RE of prediction and thus lower prediction accuracy across all intake forms.
To validate the predictive model on pharmaceutical compounds, we used two pharmaceutical validation datasets: PHv-fasted and PHv-fed. All pharmaceutical compounds in the two datasets were administered in solid form (Table 3). The PHv-fasted dataset contained 60 compounds collected from 59 clinical studies and the PHv-fed dataset contained 38 compounds collected from 37 clinical studies (Table 3). The entire list of pharmaceutical compounds in the PHv-fasted dataset can be found in Supplementary Table S3 and that of the PHv-fed dataset in Supplementary Table S4. The two PHv datasets covered a similar log P range of −1.7 to 5.4 (Table 3) with a similar distribution pattern (Fig. 2a). The PHv-fed dataset covered slightly broader ranges of M r (123-823) and PSA (3-221 Å 2 ) than the PHv-fasted dataset (M r of 123-552 and PSA of 3-146 Å 2 ) (Table 3). Similar distribution patterns of M r and PSA were observed in the two PHv datasets (Fig. 2b and c).
To evaluate the effects of food on the prediction accuracy of the model, we compared the measured versus predicted values of ln (T max ) in Fig. 4 and calculated the %RE in Table 4. The %RE of prediction for the log P model was 45.18% for the PHv-fasted dataset and 93.37% for the PHv-fed dataset. The %RE of prediction for the PSA model was 162.69% for the PHv-fasted dataset and 92.01% for the PHv-fed dataset (Table 4). For the log P model, food increased the %RE of prediction and therefore reduced the prediction accuracy. By contrast, for the PSA model, food reduced the %RE of prediction and thus increased the prediction accuracy. Overall, the log P model and PSA model had similar %RE for the PHv-fed dataset. However, the log P model had substantially lower %RE for the PHv-fasted dataset and thus had higher prediction accuracy.

Discussion
This is the first direct model to predict the time of maximal plasma concentration (T max ) of dietary phytochemicals in the human body based on their physicochemical properties and dietary intake forms. The model was developed based on T max data from clinical studies of healthy individuals and therefore predicts the absorption of phytochemicals in the human body. To select the most important variables for the predictive model, we analysed the correlation between several physicochemical properties that are well known in pharmaceutical science to have significant impacts on oral bioavailability of drugs, such as molecular mass, lipophilicity, polar surface area, molecular volume, number of freely rotatable bonds, and number of hydrogen donors and acceptors [28][29][30] . We found significantly high correlations between some of the physicochemical properties and selected three physicochemical properties to use in the model: molecular mass, lipophilicity and polar surface area. These physicochemical properties were selected due to their well-known impacts on drug bioavailability as they are related to intestinal membrane permeability of a compound 28,29 . In order for a drug to cross the membrane, the compound needs to break hydrogen bonds with its aqueous environment and partition through the membrane 152 . Polar surface area is related to the hydrogen-bonding potential of a compound whilst molecular mass and lipophilicity are related to the membrane permeability. Consistent with the literature 28,29,152 , we found that these physicochemical properties had significant impacts on the T max of dietary phytochemicals in the human body. Further, dietary intake forms were also identified to have a significant impact on absorption of dietary phytochemicals and were included in the model development.
Similar to drug compounds, the effects of dietary intake forms on bioavailability of phytochemicals are related to the dissolution of phytochemicals within the gastrointestinal tract making them available for absorption 153 . Phytochemicals consumed in the semi-solid or solid forms would require a longer time to dissolve into the gastrointestinal environment before they are available for absorption. The predictive model based on lipophilicity and molecular mass provides a quantitative and high-throughput tool for prediction of T max of dietary phytochemicals and also pharmaceutical compounds ingested without food. T max of a phytochemical or pharmaceutical compound that has not been studied in vivo can thereby be calculated from its molecular mass and log P for the three intake forms (liquid, semi-solid or solid) using the equations reported in this predictive model (Fig. 1a-c). For example, the phytochemical phloretin (M r = 274.27, log P = 2.66) found in apple would be predicted to have T max of 1.05, 0.62 and 1.6 h when consumed in liquid, semi-solid and solid forms, respectively. The model covers a broad range of chemical classes from phenolic compounds to carotenoids, from very hydrophilic (log P ~ −4.7) to very lipophilic (log P ~ 10), with a wide molecular mass range of M r ~ 122-1270. The prediction accuracy of the model was indicated by a relative error of prediction of 18-77% for a total of 175 dietary phytochemicals tested and 45% for 60 pharmaceutical compounds ingested without food (Table 4). The relative error of prediction is an indication of the total error of prediction compared to the mean. Our literature searches show that published T max values have an SE between 0 and 200% of the mean (Supplementary Tables S1-S4). Therefore, the prediction accuracy of our model was deemed adequate for valid prediction of T max . Additionally, considering that a statistical power of 0.8 is the standard for adequacy 154 , our model with power of >0.999 had high statistical power for confidence in prediction accuracy.

Table 4. Comparison of prediction accuracy of the predictive model for each dataset.
The predictive model was limited by the experimental data available in the literature. The T max variable was logarithmically transformed to alleviate the non-normality of the errors. However, there were gaps in the independent variables of log P from 3-8.5 and M r from 750-1270 that the model had to interpolate across (Fig. 2). Therefore, further data covering the complete range of the parameter space would increase the rigour of the model. Additionally, we observed an increase in the relative error of prediction for pharmaceutical compounds when ingested with food (Table 4). Mechanisms whereby food affects drug absorption and bioavailability have been well studied. Food promotes absorption of lipophilic drugs due to improved drug solubilisation whilst reducing absorption of hydrophilic drugs due to delayed drug permeation 155 . Similar effects of food on absorption of dietary phytochemicals have been observed 20 . Increased absorption of the lipophilic compound lycopene in tomato was reported when consumed with olive oil 156 . Hydrophilic compounds such as phenolic acids and anthocyanins were observed to bind to fibre, which compromised their absorption during simulated gastric and small intestinal digestion 157 . Further, protein in food has been reported to reduce absorption of dietary phytochemicals in chocolate 158 . Our predictive model was developed based on dietary phytochemicals administered as single-source phytochemicals or phytochemical extracts and also phytochemicals consumed in their natural matrices of whole fruits and vegetables (Supplementary Table S1). Apart from the models for phytochemicals consumed in liquid (Fig. 1a) or solid (Fig. 1c) forms, mostly in isolation or as extracts, a statistically valid model was also developed from consumption of phytochemicals mostly (75%) in whole fruits and vegetables, which accounted for the effects of these matrices on phytochemical absorption in semi-solid form (Fig. 1b).
Therefore, the effects of interactions of phytochemicals with macronutrients such as fibre and protein from the natural matrices were accounted for to a small extent. Conversely, the impact of macronutrients from food sources other than natural plant food matrices on T max of phytochemicals is not accounted for. Considering that macronutrients are known to interact with phytochemicals and thereby alter their T max 20 , the developed model may less accurately predict the T max of phytochemicals when consumed in conjunction with other foods. Accordingly, the predictive model reported herein is most applicable for prediction of T max of dietary phytochemicals and pharmaceuticals ingested without food.
In this study, the time of maximal plasma concentration (T max ) was chosen as the most relevant measure for the predictive model due to its importance in understanding and optimising the health benefits of dietary phytochemicals. Phytochemicals are treated as xenobiotic species and therefore display transient presence in circulation 16 . Under this circumstance, the T max is of prime importance in predicting the presence of any phytochemical, with the expectation that it will be substantially eliminated after a few hours or a few days depending on the phytochemical 18,19 . Dietary phytochemicals can mitigate the oxidative stress and inflammation (OSI) that is associated with daily activity and is consistently elevated in chronic diseases [7][8][9] . Managing OSI associated with daily activity is likely an important strategy for reducing disease risk in both healthy and unhealthy people. The time of maximal plasma concentration of dietary phytochemicals has recently been reported to have an important impact on their ability to regulate OSI 159 . Consumption of a strawberry drink 2 h before a high fat meal maximised protection against OSI compared with having the drink with or 2 h after the meal 159 , supporting the view that the T max of dietary phytochemicals must be matched to the OSI challenge for optimal health protection 159 . The T max of strawberry phytochemicals was reported to be about 1-2 h; therefore, consumption of the strawberry drink 2 h before the meal allowed their presence at maximal plasma concentration to reduce the OSI burden stimulated by the high fat meal 160 . Here, we chose T max instead of maximal plasma concentration (C max ) in the predictive model as T max seems to be less affected by dose. For example, T max of lycopene was reported to be about 5 h irrespective of the dose whilst C max increased with dose escalation 65 .
Furthermore, the anti-OSI response of phytochemicals does not necessarily continue to increase with dose and higher concentrations of phytochemicals may become pro-oxidants and promote OSI [161][162][163] . Without good understanding of the target C max for maximising phytochemical efficacy, C max is less useful than T max .
Although the study is not concerned with post-primary absorption of phytochemical metabolites formed during hepatic and microbial metabolism, it is acknowledged that these metabolites may also contribute to the regulation of OSI similarly to their parent compounds [164][165][166] . Therefore, it is important to consider the reported T max of these derived metabolites (not predicted by the model) together with T max of the parent compounds predicted by this model. The main hepatic metabolites of phytochemicals are glucuronide, sulphate and methylation derivatives with short T max values that range from 0.5 h up to 2.5 h 42 , indicative of rapid clearance by the hepatic portal system. Chemical transformations of phytochemicals by the colonic microbiota include hydrolysis, reduction, ring-cleavage, demethylation and dihydroxylation of both parent compounds and their hepatic derivatives 167,168 . Accordingly, metabolites with T max > 5 h are likely to be absorbed or transformed with the involvement of the colonic microbiota 169 .
The ability to predict T max of dietary phytochemicals offers a valuable tool for designing clinical studies to capture the time of maximal phytochemical concentration in the human body and to avoid underestimation of their impacts on regulation of OSI. We propose that by matching T max to the biological cycle of OSI, suppression of OSI would be maximised and the associated tissue damage would be minimised. Therefore, the strategy for optimising the protective efficacy of dietary phytochemicals involves selection of phytochemical sources to achieve desirable T max values that target different needs for OSI regulation. Using the unique approach of combining phytochemical-rich foods based on computable physicochemical properties, we can understand the absorption characteristics of dietary phytochemicals to achieve their full potential for protective health benefits.

Methods
Clinical data collection. Clinical measures of T max were obtained from the literature using the PubMed database. Information collected included compound name and family, sources, dose, intake forms and T max (as mean ± SE, hours). When T max was given as median and range, conversion to mean and SE was performed as described in Hozo et al. 170 . The inclusion criteria for publications were: 1) randomised controlled clinical trials in healthy volunteers; 2) inclusion of a wash-out period when the study followed a cross-over design; 3) phytochemicals (PCs) analysed were passively absorbed, i.e., compounds found in the plasma or serum were unchanged from those ingested; and 4) plasma analysed without enzymatic deconjugation.
The data collected here were included in the training dataset.
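As an illustration of the median-and-range conversion mentioned above, the sketch below applies the estimation rules commonly attributed to the Hozo et al. method. The sample-size cut-offs (25, 15 and 70) and the formulas are stated here from memory of that method and should be checked against reference 170; the text does not say which of these rules the authors applied, and the function name `hozo_mean_se` is hypothetical.

```python
from math import sqrt


def hozo_mean_se(low, median, high, n):
    """Estimate (mean, SE) from a reported median, range and sample size.

    Assumed rules (to be verified against Hozo et al.):
      mean: (low + 2*median + high) / 4 for n <= 25, otherwise the median
      SD:   variance formula for n <= 15, range/4 for 15 < n <= 70, range/6 above
    """
    if n <= 25:
        mean = (low + 2.0 * median + high) / 4.0
    else:
        mean = median

    if n <= 15:
        variance = ((low - 2.0 * median + high) ** 2 / 4.0 + (high - low) ** 2) / 12.0
        sd = sqrt(variance)
    elif n <= 70:
        sd = (high - low) / 4.0
    else:
        sd = (high - low) / 6.0

    # Standard error of the mean from the estimated standard deviation.
    return mean, sd / sqrt(n)


# Hypothetical report: median Tmax 2 h, range 1-4 h, 9 participants.
mean_tmax, se_tmax = hozo_mean_se(1.0, 2.0, 4.0, 9)
print(f"mean = {mean_tmax:.2f} h, SE = {se_tmax:.3f} h")
```

The resulting SE is what would feed into the regression weights, so a study reporting only a median and range can still contribute a weighted data point.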

Pearson correlation analysis between variables in the training dataset. Pearson correlation analyses of all variables included in the training dataset were performed using the statistical package R version 3.3.2 171 . Results were reported as Pearson's correlation coefficient (r) and P-values.
Development of the predictive model. The predictive model was developed within a linear model framework using the statistical package R. The dependent variable T max required a natural logarithm transformation (ln(T max )) to account for the non-normality of errors in the variance across all observations of T max . The SE of each sample was used to derive the weights for the regression modelling of T max . Because T max followed a log-normal distribution, and since:

$$\operatorname{Var}(\ln Y) \approx \frac{\operatorname{Var}(Y)}{[E(Y)]^{2}}$$

where E(Y) = expected value of Y = mean(Y), the calculated weights for the regression modelling were:

$$w = \left(\frac{\operatorname{mean}(Y)}{\operatorname{SE}(Y)}\right)^{2} \qquad (1)$$

When SE was missing, the weight was set to 4 and when SE was zero the weight was set to 400. Significance testing between T max and the physicochemical properties of phytochemicals was carried out using multivariate regression.
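The weighted regression described above can be sketched in plain Python. This is a minimal illustration, not the authors' R code: it assumes the weights are the squared ratio of mean to SE (the inverse of the approximate variance of ln(T max ), which is consistent with the fallback weights of 4 and 400), and it fits the linear form reported for the liquid log P model, ln(T max ) = b0 + b1·log P + b2·M r . All function names, coefficients and data values are hypothetical.

```python
from math import exp, log


def weight_from_se(mean, se):
    """Regression weight for ln(Tmax); fallbacks 4 and 400 follow the text."""
    if se is None:      # SE not reported
        return 4.0
    if se == 0:         # SE reported as zero
        return 400.0
    return (mean / se) ** 2  # assumed form: 1 / Var(ln Y) ~ (mean / SE)^2


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_wls(rows, y, w):
    """Weighted least squares via the normal equations (X'WX) b = X'Wy."""
    p = len(rows[0])
    xtwx = [[sum(wi * r[j] * r[k] for r, wi in zip(rows, w)) for k in range(p)]
            for j in range(p)]
    xtwy = [sum(wi * r[j] * yi for r, yi, wi in zip(rows, y, w)) for j in range(p)]
    return solve(xtwx, xtwy)


# Hypothetical compounds: (log P, Mr) with Tmax generated from made-up coefficients.
true_b = (0.1, 0.2, 0.001)
compounds = [(-1.0, 200.0), (0.5, 300.0), (2.0, 450.0), (3.0, 600.0), (1.0, 250.0)]
tmax = [exp(true_b[0] + true_b[1] * lp + true_b[2] * mr) for lp, mr in compounds]
ses = [0.2, None, 0.0, 0.5, 0.3]  # None = SE missing, 0.0 = SE reported as zero

rows = [[1.0, lp, mr] for lp, mr in compounds]
y = [log(t) for t in tmax]
w = [weight_from_se(t, s) for t, s in zip(tmax, ses)]
coef = fit_wls(rows, y, w)
print([round(c, 4) for c in coef])
```

Because the toy T max values are generated without noise, the fit recovers the made-up coefficients; with real data the weights would pull the fit towards the more precisely measured T max values.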
Power analysis of the predictive model. Post hoc power analysis of the predictive model was performed using the power calculation program G*Power 3.1.9.2 172,173 .
Validation of the predictive model. The prediction accuracy of the predictive model was validated using three independent datasets of measured T max obtained from clinical studies using the same selection criteria: the PCv, PHv-fasted and PHv-fed datasets. Measured T max was collected as mean ± SE (hours). The prediction accuracy of the predictive model was evaluated by the normalised mean square weighted error (NMSWE) and % relative error of prediction for each dataset. The NMSWE of prediction was calculated:

$$\mathrm{NMSWE} = \frac{\frac{1}{N}\sum_{i=1}^{N} w_i\,(Y_i - \hat{Y}_i)^{2}}{\frac{1}{N}\sum_{i=1}^{N} w_i}$$

where w i is the weight calculated as in Equation 1, Y i is ln(T max_measured ), Ŷ i is ln(T max_predicted ) and N is the number of data points. Root NMSWE (RNMSWE) was calculated:

$$\mathrm{RNMSWE} = \sqrt{\mathrm{NMSWE}}$$

Let Δ = RNMSWE of prediction. If ɛ is the error in predicted values of T max and ln(T max + ɛ) is predicted from the predictive model, then:

$$\ln(T_{max} + \varepsilon) - \ln(T_{max}) = \Delta$$

Converting Δ (ln hours) to hours:

$$\frac{\varepsilon}{T_{max}} = e^{\Delta} - 1$$

The % relative error (RE) of prediction is an approximately averaged error over all data points in the dataset:

$$\%\mathrm{RE} = \left(e^{\Delta} - 1\right) \times 100$$
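The accuracy metrics described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it takes the NMSWE to be the weighted squared error on the ln scale divided by the sum of the weights, and the %RE to be (e^Δ − 1) × 100, which is the back-transformation implied by ln(T max + ɛ) − ln(T max ) = Δ. The function names and input values are hypothetical.

```python
from math import exp, log, sqrt


def rnmswe(measured_ln, predicted_ln, weights):
    """Root normalised mean square weighted error on the ln(Tmax) scale."""
    weighted_sq_error = sum(
        w * (y - y_hat) ** 2
        for y, y_hat, w in zip(measured_ln, predicted_ln, weights)
    )
    return sqrt(weighted_sq_error / sum(weights))


def percent_relative_error(delta):
    """Convert an error in ln(hours) to an approximate % error in hours."""
    return (exp(delta) - 1.0) * 100.0


# Hypothetical measured and predicted ln(Tmax) values with their weights.
measured = [0.0, 1.0]
predicted = [0.1, 0.8]
weights = [4.0, 1.0]

delta = rnmswe(measured, predicted, weights)
print(f"RNMSWE = {delta:.4f} ln(h), %RE = {percent_relative_error(delta):.2f}%")
```

Under this assumed conversion, the 45.18% relative error reported for the log P model on the PHv-fasted dataset would correspond to a Δ of about 0.37 ln(hours).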