Gastrointestinal Spatiotemporal mRNA Expression of Ghrelin vs Growth Hormone Receptor and New Growth Yield Machine Learning Model Based on Perturbation Theory

The management of ruminant growth yield has economic importance. The current work presents a study of the spatiotemporal dynamic expression of Ghrelin and GHR at mRNA levels throughout the gastrointestinal tract (GIT) of kid goats under housing and grazing systems. The experiments show that the feeding system and age affected the expression of either Ghrelin or GHR with different mechanisms. Furthermore, the experimental data are used to build new Machine Learning models based on the Perturbation Theory, which can predict the effects of perturbations of Ghrelin and GHR mRNA expression on the growth yield. The models consider eight longitudinal GIT segments (rumen, abomasum, duodenum, jejunum, ileum, cecum, colon and rectum), seven time points (0, 7, 14, 28, 42, 56 and 70 d) and two feeding systems (Supplemental and Grazing feeding) as perturbations from the expected values of the growth yield. The best regression model was obtained using Random Forest, with the coefficient of determination R2 of 0.781 for the test subset. The current results indicate that the non-linear regression model can accurately predict the growth yield and the key nodes during gastrointestinal development, which is helpful to optimize the feeding management strategies in ruminant production system.


Results and Discussion
Spatiotemporal mRNA expression of Ghrelin and GHR. During all the three stages of development, Ghrelin was expressed throughout the GIT of kid goats, with a greater expression of Ghrelin (P < 0.01) observed in the abomasum than those in the other remaining segments (Table 1). This was in accordance with a previous study in sheep 18 . The predominant expression of Ghrelin mRNA in the abomasum was consistent with the findings observed in humans and rats. As the abomasum of ruminants is functionally like the stomach of non-ruminants, the predominant expression of Ghrelin mRNA in the abomasum of ruminants was also consistent with the greater expression of Ghrelin in the stomach of non-ruminants (humans and rodents) 1,33,34 . This implied that similar organs evolved according to similar working mechanisms. The Ghrelin was moderately expressed in the duodenum and jejunum. This was consistent with the results of Ghrelin expression in the small intestine of humans and rats 34,35 . Meanwhile, the expression of Ghrelin in the abomasum was increased (P < 0.001) with age (from pre-rumination phase to rumination phase); however, the mRNA expression of Ghrelin in the other segments was almost unchanged (P > 0.05) with age (Table 1). A more intuitional predicted presentation of spatiotemporal mRNA expression of Ghrelin throughout the GIT of kid goats is shown in Fig. 1A. A similar sequential, dynamic, and developmental change of Ghrelin was also observed in sheep by Huang et al. 18 , in whose study the Ghrelin mRNA levels were gradually increased in the abomasum in accordance with growth curve during early developmental periods.
Like Ghrelin, GHR was expressed throughout the GIT of kid goats during three developmental stages (Table 2), and a more intuitional predicted presentation of spatiotemporal mRNA expression of GHR is shown in Fig. 1B. These results are in agreement with previous studies carried out in the GIT of humans and rats 36,37 . The widespread expression of GHR in the GIT suggested the regulatory roles of GH on digestive and immune functions, including metabolism, growth, or differentiation 5,38 . During pre-rumination phase, the expression of GHR in the abomasum, duodenum and jejunum was greater (P < 0.01) than those in the other GIT segments; similarly, GHR tissue distribution pattern was observed during transition and rumination phases, with relatively greater expression in the abomasum, followed by the duodenum, colon and rectum. Furthermore, the expression of GHR in major segments of GIT (except for the colon and rectum) was reduced (P < 0.05) with age ( Table 2). In porcine gastric tissue, the gastric GHR mRNA expression was found to be significantly correlated with the relative gastric weight (r = 0.541) 39 .
Scientific RepoRts | 6:30174 | DOI: 10.1038/srep30174 Effects of feeding systems on mRNA expression of Ghrelin and GHR. During time interval d 28-70, the mRNA expression of Ghrelin in the abomasum, duodenum and jejunum was affected (P < 0.01) by both feeding system and age, with relatively greater expression in the G group than those in the S group (Table 1). The expression of GHR at mRNA level was affected by the feeding system (P ≤ 0.001) in all segments of the GIT and by age (P < 0.05) in major segments of GIT (except for the jejunum and rectum), with greater expression in  4 and P value 4 represent SEM and P value for different tissues at each developing stage. A,B Means in the same row not bearing a common superscript letter differ (P < 0.05); a-c Means in the same column not bearing a common superscript letter differ (P < 0.05); S, supplemental feeding; G, grazing; L = Linear effect of age, Q = Quadratic effect of age. the S group than in the G group in all GIT segments (Table 2). This suggested that the expression of Ghrelin and GHR was reversely affected by feeding type and age. Sugino et al. 15 reported that the expression of Ghrelin can be modified by the feeding regimen in sheep; the Ghrelin secretion levels before prandium are higher in animals fed twice daily than those in animals fed four times daily 40 . Since Ghrelin usually acts as a starvation signal, it was reasonable that reduced Ghrelin mRNA expression was observed in the S group supplemented with concentrates in the current study. It has long been accepted that the GHR abundance is submitted to a developmental and nutritional regulation in a tissue-specific manner 4 . However, few works studied the GHR expression in the GIT under different nutritional status. The present results implied that supplemental feeding could increase the GHR mRNA expression in the GIT. However, there were no feeding system × age interactions (P > 0.05) in the GIT segments on either Ghrelin or GHR expression, except for Ghrelin expression in the duodenum (P < 0.01) and GHR expression in the rumen (P < 0.001) and jejunum (P < 0.05). This further implied that feeding system and age affected the expression of either Ghrelin or GHR with different mechanisms. Furthermore, from time points d 0 to 70, the expression of Ghrelin in the abomasum was quadratically increased. The same increase pattern in Ghrelin expression was also observed in the duodenum (both S and G groups) and jejunum. The expression of GHR in the rumen and duodenum was affected by age, with a linear decrease for both S and G groups (P < 0.01), as well as a quadratic decrease for the G group (P < 0.01). In the abomasum, jejunum, ileum and cecum, the expression of GHR at mRNA level was affected by age linearly (P < 0.01); the quadratic decrease was observed in the jejunum, ileum, cecum and colon (P < 0.05).

Model construction
Dataset. The present study first uses reported data of the Ghrelin and GHR gene mRNA expression throughout the GIT of kid goats, and our previous work data 27 of Live weight vs Carcass weight (L w vs C w ) that were employed as a dataset for Y(ζ k ) model construction. Different experimental conditions such as the longitudinal GIT segments (s), the postnatal time (t), and the feeding method (m) were defined as deviations or perturbations of the Y(ζ k ) model (where ζ = s, t, and m). Each perturbation had several levels (k), expressed as k′ , k″ and k″′ , respectively. In detail, s k′ represents 8 different segments of the GIT (k′ = 1, 2, 3, 4, 5, 6, 7 and 8; 1 = rumen, 2 = abomasum, 3 = duodenum, 4 = jejunum, 5 = ileum, 6 = cecum, 7 = colon and 8 = rectum  Table 2. The effect of Supplemental vs Grazing feeding system on the mRNA expression of growth hormone receptor (GHR) and tissue distribution and dynamic developmental changes of GHR mRNA expression during different stages of development SEM 1 represents SEM for System × Age (from 28 to 70 d of age) on GHR expression; P value 1 represents P value for both treatment groups from 28 to 70 d of age on GHR expression; SEM 2 and P value for age 2 represent SEM and P value for age from 0 to 70 d; SEM 3 and P value 3 represent SEM and P value for relative GHR expression values at different development stages; SEM 4 and P value 4 represent SEM and P value for different tissues at each developing stage. A,B Means in the same row not bearing a common superscript letter differ (P < 0.05); a-c Means in the same column not bearing a common superscript letter differ (P < 0.05); S, supplemental feeding; G, grazing; L = Linear effect of age, Q = Quadratic effect of age.
Scientific RepoRts | 6:30174 | DOI: 10.1038/srep30174 feeding and Grazing). A total of samples N s = 352 collected from 44 (N a = 44, N a = number of experimental animals) kid goats were studied for the mRNA expression of Ghrelin and GHR. The detailed full dataset was provided in online supplementary material SM01 41 . In order to carry out a perturbation theory analysis, a pair-wise analysis of query samples vs reference samples was done by using a previous dataset of N s to construct a two-block dataset (see SM01 41 ), with the number of perturbation cases N c = 123 872 pairs of query and reference samples selected randomly from the 352 samples. The details about PT models could be found in our previous works 28,42 .

Regression models. The schematic diagram of the present work aimed at developing a new Expected
Measurement Moving Average-Machine Learning (EMMA-ML) model to predict growth yield Y(ζ k ) is presented in Fig. 2. Box-Jenkins Operators and PT were used to handle the deviations or perturbations in the current study.
In addition, Moving Average (MA) was also used to account for the "small" deviations of different growth stages, the spatiotemporal factors on the Ghrelin and GHR mRNA expression. Several types of predictive model were presented in the study for growth ratio-yield Y(ζ k ) based on the perturbation or variations of different experimental conditions (ζ k = s k′ , t k″ , m k″′ ). In the first step, Y(ζ k ) exp was defined as the expected Y(ζ k ) values under a set of given experimental conditions (ζ k = s k′ , t k″ , m k″′ ). Then, MA was used to define the "small" deviations under the different conditions (ζ k ). Finally, a general formula of EMMA models for predicted growth yield Y(ζ k ) pred . Eq. 1 showed the linear model constructed by setting the mRNA expression of Ghrelin and GHR as variables input.
k pred 0 0 k exp g 1 2 g g k g 1 2 g g k Y(ζ k ) pred represents the predicted value of Y(ζ k ). The coefficients for each input variables in this general equation are e 0 , a 0 , a g and b g . The subscript "g" refers to two input variables: the mRNA expression of Ghrelin (g = 1) and GHR (g = 2). The expected value of Y(ζ k ), Y(ζ k ) exp , is the first class of input variables. V g (ζ k ) represents the mRNA expression of Ghrelin or GHR, the second class of input variables. Δ V g (ζ k ) are the perturbation values and the third class of input variables. In Eq. 2, the general equation was expanded according to a set of experimental conditions (ζ k ).
k pred 0 0 k exp g 1 2 g g k g 1 2 g g k g 1 2 g g k g 1 2 g g k e 0 , a 0 , a g , b g , c g and d g are the coefficients for the corresponding input variables and Y(ζ k ) exp represents the expected values of Y(ζ k ) in a set of reference experimental conditions. V g (ζ k ) represents the input variables in a query (set of conditions) (ζ k = s k′ , m k″ , or t k″′ , respectively). Another class of input variables, Δ V g (s k′ ), Δ V g (t k″ ) and Δ V g (m k″′ ), refers the perturbation values in a set of experimental conditions s k″ , t k″ and m k″′ for each V g (ζ k ). < V g (ζ k )> is the MA/Box-Jenkins Operator for variable V g (ζ k ) (Eq. 3), which can be calculated with all the perturbation cases under the same experimental conditions 28,43 .
Thus, Y(ζ k ) exp , V g (ζ k ) and Δ V g (ζ k ) were employed as input variables to develop a new Machine Learning predictive model using Statistica 6.0 44 and RRegrs package 45,46 .
Mapping Ghrelin/GHR vs Yield. The first tested method was the General Multilinear Regression (GRM) from STATISTICA. The model predicted the effects of spatiotemporal perturbations of Ghrelin and GHR mRNA expression on Y(ζ k ) and it is presented in Eq. 4. For the sake of simplicity, the output/features have the compacted notations in the model analysis: influenced the growth yield. More specifically, the growth yield Ypred was increased with the increasing mRNA expression of expected values < V 2 (s k′ )> and < V 2 (t k″ )> for different GIT segments and age, but the growth yield was decreased with the expected values of < V 2 (m k″′ )> for different types of feeding management. Expected values < V 2 (ζ k )> of GHR mRNA expression corresponding to the growth yields are shown in Table 3. In addition, the mRNA expression of GHR in the abomasum and duodenum could positively increase the growth yield. This was in accordance with previous reports claiming that GH has proliferative effects on the intestinal epithelium, and influences enteroendocrine cell secretion, calcium absorption, and intestinal amino acid and ion transport 5 . The P-values of the features show that only Y exp and V 1 are important for the linear model. Thus, the linear model shows that mRNA expression of Ghrelin is more important for the growth yield compared to the perturbations of GHR mRNA expression. Thus, it can be stated that the perturbations of GHR mRNA expression need non-linear modelling for the growth yield production.
In addition to the GLM method from STATISTICA, two types of neural networks were tested with the normalized dataset (see Table 4): Linear Neural Network (LNN, no hidden layers) and Multilayer Perceptron (MLP, with at least one hidden layer). Both models presented R 2 test values between 0.529 and 0.539. The models have the same problem as the GRM one; the perturbations improve a little the model performance: LNN − > MLP with 1 hidden layer − > MLP with 2 hidden layers (with the same number of inputs). The best MLP model was MLP 5:5-15-12-1:1: 5 inputs, 15 neurons in the hidden layer 1, 12 neurons in the hidden layer 2, R 2 test = 0.539. As expected, LNN 5:5-1:1 is the linear combination of the input features and it showed similar results to GRM with R 2 = 0.534 (see details in SM03 48 ).
In order to test different complex regression methods, seven regression RRegrs methods 45,46 were used to build prediction models: Multiple Linear regression (LM), Generalized Linear Model with Stepwise Feature Selection (GLM) 50 , Partial Least Squares Regression (PLS) 51 , Lasso regression (Lasso) 52 , Elastic Net regression (ENET) 53 , single hidden layer Neural Networks regression (NN) 54 , and Random Forest regression (RF) 55 . Table 5 shows RMSE and R 2 values for the training and test subsets. LM, GLM PLS, Lasso and ENET provided models close to the GLM from STATISTICA: R 2 test = 0.533-0.534 and RMSE test = 0.0992-0.0994. Lasso is not able to provide a model that includes the system perturbations and it used only one feature, Y exp . ENET was based on only two features (Y exp and V 1 ) with the same R 2 test as LM/GLM. The feature importance analysis for GLM (Fig. 3A) pointed out the natural importance of Y exp and the preference of using Y exp and V 1 , obtaining similar results to those obtained with MLP in STATISTICA. Similar Y exp feature importance is presented by Lasso and ENET. NN with one hidden layer provided a very small improvement, but with the same power as MLP 5:5-15-12-1:1 with 2 hidden layers from STATISTICA. Figure 3B shows the NN feature importance. Several parameters were tested to find the optimal topology of NN (see The best regression performance was obtained with the RF regression method: R 2 test of 0.629 and RMSE test of 0.0886 using five trees and all features. The variation of the RMSE test with the number of features selected as inputs for the trees (RF parameter mtry) is presented in Fig. 3D. Thus, two features were used as input for each tree.
These statistics show the difficulty of the linear models to establish a relationship between the output and the features. The strong preference for Y exp is natural because it was calculated based on the observable output values. Thus, the linear models have difficulty to include perturbation of V 2 (GHR mRNA expression) in the model and only RF method was able to improve the regression model performance with R 2 test > 0.60 (still low performance). Table 3. Expected values <V 2 (ζ k )> of GHR mRNA expressions corresponding to growth yields. a Lactation represents the suckling periods of the goats (0-20 d), Housing refers to the goat with housing feed management, Grazing refers to the goats with grazing feed management. b The green color means the strong/positive effect on growth yields, whereas, red color represents the poor/negative effect on growth yields. * The mark * means the mRNA expression of GHR under this condition corresponds to the higher growth yields. L, lactation; S, supplemental feeding; G, grazing. The results obtained with STATISTICA and RRegrs for the linear and non-linear regression models demonstrated the relatively moderate prediction power of the models using the original dataset, with a maximum of 0.629 for R 2 test (RF): the model explains only 62.9% of the response data variability around its mean. For this reason, a pre-processing of the original datasets was used to improve the dataset quality by removing the outliers. Calculating the Pearson residuals for the fitted values, a filter was used for the cases with residuals which are twice (0.4) as high as the maximum residual of the majority of data (0.2). Therefore, only 2.3% of the cases were filtered, resulting in a new dataset of 121,056 cases. The raw and normalized corrected datasets were the inputs for the same linear and non-linear regression methods from STATISTICA and RRegrs tools.

Model
In the first step, STATISTICA neural networks were tested for the normalized filtered dataset (see Table 6 and SM03 48     In the second step, the filtered dataset was processed with seven RRegrs methods: LM, GLM, PLS, Lasso, ENET, NN, and RF (Table 7). In addition, Fig. 4 presents the pairwise model comparisons of R 2 test (A) and RMSE test (B). The average performance value (dot) with two-sided confidence limits was computed using Student's t-test with Bonferroni multiplicity correction. Even if the LM, GLM, PLS and ENET improve the model performance with over 0.04 of R 2 test compared to the non-filtered dataset; the results are still moderate with the R 2 test very close to 0.60. Lasso failed to select the system perturbation, in agreement with the previous results. NN model makes a clear difference with R 2 test of 0.702 and RMSE test = 0.117. NN has the topology of 5-20-1 (5 inputs, 20 neurons in one hidden layer, weight decay = 0.005). Figure 5A presents the NN parameter study using 5, 10, 15, 20 and 50 hidden layer neurons and weight decays of 0.0, 0.0001, 0.001 and 0.005. This performance is similar to the one obtained with MLP from STATISTICA (R 2 test = 0.704). The best model to predict growth yield (Y pred ) was provided by RF method based on all features and five trees, with R 2 test of 0.781 and RMSE test of 0.101. By decreasing the number of trees to one, R 2 test is lower (0.749) and RMSE test is higher (0.109). Thus, the increase of the RF model complexity from one to five trees is needed to improve the regression performance. If the number of trees is increased to 10, R 2 test becomes higher, with a value of 0.783 and RMSE test lower (0.100). The difference between our best RF model with five trees and the one with ten trees is not in agreement with the increase of the model complexity. Thus, no further numbers of trees were tested with RF. The model can be downloaded from Figshare (SM04 56 ).
Thus, the prediction regression model is constructed using mRNA expression in different parts of the GIT (spatial variation) and at different time-points (time variation). These are the spatiotemporal variations of the mRNA expression. The model input variables are MA values that are calculated as differences of the original values and the averages of the variables in eight different segments of the GIT, seven different sampling time points and two different feeding systems. The best final model could be used to predict the growth yield using new Ghrelin and Growth mRNA expressions measured under the model experimental conditions: one of the eight segments of the GIT, one of the seven sampling time points and one of the two feeding systems. The model input features as MA values will be calculated using the model averages of Ghrelin and Growth mRNA expressions from the training dataset. Introducing these MA values in the RF model, the growth yield can be predicted.
In conclusion, the current study investigated the tissue distribution and sequential dynamic developmental changes of Ghrelin and GHR mRNA expression. The factors of spatiotemporal development of GIT were taken into account, along with Supplemental feeding vs Grazing feeding systems, and new Machine Learning models were developed in order to predict the growth yield depending on the mRNA expression of Ghrelin and GHR after perturbations/variations of different conditions. Using linear and non-linear Machine Learning methods, it was found that the Random Forest method provided the best regression prediction model with R 2 test > 0.78. In addition, this model also revealed that the mRNA expression of GHR could positively reflect the rate of growth yield, and it is crucial during the processes of growth and development in ruminants.     expression of Ghrelin and GHR in the GIT of kid goats. After birth, the kids were separated from the dams and trained to suckle milk from nipple pails. Detailed feeding management, ingredients of concentrate starter and forage (mainly Miscanthus) have been described in our previous parallel study 27 . All goats had free access to water. Sample Collection. Mucosa samples of different GIT longitudinal segments (i.e., rumen, abomasum, duodenum, jejunum, ileum, colon, cecum, and rectum) were collected immediately after slaughter. The collected samples were wrapped with sterilized tinfoil and snap-frozen in liquid nitrogen and stored at − 80 °C until RNA isolation.

RNA Isolation and cDNA Preparation.
Total RNA was extracted from collected mucosa samples using TRIZOL (Invitrogen, California, USA) according to the manufacturer's instructions. After genomic DNA was eliminated by digestion with DNase I (Thermo Scientific, Waltham, USA), the RNA quality and quantity were determined. Afterwards, 1 μ g of the extracted RNA was reverse-transcribed to synthesize tissue specific cDNA using PrimeScript ™ RT reagent Kit (Takara, Dalian, China) immediately. Briefly, a 20 μ L reverse transcription mixture that contained 1 μ g of total RNA, 2 μ L 5 × gDNA Eraser Buffer, 4 μ L 5 × PrimeScript Buffer, 1 μL PrimeScript RT Enzyme Mix, 1 μ L RT Primer Mix and 10 μ L RNase Free dH 2 O was prepared. This reaction mixture was incubated for 2 min at 42 °C, followed by a reverse transcription step for 15 min at 37 °C, and a final heating step at 85 °C for 5 s to stop the reaction. The prepared cDNA samples were stored at − 20 °C until subsequent quantitative real-time PCR analysis. Primer Design. Primers for quantitative real-time PCR analysis were designed according to Ghrelin (Accession No.: AB089200) and GHR (Accession No.: NM_001285648) gene sequences of Capra hircus reported online. β -Actin (Accession No.: NM_001009784.1) was used as a housekeeping gene in quantitative real-time PCR analysis. All primers were synthesized by Sangon Biotech (Sangon Biotech, Shanghai, China), and the primer sequences are shown in Table 8.
Quantitative Real-Time PCR Analysis. The quantitative real-time PCR was performed on an ABI-7900HT qPCR system (Applied Biosystems, Foster City, CA, USA) using FG POWER SYBR GEREEN PCR MASTER MIX (Applied Biosystems, Foster City, CA, USA). The quantification of the PCR products of Ghrelin and GHR genes was evaluated in comparison with the PCR products of β -actin. The relative changes Statistical Analysis. The effect of the feeding system (S vs G) on the expression of GHR and Ghrelin was examined from time points d 28 to 70. The data were analyzed as a completely randomized design with the MIXED procedures of SAS (SAS Inst. Inc., Cary, NC) with a model that included the fixed effect of feeding system, age, and the feeding system × age interaction, with an individual animal as the experiment unit, as described in our previous study 27 . Orthogonal contrasts were used to test the linear and quadratic effects of age. The effects of age was tested with animal nested within age as the random effect and individual animal as the experimental unit. Statistical significance was defined as P < 0.05. The advantage of MIXED from general linear model (GLM) is it can handle correlated data and unequal variances, and it encompasses all models in the variance components procedure. In the linear mixed-effects model, the responses from a subject are the sum of fixed and random effects. The fixed effect affects the population mean and the random effect is associated with a sampling procedure. Another difference between MIXED and GLM is that MIXED is based on maximum likelihood (ML) and restricted maximum likelihood (REML) methods, while GLM uses the analysis of variance (ANOVA) methods. ANOVA can deal with balanced designs, whereas ML and REML are efficient with balanced and unbalanced designs (modeling real data). The full dataset with the experimental results is available as SM01 41 .
Machine Learning Models. In the first step, the raw dataset was used to create the corresponding normalized dataset and the training and test sub-sets (using an R script): 75% training set (train) and 25% test set (test) (SM02 47 ). The datasets and the R script are available in Figshare (SM04) 56 .
Two pieces of software were used to build regression models: STATISTICA and RRegrs. With STATISTICA multilinear and neural network regressions were used. The resulting models were characterized by the R test values (regression coefficient for test subset). In addition, the corresponding R 2 test values were added to the standard STATISTICA outputs using R test .
RRegrs is an R integrated framework used to create ten linear and non-linear regression models 45,46 . Due to the computational limitations generated by the big datasets, only seven RRegrs methods were used: Multiple Linear regression (LM), Generalized Linear Model with Stepwise Feature Selection (GLM) 50 , Partial Least Squares Regression (PLS) 51 , Lasso regression (Lasso) 52 , Elastic Net regression (ENET) 53 , Neural Networks regression (NN) 54 , and Random Forest (RF) 55 (SM03 48 ). In general, default values of specific parameters were used for the regression methods. For some procedures, particular variations of the method parameters were studied in order to provide the best possible regression model. The standard RRegrs call is not prepared for big datasets. Thus, individual RRergrs methods were used with only one training/test split, without several RRegrs features (Y randomization, RRegrs plotting, data scaling, near-to-zero variance filtering, feature correlation removal), and using specific method calls (RRegrs::Method, where Method is the name of the regression function in RRegrs). The graphics were constructed externally by saving the generated model objects. The criteria to find the best model are the same as for RRegrs: maximum R 2 test and minimum RMSE test . The dataset splitting and the RRegrs results can be reproduced by using the same parameters and values of the seeds in the scripts. The best model available as Figshare item can be downloaded and studied with R for other statistics (SM04) 56 .
In order to compare them with the STATISTICA results, additional R.ts values were calculated as sqrt(R 2 test ). Thus, in the results from STATISTICA and RRegrs, both R and R 2 values have been reported. The importance of the features for the RRegrs models was calculated with caret functions varImpPlot(fited.model) and varImp(fited. model), where "fited.model" is the fitted model for the regression methods. The residual plot to remove the outliers and the best model (RF) are available in Figshare (SM04) 56 .