Using random forest to identify longitudinal predictors of health in a 30-year cohort study

With the wealth of exposome data now available from longitudinal cohort studies, the need for methods to adequately analyze these data is growing. We propose an approach in which machine learning is used to identify longitudinal exposome-related predictors of health, and we illustrate its potential through an application: studying the relation between the exposome and self-perceived health in the 30-year running Doetinchem Cohort Study. Random Forest (RF) was used to identify the strongest predictors because of its favorable prediction performance in prior research. The relation between predictors and outcome was visualized with partial dependence and accumulated local effects plots. To facilitate interpretation, exposures were summarized as the average exposure and the average trend over time. The RF model's ability to discriminate poor from good self-perceived health was acceptable (Area-Under-the-Curve = 0.707). Nine exposures from different exposome-related domains were largely responsible for the model's performance, while 87 exposures seemed to contribute little. Our approach demonstrates that machine learning models can be interpreted to a greater extent than is widely believed, and can be applied to identify important longitudinal predictors of health over the life course in studies with repeated measures of exposure. The approach is context-independent and broadly applicable.


Supplementary Text S1. Statistical analysis
In the next section, we describe the different steps towards a prediction model for self-perceived health based on demographic, lifestyle, environmental, and biological exposures using RF. We begin by explaining how longitudinal exposures were assessed (Step 1), followed by a description of the RF algorithm (Step 2), how it was optimized using tuning parameters, and how its prediction performance was determined (Step 3). This is followed by an explanation of how the importance of the individual exposures in predicting health was assessed (Step 4) and what exposures needed to be included to create a parsimonious model with good performance using exposure selection through cross-validation (Step 5). Lastly, the visual representation of the relation between the most important exposures and health using partial dependence and accumulated local effects plots is explained (Step 6).

Step 1: Assessing longitudinal exposures
Most exposures were assessed during multiple measurement rounds. Using the exposure variable set as-is implies that each exposure at a given round is considered a potential predictor. To facilitate interpretation, we pre-processed and summarized continuous exposures by introducing the so-called Area-Under-the-Exposure (AUE) and the Trend-of-the-Exposure (TOE).
The AUE represents the average of the continuous exposure during the rounds leading up to the round at which the self-rated health outcome is observed. Underlying this construct is the assumption that prolonged exposure over time is particularly predictive of the outcome (self-perceived health). The AUE is computed for each exposure variable and for each individual separately, by plotting observed exposure values against rounds, connecting the values with lines, and determining the average area under these lines (that is, the total area under the lines divided by the number of exposure measurement rounds minus one). The higher the AUE, the higher the prolonged exposure over the life course.
The TOE represents the average trend in the exposure. Here the assumption is that a positive or negative trend is also predictive of the outcome. The TOE is computed for each individual and exposure by determining the slope in exposure for each pair of subsequent rounds (the slope for rounds 1-2, for rounds 2-3, etc.) and averaging these slopes. A positive TOE indicates an upward trend in exposure, whereas a negative TOE indicates a downward trend.
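As a concrete illustration, the two summary measures can be computed as follows (a minimal sketch in Python; the function names `aue` and `toe` are ours, and equally spaced rounds are assumed unless round numbers are supplied):

```python
import numpy as np

def aue(values, rounds=None):
    """Area-Under-the-Exposure: total area under the lines connecting the
    observed values, divided by the number of measurement rounds minus one."""
    values = np.asarray(values, dtype=float)
    rounds = np.arange(1, len(values) + 1) if rounds is None else np.asarray(rounds, dtype=float)
    # Trapezoid area of each segment between subsequent rounds, then the average.
    segment_areas = (values[:-1] + values[1:]) / 2 * np.diff(rounds)
    return segment_areas.sum() / (len(values) - 1)

def toe(values, rounds=None):
    """Trend-of-the-Exposure: average of the slopes between subsequent rounds."""
    values = np.asarray(values, dtype=float)
    rounds = np.arange(1, len(values) + 1) if rounds is None else np.asarray(rounds, dtype=float)
    return (np.diff(values) / np.diff(rounds)).mean()

# Hypothetical BMI profile over five rounds: a steadily increasing exposure.
bmi = [24.0, 25.0, 26.0, 27.0, 28.0]
print(aue(bmi))  # 26.0
print(toe(bmi))  # 1.0
```

A steadily rising profile thus yields a positive TOE, while the AUE reflects the overall level of exposure across the rounds.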
For categorical exposure variables the AUE and TOE were defined differently. The AUE was defined as the proportion of rounds in which the individual occupied a certain state (corresponding to a unique category of the variable). For instance, for the smoking variable the AUE was defined as the proportion of time that the individual was a smoker. The TOE was defined as an indicator variable signifying whether a change from the reference category to another category had occurred during the rounds, for instance whether an individual had gone from 'no smoking' to 'smoking'. For every categorical exposure, one TOE variable indicating a change from a reference category to another category was included.
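For categorical exposures, the same idea can be sketched as follows (again in Python; the exact operationalization of the change indicator, i.e. a move from the reference category in an earlier round to the other category in a later round, is our reading of the description above):

```python
def aue_categorical(states, category):
    """AUE for a categorical exposure: the proportion of rounds in which the
    individual occupied the given category (state)."""
    return sum(s == category for s in states) / len(states)

def toe_categorical(states, reference, category):
    """TOE for a categorical exposure: indicator (1/0) of whether a change from
    the reference category to the other category occurred during the rounds."""
    for i, s in enumerate(states):
        if s == reference and category in states[i + 1:]:
            return 1
    return 0

smoking = ["no smoking", "no smoking", "smoking", "smoking", "smoking"]
print(aue_categorical(smoking, "smoking"))                # 0.6
print(toe_categorical(smoking, "no smoking", "smoking"))  # 1
```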
Supplementary Figure S1 and Supplementary Table S1 provide an example of the calculation of the AUE and TOE for a continuous exposure (BMI) and a categorical exposure (smoking status), respectively.
An advantage of this approach is that the AUE and TOE can also be calculated in case of missing values, which is a common problem in longitudinal cohort studies. To calculate the longitudinal exposures, participants had to have a value for the exposure in at least two rounds. Participants with missing values for an exposure in 4 or 5 rounds were labelled as missing (i.e. 99999, an outlier value for continuous exposures and a missing category for categorical exposures) on that particular longitudinal exposure. In total, 75% of the participants had no missing values on the longitudinal exposures, 7% had a missing value on one longitudinal exposure, 5% on two longitudinal exposures, and 13% on three or more longitudinal exposures.

Figure S1. Hypothetical example in different individuals of the Area-Under-the-Exposure (AUE) and Trend-of-the-Exposure (TOE) for the continuous exposure body mass index (BMI), showing a low AUE and increasing TOE (a), a low AUE and decreasing TOE (b), a high AUE and increasing TOE (c), and a high AUE and decreasing TOE (d). A missing point at a certain round indicates a missing value.

In our approach we included measures that represent the average value and average trend of an exposure measured during multiple rounds. If considered appropriate, other summary measures could also be included, for example a measure of the standard deviation of the AUE and TOE to capture variation in these measures. Furthermore, in our approach the AUE and TOE were based on exposures measured in rounds 1 through 5, while the outcome measure was assessed at round 6. However, one could also consider including exposures assessed at round 6 in the summary measures.

Step 2: Choosing a ML algorithm: random forest
To analyze which longitudinal exposures had the greatest predictive value for self-perceived health, the random forest (RF) algorithm was used [1]. This non-parametric machine learning algorithm is one of the top-performing algorithms for classification problems [2] and consists of an ensemble of decision trees that predict the outcome measure. In our study, decision trees can be used to classify participants as having good vs. poor perceived health based on the values of the longitudinal exposures. Decision trees are easy to interpret, but they are prone to overfitting and thus yield predictions that do not generalize well to datasets other than the one on which they were built. RF deals with this limitation by building a forest of decision trees [1]. To this end, RF first creates a bootstrap sample by randomly selecting individuals, with replacement, from the original dataset. The bootstrap sample contains the same number of observations as the original dataset.
On each bootstrap sample a decision tree is built by repeatedly making binary splits (i.e. dividing the data into two partitions). Each split is made by first taking a random subset of exposures, then picking one exposure and an accompanying split point such that a pre-specified criterion is optimized. For RF, the Gini impurity measure is the default criterion to measure how well the split is able to distinguish good from poor health [1]. Splits are made until a certain tree depth is reached, or until a partition has reached a minimum node size (see Step 3). The final partitions are called terminal nodes. A prediction for a new observation can be made by identifying the partition in which the new observation falls, and then assigning the majority vote in that partition to the new observation. Within RF, the step of creating a bootstrapped dataset and building a decision tree on it is repeated many times, resulting in a forest of trees. The entire forest is used to obtain predicted probabilities of poor health. For every individual, the predicted class (i.e. good or poor health) of each decision tree is obtained, and the proportion of trees that predict poor health is used as the predicted probability of poor health.
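The vote-aggregation step can be sketched as follows (Python with scikit-learn on synthetic data, purely as an illustration; note that scikit-learn's own `predict_proba` averages per-tree class probabilities rather than hard majority votes, so the vote-based proportion described above is computed explicitly here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the exposure matrix; class 1 plays the role of poor health.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# For each individual, collect the predicted class of every tree; the proportion
# of trees predicting class 1 is the predicted probability of poor health.
votes = np.stack([tree.predict(X[:5]) for tree in rf.estimators_])
p_poor = votes.mean(axis=0)
print(p_poor)  # five probabilities between 0 and 1
```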
RF typically measures prediction performance through the so-called out-of-bag (OOB) error [1]. A limitation of using the OOB error as a performance metric is that it is directly tied to accuracy (the number of individuals whose class is predicted correctly) and can be misleading when the outcome in the dataset is imbalanced. In our study, nearly 16% of individuals had poor perceived health. A 'dummy' (and naïve) classifier would classify all individuals as having good health, leading to an error rate of 16%. If the RF achieves an OOB error of 16%, it would seem to perform quite reasonably, when it actually does no better than the dummy classifier. To overcome this issue, we used the Receiver Operating Characteristic (ROC) curve and its Area-Under-the-Curve (AUC) instead. The AUC does not depend on the prior class probabilities. Next to the AUC, we reported the sensitivity, specificity, and accuracy corresponding to (a) the optimal threshold, defined as the threshold in the ROC curve for which the sum of sensitivity and specificity is maximized, and (b) a predefined threshold of 0.5.
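These threshold-based metrics can be sketched as follows (Python with scikit-learn; `threshold_metrics` is a hypothetical helper, with the optimal threshold found by maximizing tpr - fpr, which is equivalent to maximizing the sum of sensitivity and specificity):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def threshold_metrics(y_true, p_poor):
    """AUC plus (sensitivity, specificity) at the optimal and the 0.5 threshold."""
    y, p = np.asarray(y_true), np.asarray(p_poor)
    fpr, tpr, thresholds = roc_curve(y, p)
    best = np.argmax(tpr - fpr)  # maximizing tpr - fpr maximizes sens + spec

    def sens_spec(thr):
        pred = (p >= thr).astype(int)
        return (pred[y == 1] == 1).mean(), (pred[y == 0] == 0).mean()

    return roc_auc_score(y, p), thresholds[best], sens_spec(thresholds[best]), sens_spec(0.5)

# Toy example: four individuals with known outcome and predicted probability.
auc, thr, at_best, at_half = threshold_metrics([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc, thr, at_best, at_half)
```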
To examine the extent to which the observed outcomes and predicted probabilities were in agreement, the average predicted risk was compared with the overall outcome rate (i.e. calibration-in-the-large) [3]. Furthermore, a calibration curve was constructed by plotting the predicted probabilities against the percentage of observed outcomes.
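Both checks are simple to compute; a minimal sketch (the function names are ours, and the equal-width probability bins are one of several possible binning choices):

```python
import numpy as np

def calibration_in_the_large(y_true, p_pred):
    """Average predicted risk minus the observed outcome rate; ~0 is well calibrated."""
    return float(np.mean(p_pred) - np.mean(y_true))

def calibration_curve_points(y_true, p_pred, n_bins=10):
    """(mean predicted probability, observed outcome fraction) per probability bin."""
    y, p = np.asarray(y_true, dtype=float), np.asarray(p_pred, dtype=float)
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    return [(p[bins == b].mean(), y[bins == b].mean())
            for b in range(n_bins) if (bins == b).any()]

print(calibration_in_the_large([0, 1, 1, 0], [0.5, 0.5, 0.5, 0.5]))  # 0.0
```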

Step 3: Optimizing prediction performance
In general, the prediction performance of the RF algorithm is adequate when using the default settings for its tuning parameters [4]. However, these parameters can be tuned to further improve prediction performance. Commonly used tuning parameters are mtry, ntrees, nodesize, and maxnodes [4,5]. The size of the random subset of exposures that is considered at each split of the tree can be altered; this parameter is called mtry. The default of mtry for classification (i.e. a categorical outcome) is the square root of the number of included exposures. The number of trees (ntrees) of which the RF consists should be set sufficiently high (default = 500). Nodesize, the minimum number of observations in the terminal nodes (the leaves of the tree), is usually set at 1 for classification and at 5 for regression (i.e. a continuous outcome). By adjusting maxnodes, the maximum number of terminal nodes that trees in the forest can have, the depth of the trees can be controlled. By default, trees are grown to the maximum possible depth.
To choose the optimal parameter settings, we randomly divided the dataset into a training dataset and a test dataset using an 80%/20% split, with a similar distribution of the proportions of good/poor perceived health in both datasets. Next, we selected the combination of tuning-parameter settings that produced the highest prediction performance on the training dataset, using a grid search combined with 5-fold cross-validation (the choice of k in k-fold cross-validation is usually 5 or 10) with the R package caret [6]. Lastly, the model with the optimal parameter settings was used to make predictions on the test dataset, and the corresponding ROC curve and AUC were determined.
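The analysis itself was run with the R package caret; as an illustration, a comparable grid search could look as follows in Python with scikit-learn (synthetic data with roughly a 16% outcome rate; `max_features` and `min_samples_leaf` play the roles of mtry and nodesize, and the number of trees is kept small here purely for speed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic data with roughly 16% 'poor health' (class 1), as in the study.
X, y = make_classification(n_samples=300, n_features=12, weights=[0.84], random_state=1)

# 80%/20% split, stratified so both parts keep the good/poor health proportions.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)

grid = {"max_features": [2, 3, 4], "min_samples_leaf": [1, 5]}  # mtry and nodesize
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=1),
    grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
).fit(X_tr, y_tr)

# Evaluate the tuned model once on the held-out 20% test dataset.
auc_test = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])
print(search.best_params_, round(auc_test, 3))
```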

Step 4: Ranking the variable importance
One of the primary outputs of RF is the variable importance ranking, which ranks the exposures by their contribution to the prediction performance of the RF. For classification, the variable importance plot shows a list of 'most relevant' variables, ranked by the mean decrease in accuracy (MDA) that occurs when the particular exposure is randomly permuted in the RF. If the exposure is strongly predictive of the outcome, the random permutation will lead to a large MDA. As the MDA indicates how much accuracy the prediction model loses when an exposure is permuted, it provides insight into the predictive value of that exposure in addition to all other exposures. Variables with a large MDA can be considered strong independent predictors of the outcome. The variable importance ranking can thus be used to investigate and identify associations between exposures and the outcome. Besides the effect size of the exposure, the RF variable importance ranking also automatically captures non-linear and interaction effects without hard-coded specification of these effects [4]. We obtained the variable importance ranking by taking the optimal tuning-parameter settings and fitting a RF on the entire dataset, analogous to what would be done in typical epidemiological/statistical association studies. In the current study, we show the 30 top-ranked exposures in the variable importance ranking.
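scikit-learn offers a close analogue of this permutation-based importance (a sketch on synthetic data; note that R's randomForest computes the MDA on the out-of-bag samples, whereas `permutation_importance` below scores on whatever data it is given):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=2)
rf = RandomForestClassifier(n_estimators=300, random_state=2).fit(X, y)

# Mean decrease in accuracy when each exposure is permuted, averaged over repeats.
result = permutation_importance(rf, X, y, scoring="accuracy", n_repeats=10, random_state=2)
ranking = result.importances_mean.argsort()[::-1]  # exposure indices, most important first
print(ranking)
```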

Step 5: Selecting exposures through cross-validation
A good prediction model strikes a balance between prediction accuracy and parsimony, which means that exposures that do not or hardly contribute to the prediction performance of the model should be excluded. In other words, the parsimonious model should perform at least as well as, or not substantially worse than, a model with all variables. The parsimonious model cannot be obtained through the variable importance ranking alone; instead, the number of variables included in the final prediction model is treated as a tuning parameter (q), and the relation between q and the prediction performance is evaluated through a separate procedure. This involves (a) creating 5 partitions in the 80% training dataset, (b) forming a temporary training dataset based on 4 partitions and a temporary validation dataset based on the remaining partition, and training a RF, (c) choosing a value of q and selecting the top-ranked q variables from the variable importance ranking of the trained RF, (d) building a new model on the temporary training dataset with just these q variables, and (e) making predictions with this new model on the validation dataset. This procedure is repeated until all partitions have been used as a validation dataset and all values of q have been evaluated. Afterwards, the AUC was estimated for each choice of q and plotted against q. The optimal value of q was chosen based on the flattening of the resulting curve. Next, the optimal value of q was used to make predictions on the 20% test dataset and to determine the corresponding ROC curve and AUC.
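Steps (a) to (e) can be sketched as follows (Python with scikit-learn on synthetic data; for brevity this sketch ranks exposures by scikit-learn's Gini-based `feature_importances_` rather than the permutation-based MDA used in the study):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=3)
q_grid = [1, 2, 4, 8, 10]
auc_per_q = {q: [] for q in q_grid}

for tr, va in StratifiedKFold(n_splits=5, shuffle=True, random_state=3).split(X, y):
    # (b) train a RF on 4 partitions and rank the exposures on that model.
    rf = RandomForestClassifier(n_estimators=100, random_state=3).fit(X[tr], y[tr])
    order = rf.feature_importances_.argsort()[::-1]
    for q in q_grid:
        # (c)-(e) refit with the top-q exposures, evaluate on the held-out partition.
        top = order[:q]
        rf_q = RandomForestClassifier(n_estimators=100, random_state=3).fit(X[tr][:, top], y[tr])
        auc_per_q[q].append(roc_auc_score(y[va], rf_q.predict_proba(X[va][:, top])[:, 1]))

mean_auc = {q: float(np.mean(v)) for q, v in auc_per_q.items()}
print(mean_auc)  # the AUC-vs-q curve typically flattens once q covers the informative exposures
```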

Step 6: Plotting partial dependence plots and accumulated local effects plots
The variable importance ranking identifies the most important exposures predicting self-perceived health. However, it does not provide information about the shape of the relation between an exposure and self-perceived health. To visualize this relation, partial dependence plots (PDPs) [7] and accumulated local effects (ALE) plots [8] were produced. A PDP plots the average predicted outcome on the y-axis against each value of the exposure on the x-axis. The average predicted outcome for a given exposure value was obtained by setting the exposure to that value for the entire sample population while leaving all other exposures unchanged, then generating predictions and averaging over them. ALE plots, on the other hand, only look at the local effects of an exposure, i.e. the effect is estimated in a subpopulation located within a certain range of the exposure. An advantage of ALE plots is that they largely avoid extrapolating the effect to values of the exposure that do not occur (in combination with certain values of another exposure) in the dataset, which is a particular problem when exposures are highly correlated [8]. A consequence, however, is that the local effects only apply to the specific subpopulation for which they were calculated, making the size of different local effects difficult to interpret and compare. In the current study, both PDPs and ALE plots were produced for the most important exposures selected through cross-validation as described above. The PDPs provide a general sense of the effect size, while the ALE plots were used to check whether the slopes observed in the PDPs might result from extrapolation issues.
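The PDP computation described above is straightforward to implement by hand (a sketch on synthetic data; the `pdp` helper is ours, and dedicated implementations exist in, e.g., scikit-learn's inspection module):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=4)
rf = RandomForestClassifier(n_estimators=200, random_state=4).fit(X, y)

def pdp(model, X, feature, grid):
    """Partial dependence: for each grid value, set the exposure to that value
    for everyone, leave the rest unchanged, predict, and average."""
    curve = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v  # fix the exposure at v for the whole sample
        curve.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(curve)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
curve = pdp(rf, X, 0, grid)
print(curve)  # average predicted probability at each grid value of exposure 0
```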
Supplementary  (1.256) and the lowest specificity and sensitivity at the predefined threshold of 0.5 (0.730 and 0.685, respectively); these prediction performance metrics did not differ substantially from those of the other models.

Figure S2. Calibration curve showing the predicted probabilities and the observed outcome (i.e. poor self-perceived health) percentages in the test dataset. The dotted line represents perfect calibration. None of the participants had a predicted probability of 60% or higher, and only one participant had a predicted probability >55%, explaining the spike at the end of the line.

Figure S4. Distribution of the values of the predictors of self-perceived health. Sleep.AUE is the sleep duration in hours per day, where 1 = ≤5, 2 = 6, 3 = 7, 4 = 8, and 5 = ≥9 hours of sleep. AUE, Area-Under-the-Exposure; BMI, body mass index; r5, round 5; WHR, waist/hip ratio.