Functional random forest with applications in dose-response predictions

Rahman, Raziur; Dhruba, Saugato Rahman; Ghosh, Souparno; Pal, Ranadip

doi:10.1038/s41598-018-38231-w

Open access
Published: 07 February 2019

Scientific Reports volume 9, Article number: 1628 (2019) Cite this article


Abstract

Drug sensitivity prediction for individual tumors is a significant challenge in personalized medicine. Current modeling approaches consider prediction of a single metric of the drug response curve such as AUC or IC₅₀. However, a single summary metric of a dose-response curve fails to provide the entire drug sensitivity profile, which can be used to design the optimal dose for a patient. In this article, we assess the problem of predicting the complete dose-response curve based on genetic characterizations. We propose an enhancement to the popular ensemble-based Random Forests approach that can directly predict the entire functional profile of a dose-response curve rather than a single summary metric. We design functional regression trees with node costs modified based on dose/response region dependence methodologies and response distribution based approaches. Our results on large pharmacological databases such as CCLE and GDSC show that the proposed functional framework predicts dose-response curves more accurately than univariate or multivariate Random Forests that predict sensitivities at different dose levels. Furthermore, we also considered the problem of predicting functional responses from functional predictors, i.e., estimating the dose-response curves with a model built on dose-dependent expression data. The superior performance of Functional Random Forest using functional data as compared to existing approaches has been shown using the HMS-LINCS dataset. In summary, Functional Random Forest presents an enhanced predictive modeling framework to predict the entire functional response profile considering both static and functional predictors instead of predicting the summary metrics of the response curves.

Introduction

Precision medicine plays an important role in the push towards advancing cancer therapy. A significant step in the process involves mapping genetic characterizations to the applied drug sensitivity response. A multitude of approaches have been proposed to address the issue of predictive modeling of drug sensitivity but the results still indicate a significant scope for improvement^1,2,3,4. Crowd-sourced initiatives such as the NCI-DREAM Drug Sensitivity Prediction Challenge² enabled the performance evaluation of multiple algorithms on the same dataset, although they were restricted to a smaller number of samples. Recently, a number of pharmacological databases^1,5,6 have been made public to assist researchers in validating their predictive algorithms using larger biological datasets.

Drug sensitivity information in the form of responses for different doses represented as a curve is becoming more prevalent for cancerous cell lines with the advent of advanced data collection techniques. Such datasets are often referred to as functional data⁷. Typical approaches for sensitivity prediction predict a summary metric of the entire drug response curve such as Area Under the Curve (AUC) or IC₅₀. The problem of predicting a summary metric of the drug response curve has been tackled using a diverse set of regression approaches such as linear regression with regularization, nonlinear regression, kernel based techniques and ensemble based approaches^2,8,9,10. Additionally, drug sensitivity prediction modeling has also been proposed based on features extracted using Principal Component Analysis (PCA)¹¹.

A primary concern in using a single drug sensitivity summary metric is that it fails to describe the entire dose-response effect, i.e., it represents just a particular scenario such as the drug concentration to achieve 50% cell viability (IC₅₀) or the inflection point of the dose-response fitted curve (EC₅₀) or the maximal activity reached in the curve (A_max)¹ or the area under the fitted curve (AUC). Meanwhile, various functional regression models have been proposed in other research areas to predict the entire response curve¹². Yu et al.¹³ have presented each response curve as a linear combination of known basis functions and grown regression trees using the coefficients of this expansion, while Nerini et al.¹⁴ have proposed functional PCA in the classification method for easy representation of regression trees. The knowledge of the entire drug response curve can answer clinically relevant questions such as what will be the sensitivity at the highest non-toxic dose concentration (toxicity can be estimated using experimentations on normal cells or computational modeling) or the sensitivity at the drug concentration available at the targeted organ (pharmacokinetics estimated using micro-dosing) for that specific patient? Furthermore, a summary metric such as AUC may be the same for two different dose-response curves even though the curves carry different information, such as very high sensitivity at high doses for drug A as compared to relatively moderate sensitivity over all drug doses for drug B. Note that drug A at high doses might be better at killing most cancer cells than drug B, which will not be apparent through AUC prediction.

Thus, there is a need for entire dose response curve prediction which is not handled directly by existing regression models. In one of our previous works¹⁵, we have used each dose-response point to build individual regression models for prediction purposes. However, the individual models lack incorporation of the continuous nature of the dose-response curve. In this paper, we are proposing the incorporation of dose-response points or distributions in the generation of regression tree node cost and leaf nodes to improve the accuracy of Random Forest (RF) model for sensitivity prediction. At each regression tree node, region-wise response points or distributions (Gaussian) are considered to calculate the node cost. The leaf nodes store the functional data used to predict the entire dose-response profile for test samples, while the model input consists of genomic characterization in regular form or continuous curve form. We present methodologies that can consider both regular and functional inputs. For analysis purposes, each response curve has been approximated by a linear combination of B-spline functions¹³ and thus, the framework can also be applied in scenarios different from drug sensitivity prediction. We validate our proposed Functional Random Forest (FRF) approach using data from the well-known pharmacological databases of Cancer Cell Line Encyclopedia (CCLE)¹ and Genomics of Drug Sensitivity for Cancer (GDSC)⁵.

The article is organized as follows: The Materials and Methods section compiles the basic steps involved in designing FRF models while discussing the impact of storing functional data in forest leaf nodes and highlighting the region-wise node cost procedures. The Results section provides the performance evaluation of FRF model for both synthetic experiments and actual pharmacological data. Furthermore, it also presents the biological importance of genes selected by FRF. Finally, the Discussion section points out the advantages of using FRF to predict the dose-response curves in the larger context of drug sensitivity prediction and provides possible future research directions.

Materials and Methods

The idea of Functional Random Forest is based on regular regression tree based Random Forest. Thus, we will first describe the design procedure for regular regression trees and subsequently present the construction of functional regression tree based FRF approach. Before delving into the details of tree construction, we describe the datasets used for this study which will help us establish a number of theoretical assumptions in the methodology.

Datasets and Preprocessing

For our experiments, we have considered two of the most comprehensive publicly available cancer pharmacogenomics databases: Cancer Cell Line Encyclopedia (CCLE)¹ and Genomics of Drug Sensitivity for Cancer (GDSC)⁵. The CCLE database was generated by the Broad Institute and the Novartis Institutes for Biomedical Research. This database includes genetic and pharmacological characterization of 947 human cancer cell lines, together with pharmacological profiling of 24 small molecules (anticancer compounds) across ~500 of these cell lines that encompass 36 tumor types¹. The response of a cell line to a specific drug is reported for 7 to 8 dose points ranging from 0.0025 μM to 8 μM. Additionally, four different drug sensitivity measures EC₅₀, IC₅₀, A_max and AUC are listed. Note that these measures are features of a dose-response curve fitted from the observed dose-response points. The GDSC database was created as part of the Cancer Genome Project⁵ and contains gene expression data for 789 cell lines and drug responses for 714 cell lines. Each cell line has 22,277 probe sets for gene expression, yielding a high dimensional feature space. Similar to CCLE, each cell line’s response to the drugs is reported for 7 to 9 dose points where the minimum dose ranges from 3 × 10⁻⁵ μM to 15.625 μM and the maximum dose ranges from 0.008 μM to 4000 μM. For our experiment, we utilize GDSC v5 that lists two drug sensitivity measures IC₅₀ and AUC along with 105 different IC values for different levels of cell viability from 0.1% to 100% in each cell line for each drug. Note that these IC values are extracted from the complete dose-response curves fitted from the observed dose-response points and extrapolated to 100% cell viability as the curves do not reach 100% at maximum dose for most cell line–drug pairs. Both CCLE and GDSC provide observed dose-response points or fitted curve points which could be utilized as our functional response data.
However, the genomic characterization data are available in the stationary format as the expressions are measured before any drug application. Therefore, to demonstrate the functional input and output scenario for our FRF model, we have used data from the Harvard Medical School Library of Integrated Network-Based Cellular Signatures (HMS-LINCS) database, which to our knowledge, is the only publicly available source offering functional responses as well as predictors. HMS-LINCS offers genomic characterization data in the form of Reverse Phase Protein Array (RPPA) expression data for 21 proteins where phosphorylation state and protein levels were measured in 10 BRAF^V600E/D melanoma cell lines at 7 different doses and 5 different time points¹⁶. The cellular response data consist of viability and apoptosis measured in the same cell lines using a fluorescence imaging apoptosis assay for the same 7 doses but 3 different time points¹⁶. The database contains data for 9 BRAF^V600E and 1 BRAF^V600D melanoma cell lines that were exposed to 4 RAF inhibitors and 1 MEK inhibitor at 7 different doses ranging from 3.2 nM to 3.2 μM. Protein expression data are available for 5 different time points: 1, 5, 10, 24 and 48 hours post drug application and apoptosis data are available for 24, 48 and 72 hours post drug application. For compound sensitivity assessment, two different measures are available: relative viability and mean apoptosis fraction, computed using the number of apoptotic cells and the total number of cells normalized with the DMSO control^16,17.

Figure 1 illustrates the pictorial representations of genomic and functional characterizations data, where the left half shows the static and functional format of genomic characterizations and the right half demonstrates the dose-response curves for various cell line–drug pairs and different summary metrics extracted from such a curve.

Random Forest Regression

Random Forest consists of an ensemble of T un-pruned regression trees¹⁸ that are generated based on bootstrap sampling from the original training data. The bootstrap resampling of the data for training each tree increases the diversity between the trees. Each tree is composed of a root node, branch nodes and leaf nodes. For each node of a tree, the optimal node splitting feature is selected from a set of m features that are again randomly selected from a feature space of size M. If $m\ll M$, the selection of the node splitting feature from a random set of features decreases the correlation between different trees and thus, the average response of multiple regression trees is expected to have lower variance than the individual regression trees. However, there exists a trade-off, as a larger m can improve the predictive capability of individual trees but can also increase the correlation between trees and negate any gains from averaging multiple predictions.
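The two sources of randomness described above (bootstrap resampling per tree and a random feature subset per node) can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names `bootstrap_indices` and `random_feature_subset` and the example sizes are hypothetical and are not taken from the authors' code.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n sample indices with replacement (one bootstrap set per tree)."""
    return [rng.randrange(n) for _ in range(n)]


def random_feature_subset(M, m, rng):
    """Draw m distinct candidate features out of M (redrawn at every node)."""
    return rng.sample(range(M), m)


rng = random.Random(0)                          # fixed seed for reproducibility
boot = bootstrap_indices(75, rng)               # illustrative: n = 75 samples
features = random_feature_subset(20, 5, rng)    # illustrative: M = 20, m = 5
```

Because the bootstrap draws with replacement, some samples appear several times in `boot` and others not at all, which is what makes the trees differ from one another.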

Process of splitting a node

Let x_tr(i, j) and y(i) denote the training input feature j and output response, respectively, for sample i where $i=1,2,\ldots ,n,\,j=1,2,\ldots ,M$. At any node η_P, we aim to select a feature j_s from a random set of m (<M) features and a threshold z to partition the node into two child nodes η_L (left node with samples satisfying ${x}_{tr}(i\in {\eta }_{P},{j}_{s})\le z$) and η_R (right node with samples satisfying ${x}_{tr}(i\in {\eta }_{P},{j}_{s}) > z$). We consider the node cost as sum of square deviances (SSD), i.e.

$$D({\eta }_{P})=\sum _{i\in {\eta }_{P}}\,{(y(i)-\mu ({\eta }_{P}))}^{2}$$

(1)

where $\mu ({\eta }_{P})={\mathbb{E}}[y(i\in {\eta }_{P})],\,{\mathbb{E}}[\,\cdot \,]$ denotes the Expected value. Thus, the reduction in cost (i.e., reward function) for partition γ at node η_P is given in Eq. (2), where the goal is to select the partition γ* ∈ η_P that maximizes the reward or, minimizes the cost.

$$\begin{array}{rcl}C(\gamma ,{\eta }_{P}) & = & D({\eta }_{P})-D({\eta }_{L})-D({\eta }_{R})\\ {\gamma }^{\ast } & = & {\rm{\arg }}\mathop{{\rm{\max }}}\limits_{\gamma }\,C(\gamma ,{\eta }_{P})\end{array}$$

(2)

Note that for a continuous feature with n samples, a total of n partitions needs to be checked i.e., the computational complexity of each node split is O(mn). During tree generation, a node with n ≤ n_size samples is not partitioned any further where n_size is a pre-specified sample size threshold.
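The split search in Eqs. (1) and (2) can be illustrated for a single candidate feature as follows. This is a sketch under simple assumptions: thresholds are taken at the observed feature values, the data are small Python lists, and the names `ssd` and `best_split` are hypothetical.

```python
def ssd(values):
    """Sum of squared deviations from the mean, as in Eq. (1)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, reward) maximizing Eq. (2) for one feature.

    x holds the feature values of the samples at the node, y their responses.
    Returns (None, -inf) when all feature values are identical.
    """
    parent_cost = ssd(y)
    best_z, best_reward = None, float("-inf")
    # Candidate thresholds: every observed value except the largest,
    # so that both child nodes are non-empty.
    for z in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= z]
        right = [yi for xi, yi in zip(x, y) if xi > z]
        reward = parent_cost - ssd(left) - ssd(right)
        if reward > best_reward:
            best_z, best_reward = z, reward
    return best_z, best_reward


x = [0.1, 0.4, 0.5, 0.9, 1.2, 1.3]
y = [1.0, 1.1, 0.9, 3.0, 3.2, 2.9]
z, reward = best_split(x, y)   # the split separates the low and high responses
```

In a full tree this search would be repeated for each of the m randomly selected features, and the feature with the largest reward would define the split.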

Several other approaches have been proposed for tree construction such as applying Principal Component Analysis (PCA)¹⁹ to the response matrix¹³. The principal components (PC) not only serve the purpose of dimensionality reduction but are also expected to increase the robustness of the trees. Here, the node cost used to build the trees is given by

$$D({\eta }_{P})=\sum _{i\in {\eta }_{P}}\,{(\zeta (i)-\bar{\zeta }(r))}^{T}\,(\zeta (i)-\bar{\zeta }(r))$$

(3)

where ζ(i) denotes a PC based response vector and $\bar{\zeta }(r)$ is the mean vector of PCs¹⁴. Yu et al.¹³ have also considered the use of basis functions to represent the response variables with the node cost written as

$$D({\eta }_{P})=\sum _{i\in {\eta }_{P}}\,{({\bf{c}}(i)-{\mu }_{c}({\eta }_{P}))}^{T}\,{\rm{\Phi }}({\bf{c}}(i)-{\mu }_{c}({\eta }_{P}))$$

(4)

where c(i) denotes the vector of basis coefficients, ${\mu }_{c}({\eta }_{P})={\mathbb{E}}[{\bf{c}}(i)]$ and Φ denotes the matrix of basis vector inner products¹⁴.

Forest Prediction

Using the randomized feature selection process, we fit the tree based on bootstrap samples $\{({{\bf{X}}}_{1},{Y}_{1}),({{\bf{X}}}_{2},{Y}_{2}),\ldots ,({{\bf{X}}}_{n},{Y}_{n})\}$ from training data. Let us consider the prediction based on a test sample x for the tree Θ. Let $\tilde{\gamma }({\bf{x}},{\rm{\Theta }})$ be the partition containing x; the tree response then takes the following form^18,20,21 with corresponding weights w_i(x, Θ)

$$y({\bf{x}},{\rm{\Theta }})=\sum _{i=1}^{n}\,{w}_{i}({\bf{x}},{\rm{\Theta }})\,y(i)$$

(5)

$${w}_{i}({\bf{x}},{\rm{\Theta }})=\frac{{{\bf{1}}}_{\{{{\bf{x}}}_{tr}(i)\in \tilde{\gamma }({\bf{x}},{\rm{\Theta }})\}}}{\#\{r:{{\bf{x}}}_{tr}(i)\in \tilde{\gamma }({{\bf{x}}}_{tr}(r),{\rm{\Theta }})\}}$$

(6)

Let the T trees of RF be denoted by ${{\rm{\Theta }}}_{1},{{\rm{\Theta }}}_{2},\ldots ,{{\rm{\Theta }}}_{T}$ and let w_i(x) be the average weight over the forest. Then, the RF prediction for the test sample x is the weighted average of the training responses in Eq. (8), using the forest-averaged weights in Eq. (7).

$${w}_{i}({\bf{x}})=\frac{1}{T}\,\sum _{j=1}^{T}\,{w}_{i}({\bf{x}},{{\rm{\Theta }}}_{j})$$

(7)

$$\hat{y}({\bf{x}})=\sum _{i=1}^{n}\,{w}_{i}\,({\bf{x}})\,y(i)$$

(8)
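The weighted-average view of Eqs. (5) to (8) can be made concrete with a small sketch. It assumes each tree is summarized only by the leaf label assigned to every training sample and to the test sample, and that the test sample's leaf contains at least one training sample; the names `tree_weights` and `forest_predict` are hypothetical.

```python
def tree_weights(train_leaves, test_leaf):
    """Eq. (6): equal weights for training samples sharing the test sample's leaf."""
    in_leaf = [1.0 if leaf == test_leaf else 0.0 for leaf in train_leaves]
    count = sum(in_leaf)  # assumed > 0, i.e. the leaf is not empty
    return [w / count for w in in_leaf]


def forest_predict(train_leaves_per_tree, test_leaf_per_tree, y):
    """Eqs. (7)-(8): average the per-tree weights, then weight the responses."""
    n, T = len(y), len(train_leaves_per_tree)
    w = [0.0] * n
    for train_leaves, test_leaf in zip(train_leaves_per_tree, test_leaf_per_tree):
        for i, wi in enumerate(tree_weights(train_leaves, test_leaf)):
            w[i] += wi / T
    prediction = sum(wi * yi for wi, yi in zip(w, y))
    return w, prediction


y = [1.0, 2.0, 3.0, 4.0]
train_leaves_per_tree = [[0, 0, 1, 1],   # tree 1: leaf label of each training sample
                         [0, 1, 1, 1]]   # tree 2
test_leaf_per_tree = [0, 1]              # leaf reached by the test sample in each tree
w, y_hat = forest_predict(train_leaves_per_tree, test_leaf_per_tree, y)
```

Here tree 1 predicts the mean of the first two responses (1.5) and tree 2 the mean of the last three (3.0), so the forest prediction is their average, 2.25, and the forest weights sum to one.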

Multivariate Random Forest

Multivariate Random Forest (MRF)¹⁰ is the extension of the regular RF for joint prediction of multivalued output responses that can be useful in different response scenarios. The primary difference between MRF and the regular RF is in the tree generation step where the node cost is different from $D({\eta }_{P})$ in Eq. (1). In a multivariate output scenario, the node cost should measure the distance between each sample response vector and the multivariate mean at the node, which can be achieved by summing the squared Mahalanobis distances.

$$\begin{array}{rcl}{D}_{MRF}({\eta }_{P}) & = & \sum _{i\in {\eta }_{P}}\,{({\bf{y}}(i)-\mu ({\eta }_{P}))}^{T}\,{{\rm{\Sigma }}}^{-1}\,({\bf{y}}(i)-\mu ({\eta }_{P}))\\ {\rm{where}}\,{\bf{y}}(i) & = & [y(i,1)\,y(i,2)\,\cdots \,y(i,m)]\end{array}$$

(9)

where Σ is the covariance matrix, m denotes the number of response points, and $\mu ({\eta }_{P})={\mathbb{E}}[{\bf{y}}(i\in {\eta }_{P})]$. The inverse covariance matrix Σ⁻¹ is a precision matrix that provides a measure of conditional dependence between multiple random variables. For our analysis, we consider MRF modeling on 8 dose-response points similar to our earlier published study¹⁵.

Functional Random Forest

Regular classification and regression trees (CART) work on non-functional variables, e.g., discrete gene expression values and summary metrics shown in Fig. 1. In this section, we consider incorporating functional responses (e.g., dose-response curves shown in right half of Fig. 1) for building functional random forest (FRF). For this purpose, we have introduced two novel alterations in the regression trees: first, in node cost calculation and second, in regression of the leaf node samples.

Node cost calculation

For the construction of regular regression tree based models, partitioning and accuracy measure for each node η_P is achieved using the deviance criterion in Eq. (1). However, this criterion only considers a single parameter (μ) of the drug sensitivity response while neglecting the shapes of the dose-response curves at each node. To incorporate the shape information of a dose-response curve into the deviance calculation, we propose to discretize the entire curve into multiple regions to calculate the node cost in each region separately and then sum the individual deviances to get the total deviance at each node, i.e.

$${\hat{D}}_{FRF}({\eta }_{P})=\sum _{j=1}^{q}\,{\hat{D}}_{r}({r}_{j})$$

(10)

where ${\hat{D}}_{r}({r}_{j})$ is the deviance calculated from the j-th region r_j, and q is the total number of regions. For the discretization scheme, we choose to discretize the coordinate values as appropriate for the observed data (e.g., we use the 8 given dose points to divide the dose-response curves into 8 regions in CCLE as compared to GDSC where we utilize the ~100 IC response values for discretization). Furthermore, we propose two distinct algorithms for node cost calculation where (i) either the observed dose-response points are used directly or, (ii) the underlying distribution is extracted from these points and various divergence criteria are applied.

Node cost calculation using dose-response points

For this approach, we use the observed dose-response data directly and assume the complete curve to be made up of multiple regions each belonging to an observed dose point or response point. Then, the total deviance at each node η_P is measured by calculating the SSD per region¹⁴ as a measure of ${\hat{D}}_{r}({r}_{j})$ and subsequently using (10).

$${\hat{D}}_{r}({r}_{j})=\sum _{i\in {\eta }_{P}}\,\parallel {y}_{j}(i)-{\bar{y}}_{j}{\parallel }^{2}$$

(11)

where y_j(i) denotes the response in region r_j at dose d_j for sample i, and ${\bar{y}}_{j}={\mathbb{E}}[{y}_{j}(i\in {\eta }_{P})]$. The criterion described in Eq. (11) considers the region-wise differences rather than the difference in an overall feature of the curve.
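A small sketch of the region-wise cost in Eqs. (10) and (11) shows why it is sensitive to curve shape. It assumes every curve is a list of responses on a shared dose grid with one region per dose point; the names `region_ssd` and `frf_node_cost` and the two toy curves are hypothetical.

```python
def region_ssd(curves, j):
    """Eq. (11): squared deviations of the responses in region j across samples."""
    values = [curve[j] for curve in curves]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def frf_node_cost(curves):
    """Eq. (10): sum of the region-wise deviances over all q regions."""
    if not curves:
        return 0.0
    q = len(curves[0])
    return sum(region_ssd(curves, j) for j in range(q))


# Two curves with the same average response but different shapes.
curve_a = [1.0, 1.0, 0.0, 0.0]   # sharp drop at high doses
curve_b = [0.5, 0.5, 0.5, 0.5]   # flat, moderate response
averages = [sum(c) / len(c) for c in (curve_a, curve_b)]
scalar_cost = sum((a - sum(averages) / 2) ** 2 for a in averages)  # Eq. (1) on the averages
functional_cost = frf_node_cost([curve_a, curve_b])
```

A cost computed on a single summary value treats the two curves as identical (`scalar_cost` is zero), whereas the region-wise cost is positive and therefore favors splits that separate them.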

Node cost calculation using dose-response distributions

In the previous approach, each region consists of ${n}_{P}={\sum }_{i\in {\eta }_{P}}\,1$ response points (i.e., the number of cell lines examined for the applied drug) at a specific dose d_j and these discrete responses are used to compute the node deviance in (10). However, if a study performs multiple experiments at a certain dose for each individual cell line (i.e., technical replicates), we can potentially generate a distribution from all the replicates at that specific dose. Therefore, instead of considering a single response value y_j(i) for cell line i at dose d_j, we can alternatively calculate the node cost by approximating the response by a probability distribution, Φ_i. The modified splitting criterion for this scenario is given by

$${\hat{D}}_{r}({r}_{j})=\sum _{i\in {\eta }_{P}}\,{C}_{f}({{\rm{\Phi }}}_{i},\hat{{\rm{\Phi }}})$$

(12)

$${\rm{where}}\,{C}_{f}({{\rm{\Phi }}}_{i},\hat{{\rm{\Phi }}})=\sum _{{\rm{\Omega }}}\,\hat{{\rm{\Phi }}}{f}_{j}(\frac{{{\rm{\Phi }}}_{i}}{\hat{{\rm{\Phi }}}})$$

(13)

Here, ${C}_{f}(\,\cdot \,,\cdot \,)$ is called the f-divergence of the probability distribution, Ω is the distribution range, and $\hat{{\rm{\Phi }}}$ is the mean distribution at node η_P derived using mixture distribution⁹. There are various ways to calculate the f-divergence depending on the divergence measure f_j(u) in Eq. (13). For instance, the Kullback-Leibler (KL) divergence²² is obtained with ${f}_{j}(u)=u\,\mathrm{ln}(u)$

$${K}_{f}({{\rm{\Phi }}}_{i},\hat{{\rm{\Phi }}})=\sum _{{\rm{\Omega }}}\,{{\rm{\Phi }}}_{i}\,\mathrm{ln}(\frac{{{\rm{\Phi }}}_{i}}{\hat{{\rm{\Phi }}}})$$

(14)

And, the Hellinger Distance²³ is generated using ${f}_{j}(u)={(\sqrt{u}-1)}^{2}$

$${H}_{f}({{\rm{\Phi }}}_{i},\hat{{\rm{\Phi }}})=\sum _{{\rm{\Omega }}}\,{(\sqrt{{{\rm{\Phi }}}_{i}}-\sqrt{\hat{{\rm{\Phi }}}})}^{2}$$

(15)
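The relationship between the general form in Eq. (13) and the two special cases in Eqs. (14) and (15) can be checked numerically. The sketch below assumes the distributions are discretized on a shared grid of bins and that the node mean is an equally weighted mixture; the function names and the toy distributions are hypothetical.

```python
import math


def f_divergence(phi_i, phi_hat, f):
    """Eq. (13) on a shared discrete grid: sum of phi_hat * f(phi_i / phi_hat)."""
    return sum(q * f(p / q) for p, q in zip(phi_i, phi_hat) if q > 0)


def kl_generator(u):
    """f(u) = u ln(u); the limit as u -> 0 is 0."""
    return u * math.log(u) if u > 0 else 0.0


def hellinger_generator(u):
    """f(u) = (sqrt(u) - 1)^2."""
    return (math.sqrt(u) - 1.0) ** 2


def mixture_mean(distributions):
    """Equally weighted mixture of the sample distributions at a node."""
    n = len(distributions)
    return [sum(column) / n for column in zip(*distributions)]


phi_1 = [0.7, 0.2, 0.1]
phi_2 = [0.1, 0.2, 0.7]
phi_hat = mixture_mean([phi_1, phi_2])

kl = f_divergence(phi_1, phi_hat, kl_generator)
kl_direct = sum(p * math.log(p / q) for p, q in zip(phi_1, phi_hat))            # Eq. (14)
hel = f_divergence(phi_1, phi_hat, hellinger_generator)
hel_direct = sum((math.sqrt(p) - math.sqrt(q)) ** 2 for p, q in zip(phi_1, phi_hat))  # Eq. (15)

# Region deviance as in Eq. (12): sum of divergences over the samples at the node.
region_cost = sum(f_divergence(phi, phi_hat, kl_generator) for phi in (phi_1, phi_2))
```

Both generator-based values agree with the direct formulas, which confirms that Eqs. (14) and (15) are instances of Eq. (13).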

Functional regression using dose-response curves

Regular regression tree response for a new sample is based on averaging the responses in the leaf node reached by the new sample. Since the responses considered in a regular regression tree are individual points, a simple averaging of the values suffices. For our FRF scenario, each leaf node consists of a set of functional responses and therefore, we need to modify the final prediction as described below.

Given that we have dose-response points, we can potentially fit a spline curve through these points to represent the dose-response as a continuous curve. In recent pharmacological studies, the curve fitting normally consists of sigmoidal, linear or constant functions¹. In our algorithm, we have considered the generalized B-spline fitting for the dose-response curves. To perform Functional Random Forest (FRF) prediction using the spline-fitted curves, we store the curve points for each sample in the leaf nodes instead of a specific feature (i.e., IC₅₀ or AUC). In the prediction step, for a test sample x, we consider the training response set ${{\bf{y}}}_{j}={y}_{j}(i\in {\eta }_{P})$ at each dose d_j separately from the stored dose-response curves in node η_P and fit a Gaussian distribution N_j. The mode of this distribution (i.e., peak) indicates the highest response probability for x at d_j and we pick the corresponding response value ${\hat{y}}_{j}$ as our final prediction.

$$\begin{array}{rcl}{{\bf{y}}}_{j} & \sim & {N}_{j}({\bf{y}};\,{\mu }_{j},{\sigma }_{j}^{2})\,{\rm{where}}\,{\mu }_{j}\,:\,={\mathbb{E}}[{{\bf{y}}}_{j}],\,{\sigma }_{j}^{2}\,:\,=\mathrm{Var}[{{\bf{y}}}_{j}]\\ {\hat{y}}_{j}({\bf{x}}) & = & {\rm{\arg }}\mathop{{\rm{\max }}}\limits_{y}\,{N}_{j}({\bf{y}};\,{\mu }_{j},{\sigma }_{j}^{2})\end{array}$$

(16)

The process is then repeated for all dose levels to generate the functional prediction, $\hat{{\bf{y}}}({\bf{x}})$. Figure 2 illustrates a representative case where the different response probability distributions are displayed for multiple dose levels. Here, the asterisks (*) on the 3D surface denote the distribution modes at different doses that are used to perform the functional prediction. Subsequently, we can use this predicted curve to estimate the conventional drug sensitivity measures such as AUC, IC₅₀ and EC₅₀.
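The leaf-node prediction in Eq. (16) can be sketched as below. Since the mode of a Gaussian coincides with its mean, the per-dose prediction reduces to the sample mean of the stored responses at that dose. The trapezoidal AUC on the supplied dose coordinates is an assumption made for illustration (the source does not state the integration rule or dose scale), and the function names are hypothetical.

```python
import statistics


def fit_leaf_gaussians(leaf_curves):
    """Fit one Gaussian per dose: return a list of (mean, variance) pairs."""
    return [(statistics.fmean(col), statistics.pvariance(col))
            for col in zip(*leaf_curves)]


def predict_curve(leaf_curves):
    """Eq. (16): the mode of each fitted Gaussian, which equals its mean."""
    return [mu for mu, _ in fit_leaf_gaussians(leaf_curves)]


def auc_trapezoid(doses, responses):
    """Area under the predicted curve by the trapezoidal rule (assumed here)."""
    pairs = list(zip(doses, responses))
    return sum((d1 - d0) * (r0 + r1) / 2.0
               for (d0, r0), (d1, r1) in zip(pairs, pairs[1:]))


# Stored curves of three training samples in a leaf, on three dose levels.
leaf_curves = [[1.0, 0.8, 0.4],
               [0.9, 0.6, 0.2],
               [0.8, 0.7, 0.3]]
doses = [0.0, 1.0, 2.0]                 # illustrative dose coordinates
y_hat = predict_curve(leaf_curves)      # approximately [0.9, 0.7, 0.3]
auc = auc_trapezoid(doses, y_hat)       # approximately 1.3
```

Keeping the fitted variance alongside the mean also gives a per-dose measure of spread, which is what the distributions in Fig. 2 visualize.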

Function-to-function regression with FRF

Drug sensitivity predictive algorithms normally train regression models on genomic characterizations represented by stationary values such as pre-treatment gene expression (Fig. 1). However, if gene (or protein) expression can be measured post drug application at different doses and/or various time points, the input variables can be modeled as curves representing the dose-expression functions at the corresponding dose points. An example of such functional data is shown in lower left half of Fig. 1 where the functional input-output data is obtained from the HMS-LINCS^16,17 database. In this section, we consider a scenario where the HMS-LINCS protein expressions following drug administration is available along with the resulting dose-responses in terms of cell viability.

Here, we consider a couple of ways to convert the functional data into functional features which are eventually used as model inputs. Similar to the drug sensitivity summary metrics generated from the dose-response curves, we can use the genomic characterization curve to extract features such as AUC and IC₅₀. For calculating AUC, a reference line (similar to the zero viability line for drug sensitivity) is required and we utilize the available DMSO-treated control RPPA data¹⁶ for this purpose. Figure S1 displays a representative dose-expression curve post drug application with the DMSO-treated control line where the shaded area in between is the desired AUC. For this representative protein (p-S6), the expression values decrease as the dose level increases, which is the most common scenario. However, for a few cases, the protein expressions either remain nearly unchanged or go up as dose increases. For such proteins, we only consider the expression values below our reference DMSO-treated control line (Fig. 3). Along with AUC, we also calculate different IC values, i.e., IC₂₅, IC₅₀ and IC₇₅, to be considered as predictor features. To arrive at the IC values, we perform third-degree polynomial fitting on the observed protein expression data at different doses and record the different IC values using the corresponding percentile points between the lowest and highest expression values (e.g., IC₂₅ is the dose where the 25th percentile point is located). Figure 3 illustrates three representative protein expression fitted curves with corresponding IC₂₅, IC₅₀ and IC₇₅ points demonstrating the different behaviors described above, i.e., expression values are either (a) mostly decreasing, (b) almost unchanged, or (c) mostly increasing with dose.
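The AUC feature relative to the DMSO-treated control can be sketched as the area between the control line and the part of the expression curve lying below it. The version below is an approximation under stated assumptions: the control is a single constant level, the gap is clipped at the observed dose points (the exact crossing point is not located), and integration uses the trapezoidal rule on the supplied dose coordinates. The function name and the toy values are hypothetical; the IC features from the polynomial fit are not covered here.

```python
def auc_below_control(doses, expression, control):
    """Area between a constant control level and the expression values below it."""
    # Only expression values below the control contribute; values above are clipped to 0.
    gaps = [max(control - e, 0.0) for e in expression]
    points = list(zip(doses, gaps))
    return sum((d1 - d0) * (g0 + g1) / 2.0
               for (d0, g0), (d1, g1) in zip(points, points[1:]))


doses = [0.0, 1.0, 2.0, 3.0]              # illustrative dose coordinates
decreasing = [1.0, 0.8, 0.5, 0.5]         # expression falls with dose
increasing = [1.0, 1.2, 1.5, 1.6]         # expression rises with dose
auc_dec = auc_below_control(doses, decreasing, control=1.0)   # approximately 0.95
auc_inc = auc_below_control(doses, increasing, control=1.0)   # 0.0, nothing below control
```

With this convention a protein whose expression never drops below the control contributes a zero AUC feature, matching the rule of considering only values below the reference line.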

Another way of extracting the functional curve features is to rank the curves according to their slopes (i.e., rate of change). Furthermore, a curve can be ranked by its position compared to the other curves, i.e., if a curve contains >50% dose points with higher protein expression values compared to another curve, the former will get a higher rank than the latter and the process will go on until all curves are ranked.
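One possible reading of the position-based ranking is sketched below. The pairwise rule (higher at more than 50% of dose points) follows the text; aggregating the pairwise comparisons by counting wins is an assumption made here, since the source does not specify how a full ordering is obtained. The function names and toy curves are hypothetical.

```python
def dominates(a, b):
    """True if curve a is higher than curve b at more than 50% of the dose points."""
    higher = sum(1 for ai, bi in zip(a, b) if ai > bi)
    return higher > len(a) / 2


def rank_by_position(curves):
    """Return curve indices ordered from highest to lowest rank by pairwise wins."""
    wins = [sum(dominates(curve, other)
                for k, other in enumerate(curves) if k != idx)
            for idx, curve in enumerate(curves)]
    return sorted(range(len(curves)), key=lambda k: wins[k], reverse=True)


curves = [
    [0.1, 0.2, 0.1, 0.2],   # index 0: lowest curve
    [0.5, 0.4, 0.5, 0.1],   # index 1: middle curve
    [0.9, 0.8, 0.3, 0.9],   # index 2: highest curve
]
order = rank_by_position(curves)   # [2, 1, 0]
```

Pairwise majority comparisons are not guaranteed to be transitive, so ties in the win count can occur; `sorted` then keeps the original input order among tied curves.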

Accession codes

Source code for Functional Random Forest is available at: https://github.com/razrahman/Functional-Random-forest.

Results

In this section, we apply Functional Random Forest modeling on both synthetic and experimental datasets for performance evaluation and comparison analysis with both univariate and Multivariate Random Forest models.

Application of FRF on synthetic data

We first evaluate the performance of FRF using a synthetic experiment. The design matrix has been generated by extracting 10 different features from five different clusters. Each cluster is derived from a Gaussian distribution and the range of the distribution for each cluster has limited overlap with others. Furthermore, we add 10 additional noise features to increase the correlation between samples from different clusters. Subsequently, we have a design matrix of size 75 × 20 (15 samples each from 5 clusters and 20 covariates with 10 relevant & 10 spurious features). For the output, we create a target matrix of size 75 × 101 where 101 is the number of different synthetic dose levels. The response values are sampled from the 4-parameter sigmoidal model¹ in Eq. (17) and shown in Fig. 4 for both noiseless and noisy cases, i.e.

$$y(d)={A}_{0}+\frac{{A}_{{\rm{\max }}}-{A}_{0}}{1+{(\frac{I{C}_{50}}{d})}^{\theta }}$$

(17)

where A₀, A_max & θ are fixed but IC₅₀ differs slightly for each curve in a certain cluster while d is the applied dose level. We also look into the effect of additive noise in targets as shown in Fig. 4 where (a) displays the target curves without noise, and (b) displays the targets with 5% additive noise. Table 1 shows the performance of FRF as compared to regular RF for different numbers of trees, folds and noise levels (%). From Table 1, we observe that FRF displays an overall superior performance to RF in all cases, especially improving the model performance by as much as 25% as the noise level increases. A potential reason for this performance boost is the ability of FRF to incorporate the shape of the response curves, as shown in Fig. 5(a) where FRF is able to follow a noisy synthetic data curve which RF fails to predict, especially for higher doses.
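The synthetic responses of Eq. (17) can be generated along the following lines. The dose grid (101 log-spaced levels), the parameter values, and the noise model (Gaussian noise with standard deviation equal to the noise fraction times the response range) are assumptions chosen for illustration; the source specifies only the sigmoidal form, the 101 dose levels and the noise percentage.

```python
import random


def sigmoid_response(d, a0, a_max, ic50, theta):
    """Four-parameter sigmoidal model of Eq. (17)."""
    return a0 + (a_max - a0) / (1.0 + (ic50 / d) ** theta)


def synthetic_curve(doses, a0, a_max, ic50, theta, noise_fraction=0.0, rng=None):
    """One response curve, optionally with additive Gaussian noise (assumed model)."""
    rng = rng or random.Random(0)
    sd = noise_fraction * abs(a_max - a0)
    return [sigmoid_response(d, a0, a_max, ic50, theta) + rng.gauss(0.0, sd)
            for d in doses]


# 101 log-spaced dose levels from 1e-3 to 1e3 (illustrative grid).
doses = [10 ** (-3 + 6 * k / 100) for k in range(101)]
clean = synthetic_curve(doses, a0=1.0, a_max=0.0, ic50=1.0, theta=2.0)
noisy = synthetic_curve(doses, a0=1.0, a_max=0.0, ic50=1.0, theta=2.0,
                        noise_fraction=0.05, rng=random.Random(1))
```

With these parameters the curve starts near A₀ at low doses, passes through the midpoint of A₀ and A_max at d = IC₅₀, and approaches A_max at high doses; curves within a cluster would differ only by small changes in `ic50`.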

Table 1 Normalized Mean Absolute Errors (NMAE) for prediction of synthetic data dose-responses with varying noise levels using RF and FRF.

Full size table

Application of FRF on biological data

For performance evaluation of Functional Random Forest using actual biological data, we have used three different sources: CCLE, GDSC and HMS-LINCS. The sections below provide the results and corresponding discussion for all three databases.

Application on CCLE dataset

CCLE provides cell line sensitivity data with 7 to 8 dose-response points. For our analysis, we consider only the cell lines with 8 points and thus have 8 different regions for node cost calculation in Eq. 10. Tables 2 and 3 display the predictive performance of FRF for both node cost calculation algorithms, i.e., using observed dose-response points and using underlying distributions. For node cost calculation using distributions, we provide results for both the KL divergence and Hellinger distance measures in Eqs (14 and 15). Additionally, we compare the results from the FRF models with the standard RF methodology. Tables 2 and 3 provide overall performance comparisons for three different models: (a) regular Random Forest (RF), (b) Functional Random Forest with conventional averaging at the leaf node (FRFL), and (c) Functional Random Forest with averaging of the dose-response curves at the leaf node (FRF). Note that FRF uses the functional curves for both node cost evaluation and response prediction at the leaf nodes, whereas FRFL uses the functional curves for node cost evaluation only and generates the prediction by conventional averaging of a specific summary metric (e.g., IC₅₀ or AUC) stored at the leaf node. All results are reported for 5-fold cross-validation with 150 trees in each model, 10 features for node splitting (m = 10) and a minimum leaf size of 10. Both functional approaches (i.e., FRFL and FRF) perform better than the regular RF model in all the presented scenarios. We also compare the results with a different set of parameters, which supports the same conclusion. Figure 5(b) shows a representative example of both FRF and RF prediction for a case where the responses change gradually across doses. Although neither FRF nor RF performs particularly well in general, the FRF prediction still outperforms the RF prediction, especially at higher doses.
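To make the FRF/FRFL distinction concrete, the following sketch contrasts the two leaf-prediction rules on three made-up 8-point viability curves. It is only an illustration under stated assumptions: the dose grid, the curves, and the helper names (`trapezoid_auc`, `ic50_by_interpolation`) are hypothetical, and the linearly interpolated IC₅₀ stands in for whatever curve fitting the actual pipeline uses.

```python
def trapezoid_auc(xs, ys):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
               for i in range(len(xs) - 1))

def ic50_by_interpolation(log_doses, viability, level=0.5):
    """Log-dose where the curve first drops below `level` (linear interpolation)."""
    for i in range(len(log_doses) - 1):
        y0, y1 = viability[i], viability[i + 1]
        if y0 >= level > y1:
            frac = (y0 - level) / (y0 - y1)
            return log_doses[i] + frac * (log_doses[i + 1] - log_doses[i])
    return float("nan")  # the curve never crosses the level

# Hypothetical 8-point dose grid (log10 dose) and three samples in one leaf.
log_doses = [-2.6, -2.1, -1.6, -1.1, -0.6, -0.1, 0.4, 0.9]
leaf_curves = [
    [1.00, 0.98, 0.95, 0.85, 0.60, 0.35, 0.20, 0.15],
    [1.00, 1.00, 0.99, 0.97, 0.92, 0.80, 0.55, 0.30],
    [0.99, 0.90, 0.70, 0.45, 0.30, 0.22, 0.18, 0.15],
]
n_leaf = len(leaf_curves)

# FRF-style leaf prediction: average the curves pointwise, then summarize.
mean_curve = [sum(col) / n_leaf for col in zip(*leaf_curves)]
frf_ic50 = ic50_by_interpolation(log_doses, mean_curve)
frf_auc = trapezoid_auc(log_doses, mean_curve)

# FRFL-style leaf prediction: summarize each curve, then average the metric.
frfl_ic50 = sum(ic50_by_interpolation(log_doses, c) for c in leaf_curves) / n_leaf
frfl_auc = sum(trapezoid_auc(log_doses, c) for c in leaf_curves) / n_leaf

print(f"IC50 (log-dose): FRF-style {frf_ic50:.3f}, FRFL-style {frfl_ic50:.3f}")
print(f"AUC: FRF-style {frf_auc:.3f}, FRFL-style {frfl_auc:.3f}")
```

In this simplified setting the trapezoidal AUC is linear in the curve, so both rules return the same AUC, while the nonlinear IC₅₀ summary differs between them. Summaries derived from a nonlinearly fitted curve would generally differ as well.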

Table 2 Comparison of predictive performance for AUC from three different approaches: RF, FRFL and FRF with two different model constructions using CCLE data.

Table 3 Comparison of predictive performance for AUC from three different approaches: RF, FRFL and FRF using CCLE data.

Note that Table 3 treats the dose-responses as probability distributions generated from the mean and standard deviation (SD) of the responses provided by CCLE. We fit a Gaussian distribution using the provided mean and SD of the responses at each dose point. The mean distribution at a node is calculated under a Gaussian mixture assumption. Note that Tables 2 and 3 report measures for only 5 representative drugs. Table S1 provides the results for all 24 CCLE drugs.
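As an illustration of the quantities involved, the sketch below gives the standard closed forms for the KL divergence and Hellinger distance between two univariate Gaussians, together with the first two moments of an equal-weight Gaussian mixture. These are textbook formulas, not a reproduction of Eqs (14 and 15). The function names and the idea of summarizing the node mixture by its mean and SD are assumptions made for this example.

```python
import math

def kl_gaussian(mu1, sd1, mu2, sd2):
    """KL(N(mu1, sd1^2) || N(mu2, sd2^2)) in closed form."""
    return math.log(sd2 / sd1) + (sd1**2 + (mu1 - mu2)**2) / (2 * sd2**2) - 0.5

def hellinger_gaussian(mu1, sd1, mu2, sd2):
    """Hellinger distance between two univariate Gaussians (value in [0, 1])."""
    var_sum = sd1**2 + sd2**2
    # Bhattacharyya coefficient of the two densities.
    bc = math.sqrt(2 * sd1 * sd2 / var_sum) * math.exp(-(mu1 - mu2)**2 / (4 * var_sum))
    return math.sqrt(max(0.0, 1.0 - bc))

def mixture_moments(means, sds):
    """Mean and SD of an equal-weight mixture of Gaussians."""
    n = len(means)
    mix_mean = sum(means) / n
    second_moment = sum(s**2 + m**2 for m, s in zip(means, sds)) / n
    return mix_mean, math.sqrt(second_moment - mix_mean**2)

# Hypothetical responses of three cell lines at a single dose point.
means, sds = [0.80, 0.65, 0.90], [0.05, 0.08, 0.04]
node_mean, node_sd = mixture_moments(means, sds)

# Distance of each sample's distribution from the node summary (assumed use).
for m, s in zip(means, sds):
    print(round(kl_gaussian(m, s, node_mean, node_sd), 4),
          round(hellinger_gaussian(m, s, node_mean, node_sd), 4))
```

Unlike the KL divergence, the Hellinger distance is symmetric and bounded, which may matter when distances from many samples are summed into a single node cost.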

Both Tables 2 and 3 show the performance measures for 5-fold cross-validation. To demonstrate the robustness of our FRF model compared to RF, we also perform our analysis using bootstrap samples of the CCLE data. Considering the total number of samples available for each drug, we extract 50 bootstrap sets of samples, build individual FRF and RF models for each set, and then perform sensitivity prediction using the built models. Figure 6 illustrates the distributions of the differences between the MAE values of the FRF and RF model predictions against the number of bootstrap samples for four representative drugs (Fig. S2 provides these distributions for all 24 CCLE drugs). For the majority of the sets, the MAE of FRF is lower than that of RF, yielding negative values on the x-axis. These distributions demonstrate the superior predictive performance and robustness of FRF as compared to a standard RF. Additionally, Table 4 compares the performance of FRF with that of an MRF model and shows the overall superior performance of FRF over MRF for the 8 dose points.
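The bootstrap comparison can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the text does not state how each bootstrap set is evaluated, so the sketch assumes out-of-bag samples serve as the test set, and it uses two trivial stand-in models (`fit_line`, `fit_mean`) in place of FRF and RF. All names and data are hypothetical.

```python
import random

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def bootstrap_mae_differences(X, y, fit_a, fit_b, n_sets=50, seed=0):
    """MAE(model A) - MAE(model B) on the out-of-bag samples of each bootstrap set."""
    rng = random.Random(seed)
    n = len(y)
    diffs = []
    for _ in range(n_sets):
        train = [rng.randrange(n) for _ in range(n)]     # sample with replacement
        in_bag = set(train)
        test = [i for i in range(n) if i not in in_bag]  # out-of-bag indices
        if not test:
            continue
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        predict_a, predict_b = fit_a(X_tr, y_tr), fit_b(X_tr, y_tr)
        y_te = [y[i] for i in test]
        diffs.append(mae(y_te, [predict_a(X[i]) for i in test])
                     - mae(y_te, [predict_b(X[i]) for i in test]))
    return diffs

def fit_mean(X, y):
    """Stand-in baseline: always predict the training mean."""
    m = sum(y) / len(y)
    return lambda x: m

def fit_line(X, y):
    """Stand-in alternative: one-dimensional least-squares line."""
    mx, my = sum(X) / len(X), sum(y) / len(y)
    sxx = sum((x - mx) ** 2 for x in X)
    slope = sum((x - mx) * (v - my) for x, v in zip(X, y)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

data_rng = random.Random(1)
X = [i / 10 for i in range(60)]
y = [0.5 * x + data_rng.gauss(0, 0.2) for x in X]

diffs = bootstrap_mae_differences(X, y, fit_line, fit_mean)
share_negative = sum(d < 0 for d in diffs) / len(diffs)
print(f"{share_negative:.0%} of bootstrap sets favour model A")
```

A histogram of `diffs` concentrated below zero corresponds to the kind of distribution described for Figure 6, with model A playing the role of FRF.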

Table 4 Comparison of predictive performances of FRF and MRF for 8 different dose points using CCLE data.

Application on GDSC dataset

To demonstrate the versatility of FRF model performance as compared to a traditional RF model, we performed the predictive analysis using another, larger publicly available database, GDSC. Instead of dose-response points, GDSC v5 provides 105 different IC points for dose-response values, extracted from response curves fitted with sigmoidal functions⁵ and extrapolated to reach 100% cellular viability. This extrapolation causes the dose values for IC₉₀ or IC₁₀₀ to be very high and therefore, we consider only the IC values indicating ≤80% viability in our models. We design a single FRF model to predict the complete dose-response curve from IC₁ to IC₈₀ and, from it, the AUC. RF cannot replicate this procedure, so we design 8 separate RF models to predict 8 different IC values at intervals of 10 (i.e., $IC_{10}, IC_{20}, \ldots, IC_{80}$) and one additional model to predict the AUC. Table 5 provides the MAE values measured at the 8 IC points and the AUC for both FRF and RF for 5 representative drugs (Table S2 provides the performance comparison for all 140 GDSC (v5) drugs). For all 5 drugs, FRF displays superior performance in predicting the different IC and AUC values as compared to RF. These results demonstrate the higher efficacy of FRF in the larger context of drug sensitivity prediction for various dose or response points.
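The relationship between a fitted sigmoidal curve and its IC points can be sketched with a two-parameter Hill curve. The parameterization, the parameter values, and the reading of IC_k as the dose giving k% inhibition are assumptions made for this example and may differ from GDSC's own curve model.

```python
def viability(dose, ic50, hill):
    """Two-parameter Hill curve: viability is 1 at dose 0 and 0.5 at ic50."""
    return 1.0 / (1.0 + (dose / ic50) ** hill)

def ic_dose(k, ic50, hill):
    """Dose at which the Hill curve reaches k% inhibition (0 < k < 100)."""
    return ic50 * (k / (100.0 - k)) ** (1.0 / hill)

def mae(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

ic_levels = list(range(10, 90, 10))  # IC10, IC20, ..., IC80

# Hypothetical "observed" and "predicted" curves for one cell line.
observed = [ic_dose(k, ic50=1.0, hill=1.2) for k in ic_levels]
predicted = [ic_dose(k, ic50=1.3, hill=1.0) for k in ic_levels]

print("MAE over the 8 IC points:", round(mae(observed, predicted), 4))
```

A single parametric curve yields IC values that are ordered by construction (IC₁₀ < IC₂₀ < … < IC₈₀), which eight independently trained per-IC models do not guarantee.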

Table 5 Comparison of predictive performance on GDSC dataset for multiple drug sensitivity measures (AUC and 8 IC values) using both RF and FRF.

Figure 7 illustrates the difference between the MAE values of FRF and RF predictions for mean IC and AUC values for 70 drugs from GDSC. For mean IC, FRF shows superior performance for 68 of the 70 drugs, while for AUC prediction FRF outperforms RF for 58 of the 70 drugs. These results support the conclusion reached from the CCLE data analysis that FRF provides higher predictive accuracy than a regular RF. Figure S3 provides the performance comparison for the rest of the 140 GDSC (v5) drugs.

Function-to-function regression using HMS-LINCS

As described earlier, the HMS-LINCS database provides functional data for input proteomic expressions (for 21 proteins) and output cellular viability¹⁶,¹⁷ after application of 5 different drugs at 7 different doses in 10 melanoma cell lines at multiple time points. For our analysis, we only use the 48-hour data since it contains complete records for both input and output. Thus, we have 50 samples in total with 143 predictors (i.e., 21 × 7 − 4 = 143, since we exclude 4 proteins due to missing values). The detailed description of the data extraction framework is provided in section Function-to-function regression with FRF with a pictorial representation in Fig. 8. For our function-to-function regression using FRF, we either consider the 143 predictors directly as input features, or extract features of the 3rd-degree polynomial-fitted dose-expression curves to use as predictors. As the curve features, we estimate 3 different IC points at IC₂₅, IC₅₀ and IC₇₅ and the overall AUC, as shown in Figs 3 & S1 for all 21 proteins. Table 6 displays the function-to-function regression results for 3 different input scenarios using FRF. We compare these performances with those of dose-wise standard RF models using the 143 expression values as input features for the 50 samples. From Table 6, we observe that FRF provides superior performance as compared to RF for all 3 scenarios, while the use of curve IC features provides the highest reduction (~20%) in prediction error. These results demonstrate the potential of FRF to enhance predictive modeling performance by utilizing the functional input curve features.
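A minimal sketch of the curve-feature extraction is given below, assuming NumPy is available. It fits a 3rd-degree polynomial to one protein's 7-dose expression profile and derives an AUC and three IC-like points. The definition used for the IC points of an expression curve (the dose at which the fitted curve is closest to 25%, 50% or 75% of its fitted range) is an assumption for illustration, as are the function name and the example profile. The definitions in Figs 3 & S1 may differ.

```python
import numpy as np

def curve_features(log_doses, expression, levels=(0.25, 0.50, 0.75), grid=2001):
    """AUC and IC-like points of a cubic fit to one dose-expression profile."""
    coeffs = np.polyfit(log_doses, expression, deg=3)  # highest power first
    lo, hi = float(np.min(log_doses)), float(np.max(log_doses))

    # AUC: exact integral of the fitted cubic over the observed dose range.
    antiderivative = np.polyint(coeffs)
    auc = float(np.polyval(antiderivative, hi) - np.polyval(antiderivative, lo))

    # IC-like points: grid dose whose fitted value is closest to the target level.
    xs = np.linspace(lo, hi, grid)
    fitted = np.polyval(coeffs, xs)
    f_min, f_max = float(fitted.min()), float(fitted.max())
    features = {"AUC": auc}
    for p in levels:
        target = f_min + p * (f_max - f_min)
        features[f"IC{int(round(p * 100))}"] = float(xs[np.argmin(np.abs(fitted - target))])
    return features

# Hypothetical expression profile of one protein at 7 doses (log10 scale).
log_doses = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0])
expression = np.array([0.10, 0.12, 0.20, 0.45, 0.70, 0.85, 0.90])
print(curve_features(log_doses, expression))
```

Applied to every protein, this replaces the 7 dose-wise expression values of a protein by 4 curve-level features.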

Table 6 Comparison of predictive performance of RF and FRF with functional data input from HMS-LINCS where AUC, IC₂₅, IC₅₀ & IC₇₅ values of proteomic dose-expression curves are used as input features.

Biological validation of the models

A potential model validation approach is to consider the variable importance measure (VIM) of the genes. We expect that a better model will have higher feature scores for the significant genes and will thus, in turn, have higher biological relevance. Typically in RF-based models, the VIM (or feature score) is calculated from the frequency of feature selection, out-of-bag errors, or permutation measures²⁴,²⁵. In this section, we use the frequency-based approach and calculate the VIM score from the number of times a gene is considered and the number of times it actually gets selected for splitting the nodes.

$$\mathrm{VIM}_{j}=\frac{\#\,\text{times gene } j \text{ is selected}}{\#\,\text{times gene } j \text{ is considered}}=\frac{m_{j}^{\mathrm{selected}}}{m_{j}^{\mathrm{picked}}}$$

(18)

For our FRF models, we have selected the parameter values #Trees = 500, m = 50 and minimum leaf size = 5 for a 5-fold cross-validation of the CCLE data. With these values, each of the 18,405 CCLE genes gets picked around 600 to 900 times, giving each a fair chance to contribute to the model. The top features of the models (i.e., genes with higher VIM scores) are then biologically validated by protein-protein interaction (PPI) network enrichment analysis.
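The frequency-based score in Eq. 18 can be computed from a log of node splits, as in the sketch below. The record format (a list of candidate genes considered at a node plus the gene finally selected) and the toy gene names are hypothetical. A real forest implementation would need to expose or log this information.

```python
from collections import Counter

def vim_scores(split_records):
    """Frequency-based VIM: times selected / times picked as a split candidate.

    split_records: iterable of (candidate_genes, selected_gene), one per node split.
    """
    picked, selected = Counter(), Counter()
    for candidates, chosen in split_records:
        picked.update(candidates)  # every candidate at this node counts as picked
        selected[chosen] += 1      # only the winning gene counts as selected
    return {gene: selected[gene] / picked[gene] for gene in picked}

# Toy log of four node splits over three hypothetical genes.
records = [
    (["GENE_A", "GENE_B", "GENE_C"], "GENE_A"),
    (["GENE_A", "GENE_C"], "GENE_C"),
    (["GENE_B", "GENE_C"], "GENE_C"),
    (["GENE_A", "GENE_B"], "GENE_A"),
]
scores = vim_scores(records)
top_genes = sorted(scores, key=scores.get, reverse=True)
print(scores, top_genes)
```

Normalizing by the number of times a gene was picked keeps the score comparable across genes that were offered as candidates a different number of times.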

A number of bioinformatics resources (e.g., STRING²⁶, GeneMANIA, DAVID etc.) are available for evaluating the number of observed PPIs in a set of selected genes. These interactions have been determined using prior knowledge and information from various interaction sources such as literature text-mining, experimental results, genomic/proteomic databases, gene co-expression, gene neighborhood, gene fusion and co-occurrence. For CCLE, we have used the Affymetrix HG-U133A mapping to convert the top features into the corresponding genes. These genes are then provided as inputs to the STRING database (http://string-db.org/) to extract the known PPI network. Table 7 shows the PPI analysis results against the entire genome with a minimum interaction score of 0.15 for the 5 previously considered drugs, for both FRF and the equivalent RF models. We observe a higher level of connectivity enrichment for the top 200 FRF features than for the top 200 RF features in terms of the PPI enrichment p-value and the ratio of observed to expected number of edges²⁷, possibly resulting from functional collaborations between the products of the FRF genes.

Table 7 Protein-protein interaction enrichment analysis for top 200 genes picked from RF and FRF using the whole genome statistical background with a minimum interaction score of 0.15.

Discussion

In this article, we have presented an enhancement to Random Forest modeling that can incorporate both static and functional inputs to predict a functional output. The ability to predict the complete functional dose-response profile can be instrumental in various scenarios. For instance, multiple dose-response curves can have similar values of the extracted features (i.e., AUC or IC₅₀) yet differ significantly in cytotoxicity or cell viability rate at higher doses. Figure 9 shows an example of this phenomenon, where two dose-response curves for two distinct cell lines in CCLE after AZD-6244 administration have almost the same AUC values (AUC₁ = 0.0945, AUC₂ = 0.095) but different rates of cell viability change at doses ≥0.25 μM. Figure 9 also demonstrates that FRF is capable of capturing the different response curve behaviors of the two cell lines.

Through application to both synthetic and actual biological data, we have established the superior performance of FRF in predicting dose-response curve summary metrics such as AUC and IC₅₀ as compared to a naïve Random Forest model trained on these metrics as output. Furthermore, FRF predicts the entire dose-response profile, incorporating the continuous nature of the curve that separate RF models for individual doses fail to capture. We have illustrated this behavior for the GDSC dataset by modeling 8 IC points using 8 different RFs to generate the dose-response profile, which performs worse than the continuous curve prediction from FRF (Table 5). Moreover, a major advantage of predicting a complete curve is the visualization of the changes in response across different doses. Figure 10 shows two representative cases, Curve⁽¹⁾ and Curve⁽²⁾, that have the same IC₅₀ value and similar AUC values but significantly different dose-response profiles. For instance, a small dose increase above the IC₅₀ will produce significantly higher sensitivity for Curve⁽¹⁾, whereas Curve⁽²⁾ will change minimally for dose increases above the IC₅₀ value. This behavior will not be captured if we only predict the AUC or IC₅₀ summary metric, as both curves have similar IC₅₀ and AUC values. This example illustrates the need for complete dose-response profile prediction in the larger context of drug sensitivity prediction.
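The point that identical summary metrics can hide very different profiles can be reproduced with two synthetic curves. The sketch below uses logistic viability curves in log-dose with the same IC₅₀ and different slopes. These are illustrative curves with assumed parameters, not the curves shown in Figure 10.

```python
import math

def hill_viability(log_dose, log_ic50, slope):
    """Logistic viability curve in log-dose; equals 0.5 at log_ic50."""
    return 1.0 / (1.0 + math.exp(slope * (log_dose - log_ic50)))

def trapezoid_auc(xs, ys):
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
               for i in range(len(xs) - 1))

log_ic50 = 0.0
grid = [-3.0 + 0.05 * i for i in range(121)]  # dose window symmetric about the IC50

curve_1 = [hill_viability(x, log_ic50, slope=4.0) for x in grid]  # steep
curve_2 = [hill_viability(x, log_ic50, slope=0.8) for x in grid]  # shallow

auc_1 = trapezoid_auc(grid, curve_1)
auc_2 = trapezoid_auc(grid, curve_2)

# Loss of viability for a half-log dose increase above the IC50.
drop_1 = 0.5 - hill_viability(0.5, log_ic50, slope=4.0)
drop_2 = 0.5 - hill_viability(0.5, log_ic50, slope=0.8)

print(f"AUC: {auc_1:.4f} vs {auc_2:.4f}")
print(f"Viability drop above IC50: {drop_1:.3f} vs {drop_2:.3f}")
```

Because a logistic curve is point-symmetric about its IC₅₀, the AUC over a window centered there is independent of the slope, so both summary metrics coincide while the response just above the IC₅₀ differs several-fold.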

Any regression tree based model has a number of adjustable parameters (e.g., minimum leaf size, maximum number of features used for a split, and number of trees in the forest) that we can change to get optimal performance, as illustrated in Table 2. Note that increasing the model complexity has a similar impact on both RF and FRF models, with FRF retaining its superior performance over RF but at a higher computational cost. However, we also observed several drugs in CCLE (e.g., 17-AAG, AZD-6244, Paclitaxel, PD-0325901) for which the prediction errors (MAE) of both FRF and RF are quite high. For these drugs, the dose-response points at different doses for the available cell lines are spread out, and the resulting fitted curves or summary metrics show significant variation that cannot be captured by any Random Forest based model, since such models employ a smoothing strategy (averaging) in the leaf nodes that yields estimates around the mean prediction. We are currently looking at different types of regression modeling to address this prediction bias. We also hope to extend this work by incorporating joint prediction of multiple correlated dose-response profiles while preserving the output dependency structure.

References

1. Barretina, J. et al. The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012).
2. Costello, J. C. et al. A community effort to assess and improve drug sensitivity prediction algorithms. Nature biotechnology 32, 1202–1212 (2014).
3. Wan, Q. & Pal, R. An ensemble based top performing approach for nci-dream drug sensitivity prediction challenge. PloS one 9, e101183 (2014).
4. Pal, R. Predictive Modeling of Drug Sensitivity (Academic Press, 2016).
5. Yang, W. et al. Genomics of drug sensitivity in cancer (gdsc): a resource for therapeutic biomarker discovery in cancer cells. Nucleic acids research 41, D955–D961 (2013).
6. Seashore-Ludlow, B. et al. Harnessing connectivity in a large-scale small-molecule sensitivity dataset. Cancer discovery 5, 1210–1223 (2015).
7. Sirski, M. On the statistical analysis of functional data arising from designed experiments. Ph.D. thesis, University of Manitoba (Canada) (2012).
8. Riddick, G. et al. Predicting in vitro drug sensitivity using random forests. Bioinformatics 27, 220–224 (2011).
9. Rahman, R., Haider, S., Ghosh, S. & Pal, R. Design of probabilistic random forests with applications to anticancer drug sensitivity prediction. Cancer informatics 14, 57 (2015).
10. Rahman, R., Otridge, J. & Pal, R. Integratedmrf: random forest-based framework for integrating prediction from different data types. Bioinformatics (Oxford, England) (2017).
11. Dhruba, S. R., Rahman, R., Matlock, K., Ghosh, S. & Pal, R. Dimensionality reduction based transfer learning applied to pharmacogenomics databases. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 1246–1249 (IEEE, 2018).
12. Ramsay, J. O. Functional data analysis (Wiley Online Library, 2006).
13. Yu, Y. & Lambert, D. Fitting trees to functional data, with an application to time-of-day patterns. Journal of Computational and graphical Statistics 8, 749–762 (1999).
14. Nerini, D. & Ghattas, B. Classifying densities using functional regression trees: Applications in oceanology. Computational Statistics & Data Analysis 51, 4984–4993 (2007).
15. Rahman, R. & Pal, R. Analyzing drug sensitivity prediction based on dose response curve characteristics. In Biomedical and Health Informatics (BHI), 2016 IEEE-EMBS International Conference on, 140–143 (IEEE, 2016).
16. Fallahi-Sichani, M. et al. Systematic analysis of brafv600e melanomas reveals a role for jnk/c-jun pathway in adaptive resistance to drug-induced apoptosis. Molecular Systems Biology 11, 797 (2015).
17. Matlock, K., Dhruba, S. R., Nazir, M. & Pal, R. An investigation of proteomic data for application in precision medicine. In Biomedical & Health Informatics (BHI), 2018 IEEE EMBS International Conference on, 377–380 (IEEE, 2018).
18. Breiman, L. Random forests. Machine learning 45, 5–32 (2001).
19. Wold, S., Esbensen, K. & Geladi, P. Principal component analysis. Chemometrics and intelligent laboratory systems 2, 37–52 (1987).
20. Meinshausen, N. Quantile regression forests. Journal of Machine Learning Research 7, 983–999 (2006).
21. Biau, G. Analysis of a random forests model. Journal of Machine Learning Research 13, 1063–1095 (2012).
22. Kullback, S. & Leibler, R. A. On information and sufficiency. The annals of mathematical statistics 22, 79–86 (1951).
23. Hellinger, E. Neue begründung der theorie quadratischer formen von unendlichvielen veränderlichen. Journal für die reine und angewandte Mathematik 136, 210–271 (1909).
24. Archer, K. J. & Kimes, R. V. Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis 52, 2249–2260 (2008).
25. Haider, S., Rahman, R., Ghosh, S. & Pal, R. A copula based approach for design of multivariate random forests for drug sensitivity prediction. PloS one 10, e0144490 (2015).
26. Szklarczyk, D. et al. String v10: protein–protein interaction networks, integrated over the tree of life. Nucleic acids research 43, D447–D452 (2014).
27. Taguchi, Y. Principal components analysis based unsupervised feature extraction applied to gene expression analysis of blood from dengue haemorrhagic fever patients. Scientific reports 7, 44016 (2017).

Acknowledgements

This work has been supported by NIH grant R01GM122084-01.

Author information

Authors and Affiliations

Texas Tech University, Department of Electrical and Computer Engineering, Lubbock, Texas, 79409, USA
Raziur Rahman, Saugato Rahman Dhruba & Ranadip Pal
Texas Tech University, Department of Mathematics and Statistics, Lubbock, Texas, 79409, USA
Souparno Ghosh

Contributions

R.R., S.G. and R.P. conceived of and designed the experiments. R.R. and S.R.D. performed the experiments. R.R. and R.P. analyzed the data. R.R., S.R.D. and R.P. wrote the paper. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Ranadip Pal.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary: Functional Random Forest with applications in dose response predictions

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Rahman, R., Dhruba, S.R., Ghosh, S. et al. Functional random forest with applications in dose-response predictions. Sci Rep 9, 1628 (2019). https://doi.org/10.1038/s41598-018-38231-w

Received: 29 May 2018
Accepted: 20 December 2018
Published: 07 February 2019
DOI: https://doi.org/10.1038/s41598-018-38231-w
