Logic models to predict continuous outputs based on binary inputs with an application to personalized cancer therapy

Mining large datasets using machine learning approaches often leads to models that are hard to interpret and not amenable to the generation of hypotheses that can be experimentally tested. We present ‘Logic Optimization for Binary Input to Continuous Output’ (LOBICO), a computational approach that infers small and easily interpretable logic models of binary input features that explain a continuous output variable. Applying LOBICO to a large cancer cell line panel, we find that logic combinations of multiple mutations are more predictive of drug response than single gene predictors. Importantly, we show that the use of the continuous information leads to robust and more accurate logic models. LOBICO implements the ability to uncover logic models around predefined operating points in terms of sensitivity and specificity. As such, it represents an important step towards practical application of interpretable logic models.


Supplementary Note 1 - Validation on the CTRP dataset
We aimed to validate the logic models inferred by LOBICO on our cell line panel by applying these logic models to the drug response data of another cell line panel: the Cancer Therapeutic Response Portal version 2 (CTRP) [1]. CTRP data was obtained from the supplementary information files of the corresponding main publication, available online on the Cancer Discovery journal website at http://cancerdiscovery.aacrjournals.org/content/early/2015/10/14/2159-8290.CD-15-0235/suppl/DC1 (Supplemental Tables S1-S7, file: 145780_2_supp_3058746_nrhtdz.xlsx). From these files, identifiers of screened cell lines and compounds were extracted and mapped to the cell line and compound identifiers of our study (from now on called GDSC, for Genomics of Drug Sensitivity in Cancer). In total, there were 47 overlapping drugs between GDSC and CTRP. For CTRP, the drug response indicator is the AUC, i.e. the area under the drug/cell-line dose response curve. IC50 values were not available for the CTRP study.

[…]

For each of the two scenarios, we ran LOBICO across the 47 compounds on GDSC using the same settings as for the original analysis. For each drug, we selected the best model according to cross-validation (CV) and applied that model to the cell lines, dividing them into a group that is predicted to be sensitive and a group that is predicted to be resistant. Then, we performed a t-test comparing these two groups, both for the GDSC IC50s and for the CTRP AUCs. We also performed a t-test for the best single predictor model. The t-tests were only performed when the groups consisted of at least 5 cell lines. This led to 37 and 39 drugs for scenario 1 (Train:GDSC344-Validate:CTRP344) and scenario 2 (Train: […] Validate:CTRP344), respectively, that we could test within this framework.
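The group comparison used in this validation can be sketched as follows. This is a minimal illustration with hypothetical response values, computing Welch's t statistic in place of the full t-test (the actual analysis derived p-values from GDSC IC50s and CTRP AUCs):

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b, min_size=5):
    """Welch's t statistic comparing the drug response of cell lines
    predicted sensitive vs. resistant by a logic model.
    Returns None when either group has fewer than min_size cell lines,
    mirroring the filter applied in the validation analysis."""
    if len(group_a) < min_size or len(group_b) < min_size:
        return None
    va, vb = variance(group_a), variance(group_b)
    return (mean(group_a) - mean(group_b)) / math.sqrt(
        va / len(group_a) + vb / len(group_b))

# Hypothetical toy responses (e.g. log IC50s): predicted-sensitive
# cell lines respond at lower concentrations than resistant ones.
sensitive = [1.0, 1.2, 0.8, 1.1, 0.9]
resistant = [3.0, 3.2, 2.8, 3.1, 2.9]
print(welch_t(sensitive, resistant))  # strongly negative => clear separation
```

A large-magnitude t statistic on the held-out CTRP AUCs indicates that the model's sensitive/resistant split transfers across panels.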
This analysis confirmed our hypothesis that the larger variation in FI scores observed across CV folds is due to the inclusion or exclusion of samples with large weights, i.e. those far away from the binarization threshold.


Supplementary Note 3 - Subsampling analysis

We randomly sampled from all 714 cell lines between 90% and 1% of the cell lines in 13 steps (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, 15%, 10%, 5%, 2%, 1%), and repeated this 10 times. Then, for each case we ran LOBICO across all 142 drugs using the original settings. We analyzed the CV errors and the FI scores across the repeats and compared the results to the setting in which we use all (100%) of the cell lines. Based on the CV errors, the FI scores and the permutation test (Methods section), we employed the following five criteria to identify 'robust' and 'predictive' models. All criteria must be met in order to call a model robust and predictive.

1. The number of sensitive cell lines must be larger than 10 - This is a prerequisite for running LOBICO using 10-fold CV and is also the minimum number of cell lines in the positive (sensitive) class. In our cell line panel, for most drugs, the bulk of cell lines are not affected, and only a small percentage (typically 5-15%) end up in the class of sensitive cell lines. When subsampling to 50% (~300 cell lines), 135 (95%) of the models can still be run. When sampling 20% and 10% of the cell lines (~125 and ~65 cell lines, respectively), LOBICO models can only be run for 77 (54%) and 23 (16%) of the models. See Supplementary Figure 9a.

2. The CV error must be smaller than 0.4 - The statistical cutoff of FDR<1% and p<0.01 that we used to identify statistically significant logic models using the permutation test coincided with a CV error of approximately 0.4 (see Figure 2). Therefore, we used this cutoff to identify predictive models. Without subsampling, i.e. using all cell lines, 72 (51%) of the models are predictive. This percentage remains more or less constant when subsampling. See Supplementary Figure 9b.

3. The error across the complete dataset must be smaller than 0.4 - The optimal logic model for each of the subsampling rates and repeats was applied to all cell lines, after which the error across the complete cohort was computed. We required this error to be smaller than 0.4, in which case the logic model inferred using a subset of the samples generalizes over the complete dataset. We observed that, although the CV error remains constant, the error on the complete dataset increases with a smaller subsampling frequency. This indicates that, when using a smaller set of cell lines, the inferred logic models do a good job of explaining the drug response for those cell lines, but do not generalize across a larger panel. For example, when sampling 10% of the cell lines (~65 cell lines), the percentage of predictive models drops to ~30%. See Supplementary Figure 9c.

4. The Pearson correlation of FI scores amongst the CV folds must be higher than 0.7 - We observed that the Pearson correlation coefficients of the similarity of FI scores across the 10 CV folds on the complete dataset were higher than 0.7 for most drugs (Supplementary Figure 8). Therefore, we used this cutoff to identify robust models in the subsampling analysis.

5. The Pearson correlation of FI scores with the complete dataset must be higher than 0.7 - The FI scores obtained for the logic models for each of the subsampling rates and repeats were correlated with the FI scores from the logic models inferred across the complete cohort. We required this correlation to be larger than 0.7, in which case the logic models inferred using a subset of the samples are highly similar to the original logic model based on all samples. We observed that the correlation of these FI scores decreased quickly with a smaller subsampling frequency. This indicates that, when using a smaller set of cell lines, the inferred logic models are different from the original model. For example, when sampling 50% and 20% of the cell lines (~300 and ~125 cell lines, respectively), 55 (41%) and 32 (28%) logic models met this criterion. See Supplementary Figure 9.

Overall, we observed that subsampling had a substantial influence on performance and robustness as defined by our criteria. Specifically, whereas for 51% of the drugs LOBICO inferred logic models that are robust and predictive when using all cell lines, this number decreased to 25% of the drugs when using 50% of the cell lines (around 300 cell lines). With a subsampling frequency of 10% (around 60 cell lines), only 4 (18%) of the drugs had a robust and predictive model. Perhaps not surprisingly, these 4 drugs were amongst the ones with the lowest CV error on the complete dataset.

In conclusion, LOBICO can also effectively be run on smaller sets of cell lines, but in our panel there was only a relatively small number of drugs for which robust and predictive models were found with much smaller sets of cell lines. The subsampling analysis led to two important insights. First, it is important that the classes in the dataset are not strongly unbalanced; if one of the two classes is too small, this leads to non-robust models or even the inability to run LOBICO using CV. Second, only for the highly predictive models, i.e. those with a small CV error, did we find robust and predictive models when using a small set of cell lines. Thus, for smaller sets of cell lines, strong effect sizes are necessary to reach significance. This is something that can potentially be tested with univariate tests before running LOBICO.
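The five criteria above can be collected into a single check. A minimal sketch follows; the metric values passed in are assumed to have been precomputed per drug, model and subsampling repeat:

```python
def robust_and_predictive(n_sensitive, cv_error, full_error,
                          fi_corr_folds, fi_corr_full):
    """Apply the five criteria from the subsampling analysis:
    1. more than 10 sensitive cell lines,
    2. CV error below 0.4,
    3. error on the complete dataset below 0.4,
    4. Pearson correlation of FI scores across CV folds above 0.7,
    5. Pearson correlation of FI scores with the complete-dataset
       model above 0.7.
    All criteria must hold to call a model robust and predictive."""
    return (n_sensitive > 10
            and cv_error < 0.4
            and full_error < 0.4
            and fi_corr_folds > 0.7
            and fi_corr_full > 0.7)

# Hypothetical examples:
print(robust_and_predictive(30, 0.32, 0.35, 0.85, 0.78))  # True
print(robust_and_predictive(30, 0.32, 0.45, 0.85, 0.78))  # False: poor generalization
```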

Supplementary Note 4 - LOBICO on a yeast cross phenotyped for sporulation efficiency

We re-analyzed the genetic linkage map of a cross of two natural yeast strains: a strain isolated from the bark of an oak tree that sporulates at 99% efficiency, and a strain originating from a wine barrel that sporulates at only 3.5% [2]. The genetic linkage map consists of 225 loci genotyped in 374 segregants. For each of the 374 recombinant offspring, the sporulation efficiency was measured as a percentage between 0 and 100. Gerke et al. [2] used composite interval mapping based on a stepwise regression model to find loci that significantly cosegregated with variation in sporulation efficiency, leading to 5 significant loci: L7-9, L10-14, L13-6, L7-17 and L11-2 (Table 1 in [2]). Next, a second stepwise regression was used to select significant predictors from the five loci and all 2- and 3-way interaction terms involving these five loci. The final model included three significant 2-way interaction effects and one 3-way interaction effect. All these interactions were comprised of combinations of the three most significant individual loci, i.e. L7-9, L10-14 and L13-6 (Table S2 in [2]).

We applied LOBICO to this dataset to evaluate which (logical) interaction effects would be uncovered. The genotype information was straightforwardly transformed into binary predictor variables: alleles from the oak strain (wine strain) were set to 1 (0), resulting in a truth table with n = 225 loci and p = 374 segregants. The sporulation phenotype data was binarized by applying a threshold of 50%. Samples were weighted using the distance to this threshold. No specificity and sensitivity constraints were applied. We employed the eight model complexities also used for the cell line panel analysis, i.e. all […]

The largest single effect found in Gerke et al., locus L7-9, is also the best single predictor uncovered by LOBICO (Supplementary Figure 11).
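The encoding just described can be sketched as follows. This is a minimal illustration; whether a phenotype exactly at the threshold counts as sensitive, and any normalization of the sample weights, are assumptions not specified in the text:

```python
def encode_segregant(alleles, sporulation_pct, threshold=50.0):
    """Encode one recombinant offspring for LOBICO.
    alleles: per-locus strings, 'oak' -> 1 and 'wine' -> 0.
    The continuous phenotype (0-100% sporulation efficiency) is binarized
    at the threshold, and the sample weight is the distance to it."""
    x = [1 if a == 'oak' else 0 for a in alleles]
    y = 1 if sporulation_pct >= threshold else 0  # >= at threshold: assumption
    w = abs(sporulation_pct - threshold)
    return x, y, w

# Hypothetical segregant genotyped at three loci:
x, y, w = encode_segregant(['oak', 'wine', 'oak'], 99.0)
print(x, y, w)  # [1, 0, 1] 1 49.0
```

Offspring with phenotypes near 50% thus contribute little to the objective, while extreme sporulators dominate it.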
The two-input AND model found by LOBICO consisted of loci L7-9 and L10-14. This interaction, which is also one of the 2-way interaction effects found in Gerke et al., has a much higher specificity and precision than the single locus model, although a smaller recall. Many of the offspring with the highest sporulation efficiency have both the L7-9 and the L10-14 locus from the oak strain. The best model according to CV is a 2-by-2 model, which contains the same three loci as the interaction effects found in Gerke et al., i.e. L7-9, L10-14 and L13-6. (Actually, the LOBICO 2-by-2 model contained L13-7 instead of L13-6; they are highly correlated. The fourth feature in the 2-by-2 model is L7-11, which is highly correlated to L7-9.) Thus, LOBICO finds interactions between the same three loci as the regression model employed by Gerke et al.

It is important to point out that LOBICO uncovered these interactions using the complete dataset of 225 loci, and not by first filtering on individual features as was done in Gerke et al. Surely, the (biological) interpretation of the logic model and the additive linear model is quite different. We would argue that the logic model is more intuitive and sensible than the linear model.


Supplementary Note 5 - NP-completeness of LOBICO

[…] The decision version of LOBICO is as follows: Given inputs X, y, w, K, M and a parameter ε, does there exist a logic function Θ expressed in DNF(K, M), i.e. a DNF having at most K disjunctive terms and M literals, such that the weighted sum of incorrectly inferred samples as described in Equation 1 is less than or equal to ε? Clearly, the problem is in NP. It is easily shown that this problem is also NP-hard by the following polynomial-time reduction from BFSP: given an instance of BFSP, we construct an instance for LOBICO by deriving X from the minterms, y from the […].
Since the BFSP decision problem is NP-complete, the LOBICO decision problem is also NP-complete.


Supplementary Note 6 - Comparison with logic regression

Logic regression (LR) [4,5] is a generalized regression methodology that can be applied to data with binary predictors, although continuous predictors are also allowed. The goal of LR is to find linearly weighted logic combinations of the original predictors that explain a continuous response variable or class label. We configured the implementation of LR, i.e. the R-package 'logreg', such that it infers logic models with a predefined model complexity. Specifically, logreg has a scoring function for classification using sample-specific weights, which we used to give it the same objective function as LOBICO (Equation 1). Also, logreg can be configured to output a single logic model (a tree) with predefined logical operators and size. (See below for experimental details.)

We ran LR for each of the 142 drugs in the cancer cell line panel using the model complexity (defined by K and M) selected by CV for the associated drug when using LOBICO. LR was run on the same computers (Intel(R) Xeon(R) CPU E5645, 2.40GHz, 6 cores) as LOBICO and was given the same amount of CPU time (Supplementary Figure 18a). Then, we evaluated the logic formulas inferred by LR. Specifically, we looked at the Jaccard similarity of the selected predictors in the inferred LOBICO and LR models. (In computing the Jaccard similarity, negated terms, e.g. ¬TP53, are treated as separate predictors from their positive equivalents.) For models with K=1 and/or M=1, a Jaccard similarity of 1 indicates that the exact same logic formula was found. For K=2 and M=2 (the 2x2 models) this is not necessarily the case, but we were not able to restrict logreg to output a DNF with K=2 and M=2. […] In the cases where the logic formulas differed between LOBICO and LR, we found upon further inspection that the formula inferred by LR had the same (optimal) error as the LOBICO solution. These LR solutions were present in LOBICO's solution pool, i.e. they were part of the set of (sub-)optimal solutions output by LOBICO. (See experimental details below.)

In conclusion, when logreg parameters are properly set, LR can find the optimal solution when given the same amount of time that was necessary for LOBICO to find the optimal solution on the cancer cell line dataset. Potentially, LR finds this solution faster than LOBICO on this dataset.
It is, however, important to point out that LR cannot guarantee that the obtained solution is optimal, and it is known that ILP solvers spend a long time proving that the found solution is indeed optimal. In future work, we will investigate the performance of LR and LOBICO on other (larger) datasets, and assess how the two methods can be used in parallel to find optimal solutions faster. For example, we will investigate whether LR can be used to identify initial starting models for LOBICO. Importantly, LR cannot incorporate statistical performance constraints, such as sensitivity and specificity (Equations 10-13), which we assert is the preferred and, in practice, most relevant scenario for LOBICO inferences. Additionally, in contrast to LR, LOBICO can output the pool of (sub-)optimal solutions, which we used to measure feature importance.

Experimental details: LR was run using the R-package logreg (version 1.5.8). […]
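The Jaccard comparison of selected predictors described above can be sketched as follows (gene names hypothetical; here '~' marks a negated predictor, kept distinct from its positive form as in the analysis):

```python
def jaccard(predictors_a, predictors_b):
    """Jaccard similarity of the predictor sets of two logic models:
    |intersection| / |union|. Negated predictors (e.g. '~TP53') are
    treated as separate features from their positive equivalents."""
    a, b = set(predictors_a), set(predictors_b)
    return len(a & b) / len(a | b)

print(jaccard({'BRAF'}, {'BRAF'}))                   # 1.0: identical models
print(jaccard({'BRAF', '~TP53'}, {'BRAF', 'TP53'}))  # ~0.33: negation differs
```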

To infer a LR model with the same model complexity as LOBICO, we made sure that for 2-, 3- and 4-input AND models only AND operators were allowed. Similarly, for 2-, 3- and 4-input OR models only OR operators were allowed. For 2x2 models we allowed both AND and OR operators. The parameters of the logic 'tree' shape were set as follows (R-code): […]

LR was run to output one logic tree (ntrees=1), where the maximum number of leaves was set to K x M (nleaves=K*M). In the R-code below, Y is y, X is X and W is w as used in the Methods section and Equation 1. […]


Supplementary Note 7 - Comparison with elastic net and random forest

We observed a large concordance between LOBICO's FI scores and the EN regression weights (Supplementary Figure 12a). In the EN models, the large majority (63%) of all regression weights across the 60 features and 25 drugs were 0. Importantly, all of the important features according to LOBICO (FI>0.05) had a non-zero regression weight in EN. Moreover, the smallest EN regression weight for which the corresponding LOBICO FI was larger than 0.05 was 0.2752 (blue line in Supplementary Figure 12a), which was in the tail of the EN weights. Similarly, for RF, we observed a high degree of correlation between LOBICO's FI scores and the RF importance scores (Supplementary Figure 12b). […]

[Supplementary figure legend] -log10 p-values for t-tests that quantify the difference between cell lines predicted to be sensitive and resistant according to LOBICO. In case the best model is a multi-predictor model, the -log10 p-value for the best single predictor model is also depicted, and the single-predictor formula is stated in parentheses behind the formula of the multi-predictor model. In case the single-predictor model led to a lower p-value, yet according to CV the multi-predictor model was better, the formula of the multi-predictor model is stated between parentheses. The 39 drugs are sorted based on the t-test p-value derived from the GDSC370 IC50s.

[Supplementary figure legend] Of the 72 statistically significant logic models at FDR<1% and p<0.01, we selected those for which at least one gene mutation feature had a FI score of 0.1 or higher. The statistical cutoff of FDR<1% and p<0.01 that we used to identify statistically significant logic models coincided with a CV error of approximately 0.4 (see Figure 2).
Since the FI score for a feature is the increase in error when the feature is left out of the inferred logic model, and the randomly expected CV error is 0.5, we classified gene mutation features with a FI score larger than 0.1 as 'essential' features for explaining the drug response. This heatmap visualizes the FI scores of those features, with the drugs on the rows and the features on the columns. The number in parentheses behind the gene labels indicates the number of drugs for which these genes had an FI score larger than 0.1. These results show that most gene mutations are essential for not more than one drug, but that BRAF (6) […]
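The per-gene counts shown in the heatmap gene labels can be reproduced schematically as follows (a minimal sketch with hypothetical FI values):

```python
from collections import Counter

def essential_counts(fi_scores, cutoff=0.1):
    """Given FI scores as {drug: {gene: fi}}, count for each gene the
    number of drugs where it is 'essential' (FI above the cutoff)."""
    counts = Counter()
    for gene_scores in fi_scores.values():
        for gene, fi in gene_scores.items():
            if fi > cutoff:
                counts[gene] += 1
    return dict(counts)

# Hypothetical FI scores for two drugs:
fi = {'drug_a': {'BRAF': 0.35, 'TP53': 0.02},
      'drug_b': {'BRAF': 0.12, 'KRAS': 0.08}}
print(essential_counts(fi))  # {'BRAF': 2}
```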