Label noise assessment.
(a,b) Label noise estimates for each of the metrics used in CCLE (a) and for each data set (b) were computed by training elastic net regression models on the true labels and on permuted labels using a bootstrapping procedure. We tested whether the mean squared error (MSE) of the models trained on the true labels was significantly lower than the MSE of the models trained on the permuted labels. For each metric in the CCLE data set (a), the cumulative distribution plot shows the number of drugs passing different noise thresholds. Activity area had the lowest level of label noise compared to Amax, IC50 and EC50. The numbers of drugs passing a threshold of P < 0.01 for activity area, Amax, IC50 and EC50 were 22, 22, 19 and 14, respectively. For each data set (b), the cumulative distribution plot shows the proportion of drugs passing different noise thresholds. This analysis found that 91.7%, 59.9% and 74.5% of drugs in the CCLE (activity area), CTD2 (area under the dose-response curve) and NCI60 (−log10(GI50)) data sets, respectively, had true labels that significantly outperformed the permuted labels (P < 0.01, one-sided Wilcoxon rank sum test). (c) Scatter plot of CTD2 drug response dynamic range (interquartile range) versus label noise (−log10(P)); Pearson correlation coefficient = 0.63.
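The true-versus-permuted comparison described in (a,b) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the out-of-bag evaluation scheme, the elastic net hyperparameters (`alpha`, `l1_ratio`), the number of bootstrap replicates and the synthetic data are all assumptions made for illustration. The one-sided Wilcoxon rank sum test is computed here with SciPy's `mannwhitneyu`, which is the equivalent Mann-Whitney U test.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.linear_model import ElasticNet


def bootstrap_mse(X, y, rng, n_boot=30, permute=False):
    """Out-of-bag MSE of an elastic net across bootstrap resamples.

    If permute is True, the labels are shuffled before each resample,
    which breaks any relationship between features and response.
    """
    n = len(y)
    errors = []
    for _ in range(n_boot):
        labels = rng.permutation(y) if permute else y
        train = rng.integers(0, n, size=n)        # sample with replacement
        test = np.setdiff1d(np.arange(n), train)  # out-of-bag samples
        if test.size == 0:
            continue
        # Hyperparameters are illustrative assumptions, not from the source.
        model = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=5000)
        model.fit(X[train], labels[train])
        residuals = labels[test] - model.predict(X[test])
        errors.append(float(np.mean(residuals ** 2)))
    return np.array(errors)


def label_noise_pvalue(X, y, n_boot=30, seed=0):
    """One-sided P value that true-label MSE is lower than permuted-label MSE."""
    rng = np.random.default_rng(seed)
    mse_true = bootstrap_mse(X, y, rng, n_boot, permute=False)
    mse_perm = bootstrap_mse(X, y, rng, n_boot, permute=True)
    return mannwhitneyu(mse_true, mse_perm, alternative="less").pvalue


# Synthetic stand-in for one drug: 80 cell lines, 20 features (assumed sizes).
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(80, 20))
y_signal = X[:, :3] @ np.array([2.0, -1.0, 1.5]) + 0.3 * data_rng.normal(size=80)
y_noise = data_rng.normal(size=80)  # response unrelated to the features

p_signal = label_noise_pvalue(X, y_signal)
p_noise = label_noise_pvalue(X, y_noise)
print(f"learnable response:  P = {p_signal:.3g}")
print(f"pure-noise response: P = {p_noise:.3g}")
```

Under these assumptions, a drug whose response is predictable from the features yields a small P value and therefore a large −log10(P), the quantity plotted on the label noise axis in (c), whereas a response unrelated to the features generally does not pass the threshold.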