Extended Data Fig. 2: The complexity of trained models. | Nature

From: In silico saturation mutagenesis of cancer genes

a, Schematic depicting the calculation of a model's complexity index from the assessed compressibility of the SHAP values of its 18 mutational features. Specifically, the smaller the number of principal components needed to explain a given fraction of the variance, the smaller the complexity index. b, Effect of reducing the number of observed mutations available for training on the performance and complexity index of three models. Circles and bars represent the median and IQR of F50 values computed from the cross-validation. As the set of mutations available to train the models shrinks, both the performance and the complexity index of the resulting models decrease, indicating that as more cancer genomes are sequenced, more good-quality specific boostDM models will come within reach. c, Comparison of the cross-validation performance of TP53 high-confidence models (median F50 and IQR of the values obtained from the base classifiers) with models trained and tested on a randomly drawn 90% of the samples of each tumour type, with the remaining 10% held out as external validation datasets. d, Distribution of the median F50 of cross-validation and re-trained models (as in c) for the 155 cancer gene–tumour type combinations for which re-training with 90% of the original samples is possible while respecting the conditions set out in Extended Data Fig. 1a. Boxplots: centre line, median; box limits, first and third quartiles; whiskers, lowest and highest data points within 1.5× IQR below the first quartile and above the third quartile, respectively.
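The complexity index in panel a rests on a compressibility argument: if the SHAP values for the 18 features are largely redundant, a few principal components capture most of their variance, and the model is deemed simple. A minimal sketch of this idea is shown below; the 90% variance threshold and the exact definition of the index are illustrative assumptions, not the authors' published formula.

```python
# Hypothetical sketch of a PCA-based complexity index computed from the
# compressibility of a SHAP matrix (mutations x 18 features).
# The var_threshold value is an assumption for illustration only.
import numpy as np

def complexity_index(shap_values: np.ndarray, var_threshold: float = 0.9) -> int:
    """Number of principal components needed to explain var_threshold
    of the variance of the SHAP matrix. Fewer components mean a more
    compressible matrix and hence a lower complexity index."""
    centered = shap_values - shap_values.mean(axis=0)
    # Singular values of the centered matrix give the PCA variance spectrum.
    s = np.linalg.svd(centered, compute_uv=False)
    explained = (s ** 2) / np.sum(s ** 2)
    cumulative = np.cumsum(explained)
    return int(np.searchsorted(cumulative, var_threshold) + 1)

rng = np.random.default_rng(0)
# A low-rank matrix compresses into few components (low complexity) ...
low_rank = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 18))
# ... whereas full-rank noise requires many components (high complexity).
noisy = rng.normal(size=(500, 18))
print(complexity_index(low_rank), complexity_index(noisy))
```

Under this sketch, two simulated SHAP matrices with identical shape can receive very different indices depending solely on how correlated their feature attributions are, which mirrors the intuition described in the legend.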