(a) Graph of the CNN model architecture. (b) Example of 5-fold cross-validation using only the training dataset, further analyzed in the subsequent two panels of this figure. A similar scheme was used to optimize hyperparameters for the CNN model, albeit with 3-fold cross-validation to allow for larger training sets in each split. (c) Model loss, measured as root mean squared error, for training and test data over 30 training epochs. Each line represents one of the 5 splits diagrammed in panel b. The final models used for our predictions were trained for 8 epochs, as additional cycles only reduced training loss without significant improvement in validation loss (i.e., the model began to overfit). (d) Stability of the model with different input data. For each split in panel b, 20 independent CNN models were trained for 8 epochs on the same data. The root mean squared error on the test set for each model is plotted as a blue dot. Box plots indicate the interquartile range of each distribution. (e) Model loss for the final CNN ensemble. Each line represents one of 20 models trained for 8 epochs on the entire training set. (f) Explained variance of validation sgRNA relative activities for each individual model (black) and for the mean prediction of all 20 models (red). n = 5,241 sgRNAs evaluated for each model; r² = squared Pearson correlation coefficient. (g) Validation error stratified by mismatch position. (h) Validation error stratified by mismatch type. (i) Comparison of CNN prediction error (difference between measured and predicted activity) and off-target specificity score for all sgRNAs in the validation set. Off-target specificity scores were calculated using CRISPRi relative activities as described in the Methods. n = 5,241 sgRNAs; r = Pearson correlation coefficient. (j) Partitioning of sgRNAs into bins based on relative activity in the large-scale K562 screen.
(k) Confusion matrix showing the fraction of sgRNAs in each actual (measured) activity bin that were assigned to each predicted bin by the CNN model. Each row sums to 1. (l) Number of sgRNAs that must be randomly sampled from each activity bin to achieve a given probability of selecting at least one sgRNA with true activity in that bin. Simulations are based on the probabilities outlined in the confusion matrix (panel k). (m) Similar to panel l, with random sampling from bin 2 (relative activity 0.37–0.63) to yield at least one sgRNA with intermediate activity (0.1–0.9). We tested several sampling schemes (e.g., drawing from bin 1, 2, 3, or combinations of these) and found that this method empirically gave the highest success rate for selecting sgRNAs with intermediate activities.
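
The sampling calculation behind panels l and m can be illustrated with a minimal sketch. It is not the simulation code used for the figure: it assumes that each sampled sgRNA independently has the same probability `p_hit` of truly falling in the desired activity range, in which case the chance of at least one success in n draws is 1 - (1 - p_hit)^n. The function names, the value 0.5, and the 90% target are hypothetical and are not taken from the confusion matrix.

```python
import random


def min_draws(p_hit: float, target: float) -> int:
    """Smallest n with 1 - (1 - p_hit)**n >= target, assuming independent draws."""
    if not 0.0 < p_hit <= 1.0 or 1.0 - p_hit == 1.0:
        raise ValueError("p_hit must be in (0, 1] and resolvable in floating point")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    miss = 1.0  # probability that every draw so far has missed
    n = 0
    while 1.0 - miss < target:
        miss *= 1.0 - p_hit
        n += 1
    return n


def simulate_success(p_hit: float, n_draws: int,
                     trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(at least one hit in n_draws independent draws)."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < p_hit for _ in range(n_draws))
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    p_hit = 0.5    # hypothetical per-sgRNA success probability
    target = 0.9   # hypothetical desired probability of at least one success
    n = min_draws(p_hit, target)
    print(n, simulate_success(p_hit, n))
```

Under these assumptions, the closed-form count and the simulated success rate should agree to within sampling noise. With probabilities read from an actual confusion matrix, `p_hit` would be the fraction of sgRNAs in a predicted bin whose measured activity falls in the desired range.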