Agreement between two large pan-cancer CRISPR-Cas9 gene dependency data sets

Genome-scale CRISPR-Cas9 viability screens performed in cancer cell lines provide a systematic approach to identify cancer dependencies and new therapeutic targets. As multiple large-scale screens become available, a formal assessment of the reproducibility of these experiments becomes necessary. We analyze data from recently published pan-cancer CRISPR-Cas9 screens performed at the Broad and Sanger Institutes. Despite significant differences in experimental protocols and reagents, we find that the screen results are highly concordant across multiple metrics with both common and specific dependencies jointly identified across the two studies. Furthermore, robust biomarkers of gene dependency found in one data set are recovered in the other. Through further analysis and replication experiments at each institute, we show that batch effects are driven principally by two key experimental parameters: the reagent library and the assay length. These results indicate that the Broad and Sanger CRISPR-Cas9 viability screens yield robust and reproducible findings.


Introduction
Despite recent advances in cancer science, most cancer patients still have no clinical indications for approved targeted therapies 1 . Expanding precision oncology to the general patient population will require identifying and exploiting many new genomic targets. To tackle this problem, large-scale pharmacogenomic screenings have been performed across panels of human cancer cell lines 2,3 . The advent of genome editing by CRISPR-Cas9 technology has allowed extending these studies beyond currently druggable targets with precision and scale 4,5 . Pooled CRISPR-Cas9 screens employing genome-scale libraries of single guide RNAs (sgRNAs) are being performed on growing numbers of cancer in vitro models [6][7][8][9][10][11][12] . The output of these screens can be used to identify and prioritize new cancer therapeutic targets 13 . However, fully characterizing genetic vulnerabilities in cancers is estimated to require thousands of genome-scale screens 14 .
We present a comparative analysis of datasets derived from the two largest independent CRISPR-Cas9 based gene-dependency screening studies in cancer cell lines published to date 13,15,16 , part of the Cancer Dependency Map effort 17,18 . The aim of this analysis was to assess the concordance of these datasets and that of the analytical outcomes they yield when investigated individually. To this aim, we designed a computational strategy including comparisons at different levels of data-processing and abstraction: from gene-level dependencies to molecular markers of dependencies, and genome-scale cell line profiles of 2 dependencies. Lastly, we shed light on the differences in the experimental settings that give rise to batch effects across independent studies of this kind, discerning between biological and technical confounding factors.

Overview of datasets and comparison strategy
We compared two sets of pooled genome-scale CRISPR-Cas9 drop out screens in cancer cell lines, generated at the Broad Institute and the Sanger Institute through experimental pipelines (detailed in We calculated gene dependency scores using three different strategies. First, we considered fully processed gene scores, available for download from the Broad 17 and Sanger 13,18 Cancer Dependency Map web-portals 17 13,18 ( processed data). Because data processing pipelines vary significantly between the two datasets, we also examined minimally processed gene scores, generated by computing median sgRNA abundance fold changes for each gene ( unprocessed data). Lastly, we applied an established empirical Bayesian batch correction method (ComBat) 19 to the unprocessed gene scores to remove experimental batch effects between the datasets. ComBat aligns gene means and variances between the datasets, thereby eliminating simple batch effects. We refer to this form of the data as the batch-corrected gene scores. We further tested whether it was possible to recover consistent sets of common dependencies. To this end, we defined as "common dependencies" those genes that rank among the top dependencies when considering only their 90th percentile of least dependent cell lines, with the score threshold for "top" dependencies determined by the local minimum in the data ( Fig. 1c ). For the unprocessed data, the Broad and Sanger jointly identify 1,031 common dependency genes ( Supplementary Table 3  in unprocessed data) were due to the single gene HSPA5, which is an SSD in Sanger data but a common dependency in Broad data. Examining SSDs individually, we found median Cohen's kappa for sensitivity to individual SSDs of 0.437 in processed, 0.661 in unprocessed, and 0.735 in batch-corrected data. In unprocessed data, 65% of SSDs had Cohen's kappa greater than 0.4, as opposed to 0.17% seen by chance ( Supplementary Fig. 1c ).

Agreement of cell line dependency profiles
Previous literature on reproducibility highlighted the importance of considering agreement along both the perturbation and cell line axes of the data [23][24][25] . We assembled a combined dataset of cell line dependency profiles from both studies and computed all possible pairwise correlation distances between them, using genes that were dependencies in at least one cell line ( variable genes). A t-distributed stochastic neighbor embedding (tSNE) 26 visualization derived from these distance scores is shown in Fig. 2d . For the uncorrected data, we observed a perfect clustering of the dependency profiles by their study of origin, confirming a major batch effect. However, following batch correction, we observed integration between studies and increased proximity of cell lines from one study to their counterparts in the other study ( The batch correction also aligned numbers of significant (at 5% FDR) dependencies across cell lines between the two datasets (median number of dependencies 2,109 and 1,717 before, and 2,053 and 1,950 after correction, for Broad and Sanger respectively, Supplementary Fig. 3a ). The average proportion of dependencies detected in both studies over those detected in at least one study also 7 increased across cell lines from 47.75% to 59.14%. Furthermore, the correlation between cell lines after correction rose above the correlation within each individual screen for each gene set considered ( Supplementary Fig 3b ). We finally examined whether the residual disagreement in corrected data might be related to screen quality. We assessed screen quality by computing true positive rates (TPRs) for recovering common essential genes in each cell line with a fixed 5% false discovery rate (FDR), determined from the distribution of nonessential genes in the cell line. We found that mean screen quality is a strong predictor of screen agreement for both the uncorrected and batch-corrected data sets

Agreement of gene dependency biomarkers
A selective dependency is of limited therapeutic value unless it can be reliably associated with an informative molecular feature of cancer ( biomarker ). Following a similar approach to that presented in 21 , we performed a systematic test for molecular-feature/dependency associations on the two datasets. Cell lines were split into two groups based on the status of 587 molecular features derived from Iorio et al. 27 , encompassing somatic mutations in high-confidence cancer driver genes, amplifications/deletions of chromosomal segments recurrently altered in cancer, hypermethylated gene promoters, microsatellite instability status and the tissue of origin of the cell lines ( Supplementary Table 5 ). For each feature in turn, all SSD genes were sequentially t-tested for significant differences in dependency scores between the obtained two groups of cell lines.
These tests yielded 71 out of 29,350possible significant associations (FDR < 5%, ΔFC < -1) between molecular features and gene dependency when using the Broad unprocessed data, and 90 when using the Sanger unprocessed data ( Supplementary Table 6 ). Of these, 55 (77% of the Broad associations and 61% of the Sanger ones) were found in both datasets (FET p-value = 9.08 x10 -133 , Fig. 3a and Supplementary Tables 6-7 ). The concordance between the associations identified by each study was proportional to the threshold used to define significance. This was assessed by considering for each study, in turn, the associations in a fixed quantile of significance and measuring the tendency of these associations to be among the most significant in the other study ( Fig. 3b ). Further, the overall correlation between differences in gene depletion FCs was equal to 0.763, and 99.2% of associations had the same sign of differential dependency across the two studies.
Gene dependency associations identified with both datasets included expected as well as potentially novel hits. Examples of expected associations included increased dependency on ERBB2 in ERBB2-amplified cell lines, and increased dependency on beta-catenin in APC mutant cell lines. A potentially novel association between FAM72B promoter hypermethylation and beta-catenin was also identified ( Fig. 3c ).

10
We also considered gene expression to mine for biomarkers of gene dependency using RNA-seq datasets maintained at Broad and Sanger institutes. To this aim, we considered as potential biomarkers 1,987 genes from intersecting the top 2,000 most variable gene expression levels measured by either institute. Clustering the RNA-seq profiles revealed that each cell line' transcriptome matched closest to its counterpart from the other institute ( Supplementary Fig. 4a ).
We correlated the gene expression level for the most variably expressed genes to the gene dependency profiles of the SSD genes. Systematic tests of each correlation showed significant associations between gene expression and dependency ( Fig. 3d ). As with the genomic biomarkers, we found a strong overall correlation between gene expression markers and SSD genes dependency across datasets, Pearson's correlation 0.804 and significantly high overlap between gene expression biomarkers identified in each dataset (Fisher's exact test p-value below machine precision). We observed both positive and negative correlations; for example, ERBB2 dependency score was positively correlated with its expression, while ATP6V0E1 showed significant dependency when its paralog ATP6V0E2 had a low expression ( Fig. 3e ). Additionally, previous studies have shown that some gene inactivations results in cellular fitness reduction only in lengthy experiments 11 . Accordingly, we selected the sgRNA library and the timepoint of viability readout for primary investigation as causes of major batch effects across the two compared studies.
To elucidate the role of the sgRNA library, we examined the data at the level of individual sgRNA scores.
The correlation between log fold change patterns of reagents targeting the same gene ("co-targeting") across studies was related to the selectivity of the average dependency of that gene (as quantified by a "Likelihood Ratio Test" -normLRT -score 22 , Fig. 4a ): a reminder that most co-targeting reagents show low correlation because they target genes exerting little phenotypic variation. However, even among SSDs there was a clear relationship between sgRNA correlations within and between datasets ( p = 4.9 x 10 -10 , N = 49; Fig. 4b ). In such cases, poor reagent efficacy at one or both datasets may explain the discrepancy. We estimated the efficacy of each sgRNA in both libraries using Azimuth 2.0 29 which uses only information about the genome in the region targeted by the sgRNA. We found that among genes identified as common dependencies in either dataset, mean sgRNA depletion indeed had a strong relationship to its Azimuth estimated efficacy ( Fig. 4c ). When we examined SSDs, we found that reagent efficacy likely explains some differences, for example in EIF3F (common essential in Sanger screens, 13 nonscoring in Broad screens) and MDM2 (strongly selective in Broad screens, correlated but not strongly selective in Sanger screens) ( Fig. 4d ).
We next investigated the role of different experimental timepoints on the screens' agreement. Given that the Broad used a longer assay length (21 days versus 14 days) we expected differences to be observed between late dependencies across the datasets. Therefore, we compared the distribution of gene scores for genes known to exert a loss of viability effect upon inactivation at an early-or late-time ( early or late dependencies ) 11 . While early dependencies have similar score distributions in both datasets (median average score -0.781 at the Sanger and -0.830 at the Broad), late dependencies are more depleted at the Broad with median average score -0.402 compared to -0.269 for the Sanger screens ( Fig. 5a ).
Unlike differences in sgRNA efficacy, timepoint effects are expected to lead to uniformly greater signal (typically depletion) in the Broad data and to be related to the biological role of late dependencies. We functionally characterized, using gene ontology (GO), genes that were exclusively detected as depleted in individual cell lines (at 5% FDR), in one of the two studies, excluding genes with significantly different sgRNA efficacies between libraries. Results showed 29 gene ontology categories significantly enriched in the Broad-exclusive dependencies genes (Broad-exclusive GO terms) for more than 50% of cell lines ( Fig. 5b and Supplementary Table 8) . The Broad-exclusive enriched GO terms included classes related to mitochondrial and RNA processing gene categories and other gene categories previously characterized as late dependencies 11 . In contrast, no GO terms were significantly enriched in the Sanger-exclusive common dependencies in more than 30% of cell lines.  replication screens, we found that this effect is principally due to library choice, with time-point playing a smaller role ( Fig. 6a , Supplementary Fig. 5a ). Changing from Sanger to Broad clones of HT-29 had minimal impact. We examined the change in gene score profile for each screen caused by changing either library or time-point while keeping other conditions constant. Gene score changes induced by either library or timepoint alterations were consistent across multiple conditions ( Fig. 6b ). Sanger-exclusive common dependencies were strongly enriched for genes that became more depleted with the KY library, and Broad-exclusive common dependencies were enriched among genes more depleted with the Avana library ( Supplementary Fig. 5b ). Late dependencies were strongly enriched among genes that became more depleted in the later time-points, while early dependencies were not ( Supplementary Fig. 5c ). We  Replicates were then median-collapsed to produce gene-by cell-line matrices.
"Processed" Gene Scores Broad gene scores are taken from avana_public_19Q1 gene_effect 30 and reflect CERES 31 processing.
The scores were filtered for genes and cell lines shared between institutes and with the unprocessed data, then shifted and scaled so the median of nonessential genes in each cell line was 0 and the median of essential genes in each cell line was -1 20 . Sanger gene scores were taken from the quantile-normalized averaged log fold-change scores and globally rescaled by a single factor so that the median of essential genes across all cell lines is negative one. 20 "Batch-Corrected" Gene Scores The unprocessed sgRNA log FCs were mean collapsed by gene and replicates. Data were quantile normalized for each institute separately before processing with ComBat using the R package sva. One batch factor was used in ComBat defined by the institute of origin. The ComBat corrected data was then quantile normalized to give the final batch-corrected data set.

Alternate Conditions
Screens with alternate libraries, cell lines, and timepoints were processed similarly to the "Unprocessed" data above.

Gene Expression Data
Gene expression log 2 (Transcript per million+1) data was downloaded for the Broad from Figshare for the Broad data set. For the Sanger dataset, we used reads per kilobase million (RPKM) expression data from the iRAP pipeline. We added a pseudo-count of 1 to the RPKM values and transformed to log 2 . Gene expression values are quantile normalized for each institute separately. For the Sanger data, Ensembl gene ids were converted to Hugo gene symbols using BiomaRt package in R.

Guide Efficacy Estimates
On-target guide efficacies for the single-target sgRNAs in each library were estimated using Azimuth 2.0 29 against GRCh38.

Comparison of All Gene Scores
Gene scores from the chosen processing method for both Broad and Sanger were raveled and Pearson correlations calculated between the two datasets. 100,000 gene-cell line pairs were chosen at random and density-plotted against each other using a Gaussian kernel with the width determined by Scott's rule 32 . All gene scores for essential genes were similarly plotted in Fig. 1b .

Comparison of Gene Means
Cell line scores for each gene in both Broad and Sanger datasets with the chosen processing method were collapsed to the mean score, and a Pearson correlation calculated.

Gene Ranking, Common Essential Identification
For each gene in the chosen dataset, its score rank among all gene scores in its 90th percentile least depleted cell line was calculated. We call this the gene's 90th percentile ranking. The density of 90th-percentile rankings was then estimated using a Gaussian kernel with width 0.1 and the central point of minimum density identified. Genes whose 90th-percentile rankings fell below the point of minimum density were classified as essential .

Identification of Selective Gene Sets
Selective dependency distributions across cell lines are identified using a "Likelihood Ratio Test" as described in McDonald et al 22 . For each gene, the log-likelihood of the fit to a normal distribution and a skew-t distribution is computed using R packages MASS 33  For the comparison of cell line profiles agreement in relation to initial data quality. First, to estimate the initial data quality we calculated True Positive Rates (TPRs, or Recalls) for the sets of significant dependency genes detected across cell lines, within the two studies. To this aim, we used as positive control a reference set of a priori known essential genes 12 . We assessed the resulting TPRs for variation before/after batch correction, and for correlations with the inter-study agreement.

Biomarker Analysis
We used binary event matrices based on mutation data, copy number alterations, tissue of origin and MSI status. The resulting set of 587 features were present in at least 3 different cell lines and fewer than 144.
We performed a systematic two-sample unpaired Student's t-test (with the assumption of equal variance between compared populations) to assess the differential essentiality of each of the SSD genes across a dichotomy of cell lines defined by the status (present/absent) of each CFE in turn. From these tests we obtained p-values against the null hypothesis that the two compared populations had an equal mean, with the alternative hypothesis indicating an association between the tested CFE/gene-dependency pair.
P-values were corrected for multiple hypothesis testing using Benjamini-Hochberg. We also estimated the For the analysis involving transcriptional data, we used the RNA-seq data from each institute for overlapping cell lines, which includes some sequencing files that have been used by both institutes and processed separately.
Rank-based dependency significance and agreement quantification To identify significantly depleted genes for a given cell line, we ranked all the genes in the corresponding essentiality profiles based on their depletion logFCs (averaged across targeting guides), in increasing order. We used this ranked list to classify genes from two sets of prior known essential ( E ) and non-essential ( N ) genes, respectively 12 .
For each rank position k , we determined a set of predicted genes P(k) = {s ∈ E ∪ N : ϱ(s) ≤ k} , with ϱ(s) indicating the rank position of s , and the corresponding precision PPV(k) as:

PPV(k) = |P(k)∩ E| / |P(k)|
Subsequently, we determined the largest rank position k* with P(k*) ≥ 0.95 (equivalent to a False Discovery Rate (FDR) ≤ 0.05). Finally, a 5% FDR logFCs threshold F* was determined as the logFCs of the gene s such that ϱ(s) = k* , and we considered all the genes with a logFC < F* as significantly depleted at 5% FDR level. For each cell line, we determined two sets of significantly depleted genes (at 5% FDR): B and S , for the two compared datasets, respectively. We then quantified their agreement

Relating sgRNA Depletion and Efficacy
We chose the set of genes found to be essential in at least one unprocessed dataset. The log fold-change of guides targeting those genes in each dataset was calculated and compared to the guide's estimated on-target efficacy.
Timepoint Gene Ontology Analysis We tested for enrichment of GO terms associated with genes showing a significant depletion in only one institute. To rule out the differences due to library, genes with significantly different guide efficacies were filtered from the analysis. Using the Azimuth scores average (mean) efficacy scores for each gene at each institute were calculated. A null distribution of differences in gene efficacy was estimated using genes not present in either institute specific sets (which were defined as depleted in at least 25% of cell lines). Institute specific genes greater than 2 standard deviations from the mean of the null distribution were removed.
For the filtered gene set prior known essential and non-essential gene sets from 37 were used to find significant depletions for each cell line and institute at 5% FDR. For each cell line, the genes identified as significantly depleted in only Broad or only Sanger were functionally characterized using Gene Ontology (GO) enrichment analysis 38 . To this aim, we downloaded a collection of gene sets (one for each GO category) from the Molecular Signature Database (MsigDB) 39 , and performed a systematic hypergeometric test to quantify the over-representation of each GO category for each set of study-exclusive dependency genes, per cell line. We corrected the resulting p-values for all the tests performed within each study using the Benjamini-Hochberg procedure 40 , and considered a GO category enriched in a cell line if the corrected p-value resulting from the corresponding test was < 0.05.

Principal Component Analysis of the Batch Effect and Alternate Conditions
The Broad and Sanger unprocessed gene scores and the gene scores for the alternate conditions tested by both institutes were concatenated into a single matrix with a column for each screen. Principal components were found for the transpose of this matrix, where each row is a screen and each column a pseudogene. Components 1 and 2 were plotted for all original screens and the alternate screens for either HT-29 (Fig. 6a) or JIMT-1 ( Supplementary Fig. 6a ). The aspect ratio for the plot was set to match the relative variance explained by the first two principal components. 26

Consistency of Timepoint and Library Effects on Gene Scores
To evaluate library differences, we took all screens that had been duplicated in each library with all other conditions (timepoint, clone, and screen location) kept constant. For each of these screens, we subtracted the gene scores of the version performed with the KY library from the version performed with the Avana library to create library difference profiles. For the case of Sanger's day-14 KY screen of the Sanger HT-29 clone, two versions exist, the original and an alternative that was eventually grown out to 21 days.
We used the alternate version of this screen to be consistent with the day 21 results. A correlation matrix of library difference profiles was then calculated and is plotted in the left of Fig. 6b. The procedure was repeated for timepoint differences, creating timepoint difference profiles by subtracting day 14 results from day 21 results for pairs of screen readouts that differed in timepoint but not library, clone, or screen location.

Conditions
For the cell line HT-29, we took Sanger's original screen as a baseline. We then subtracted from this baseline from four Broad HT-29 screens: the original (Avana library at day 21), then with the Avana library at day 14, the KY library at day 21, and the KY library at day 14, generating four arrays indexed by gene which form the y-axes in the succession of plots in Fig. 6c. We also computed the mean score of each gene across all original Broad screens and subtracted it from the mean score of each gene across all the original Sanger screens to form the x-axis of all four plots. For each condition, the standard deviation of the HT-29 screen differences (y-axes) was computed along with the correlation of the HT-29 screen differences with the mean differences (x-axis). The plots themselves are Gaussian kernel density estimates. We repeated this process for JIMT-1 ( Supplementary Fig. 6d ) and then for HT-29 while swapping the roles of Broad and Sanger (Fig. 6d). For the Sanger alternate condition screens we used the Sanger clone of HT-29, and for its day 14 KY screen we used the Sanger's original HT-29 screen. 27 Replication Experiments The replication screens at Broad and Sanger were performed using the normal current protocol of the respective institution 31 except with respect to the specifically noted changes to library (and the associated primer sequences required for post-screen amplification of the sgRNA barcodes) and the timepoint.

Data Availability
The data used for this manuscript have been posted to Figshare 41 .