## Introduction

Genetic screens using pooled CRISPR/Cas9 libraries are functional genomics tools that are becoming increasingly popular throughout the life sciences. Using these screens, researchers are able to find novel molecular mechanisms, better understand complex cellular systems, or find new drug targets1,2,3. Analysis of the raw sequencing output of these screens is a non-trivial task, given the size and diversity of these datasets. MAGeCK4 was the first bioinformatic workflow specifically aimed at analysis of CRISPR/Cas9 screens and has been used in numerous genome-wide screening studies since. However, while being a standard solution in the field, MAGeCK has to be run from a command line, and requires manual definition of the position of the 20 bp single guide RNA (sgRNA) sequence for proper read identification. It thus requires familiarity with working from a command line, basic handling of raw fastq files, as well as knowledge of the read sequence composition, all of which might constitute obstacles to many biologists. BAGEL5 is an alternative command-line workflow, powerful in analyzing gene essentiality screens, but inapplicable to other screening experiments, such as resistance screens or reporter-based screens. caRpools6 is an R package offering multiple analysis options, but installation and execution require proficiency with the R platform. Finally, the recently developed CRISPRcloud7 is an analysis workflow that runs as a web-service, thus offering superior ease of use. However, CRISPRcloud requires manual definition of a fixed 20 bp window (identical for each sample) to extract the sgRNA sequence from each read file. This makes the workflow incompatible with cases where the position of the sgRNA sequence varies between reads, for example if read staggering is used.
Read staggering is a common technique, recommended by many sequencing facilities to increase sequencing yield, as it mitigates the low base diversity problems in the initial sequencing cycles of PCR amplicon libraries like the ones generated in CRISPR/Cas9 screens8,9,10 (Fig. 4A). As a consequence, sequencing analysis of CRISPR/Cas9 screens remains challenging for most laboratories, since prior solutions are either hard to access for non-bioinformatic experts, or might not provide sufficient support for the datasets generated. The web-service we present, PinAPL-Py, addresses these issues by providing a comprehensive analysis workflow running through an intuitive web-interface that supports a broad class of CRISPR/Cas9 screening experiments as well as staggered reads. This facilitates standardized, reproducible data analysis that can be carried out directly by the scientists conducting the experiments.

## Methods

### Workflow description

PinAPL-Py is designed to analyze a generic layout of CRISPR/Cas9 screening experiments where a control group is compared to one or more conditions which can, for example, refer to cells exposed to a chemical compound, samples taken at different time points, or cells expressing a fluorescent reporter and sorted by FACS (Fig. 1A). PinAPL-Py is written in Python 2.7 and implements well-established methods to provide a fully automated workflow, taking the user from the raw sequence files to a list of candidate sgRNAs and genes while requiring only minimal manual input. In particular, the workflow comprises the following steps (Fig. 1B):

#### Sequence quality check

Sequence quality assessment is carried out using fastqc (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), which analyzes sequence composition, sequencing quality, and read depth.

Read alignment is carried out using Bowtie2 12 in the “local” mode. Bowtie2's alignment parameters (seed length (-L), seed number (-N), and seed interval function (-i)) are set to values producing optimal results for 20 bp sgRNA sequences, but can be changed on the configuration page. A technical description of these parameters is available in the Bowtie2 manual (http://computing.bio.cam.ac.uk/local/doc/bowtie2.html#what-is-bowtie-2). After completion, alignments are processed based on the best (=primary) and second-best (=secondary) alignment scores achieved for each read (referred to as ‘AS’ and ‘XS’, respectively, in the Bowtie2 SAM output), and categorized as either “unique”, “tolerated”, “ambiguous”, or “failed”. An alignment is considered failed if either no primary score is reported by Bowtie2, or if the primary score reported is below the critical matching threshold (set to perfect matching in the default setting, but adjustable on the configuration page). An alignment is considered ambiguous if the difference between the primary and secondary alignment scores is below a critical ambiguity threshold (adjustable on the configuration page). If this difference is above the threshold, the read is mapped to the primary match and classified as “tolerated”. If a secondary score is not reported (and the primary score satisfies the matching threshold), the alignment is considered unique. Only unique and tolerated reads are kept for further processing, while failed and ambiguous reads are discarded.
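
The classification rules above can be illustrated with a minimal Python 3 sketch. It is not taken from the PinAPL-Py source: the function names `parse_scores` and `classify_alignment` are hypothetical, the default thresholds (40 and 2) are the values used in the benchmarking runs, and treating a score difference exactly equal to the ambiguity threshold as “ambiguous” is an assumption, since the text does not specify that case.

```python
def parse_scores(sam_line):
    """Extract the AS (primary) and XS (secondary) scores from one SAM record.

    Returns (AS, XS); either value is None if the tag is absent.
    """
    fields = sam_line.rstrip("\n").split("\t")
    scores = {}
    # Optional SAM fields start at column 12 and look like TAG:TYPE:VALUE.
    for tag in fields[11:]:
        key, _type, value = tag.split(":", 2)
        if key in ("AS", "XS"):
            scores[key] = int(value)
    return scores.get("AS"), scores.get("XS")


def classify_alignment(primary, secondary,
                       match_threshold=40, ambiguity_threshold=2):
    """Assign one of 'failed', 'unique', 'tolerated', 'ambiguous'."""
    if primary is None or primary < match_threshold:
        return "failed"
    if secondary is None:
        return "unique"
    # Assumption: a difference equal to the threshold counts as ambiguous.
    if primary - secondary > ambiguity_threshold:
        return "tolerated"
    return "ambiguous"


def keep_read(category):
    """Only unique and tolerated reads are kept for counting."""
    return category in ("unique", "tolerated")
```

For example, a read with AS = 40 and XS = 35 would be classified as “tolerated” and kept, whereas a read with AS = 40 and XS = 39 would be classified as “ambiguous” and discarded.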

#### Analysis of read count distributions

For each sample, descriptive statistics of normalized read counts are computed and reported on the results page as boxplots, histograms and various measures, such as median, standard deviation, maximum, minimum, sgRNA/gene representation, and the Gini coefficient.
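
As an illustration of these summary measures, the following Python 3 sketch computes a few of them for one sample. The function names are hypothetical, and the Gini coefficient is computed with the standard rank-based formula, which may differ in detail from the implementation used by PinAPL-Py.

```python
import statistics


def gini(counts):
    """Gini coefficient of a list of non-negative read counts.

    0 means all sgRNAs have equal counts; values near 1 mean that
    a few sgRNAs account for most reads.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted counts (ranks start at 1).
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def summarize(counts):
    """Descriptive statistics for the normalized counts of one sample."""
    return {
        "median": statistics.median(counts),
        "sd": statistics.stdev(counts),
        "max": max(counts),
        "min": min(counts),
        "gini": gini(counts),
    }
```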

In order to generate a candidate list of individual sgRNAs showing either the most significant enrichment or depletion (compared to the control samples), PinAPL-Py models normalized read counts as following a negative binomial distribution14, similarly to other read count analysis workflows4,15,16. For this, the average (normalized) read count $$\mu_i$$ and sample variance $$\sigma_i^2$$ of each sgRNA $$i$$ are computed across all control replicates. The dispersion parameter $$D$$ of the negative binomial distribution is then estimated via linear regression, since $$\sigma^2 = \mu + D\mu^2$$ holds in a negative binomial model. With this estimated dispersion, a model variance $$\hat{\sigma}_i^2$$ is computed for each sgRNA via $$\hat{\sigma}_i^2 = \mu_i + D\mu_i^2$$. Thus, the read count of each sgRNA $$i$$, under the control condition, is modeled to follow a negative binomial distribution $$\mathrm{NB}(\mu_i, \hat{\sigma}_i^2)$$.
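
One way to carry out this estimate is sketched below in Python 3. The exact regression used by PinAPL-Py is not specified here, so the sketch assumes an ordinary least-squares fit through the origin of the excess variance (sample variance minus mean) against the squared mean. The function names and the clamping of negative estimates to zero are likewise assumptions.

```python
import statistics


def estimate_dispersion(control_counts):
    """Estimate D in var = mu + D * mu**2.

    control_counts: list with one entry per sgRNA, each entry being the
    list of normalized counts of that sgRNA across control replicates
    (at least two replicates are needed for a sample variance).
    """
    means = [statistics.mean(c) for c in control_counts]
    variances = [statistics.variance(c) for c in control_counts]
    x = [m * m for m in means]                      # mu^2
    y = [v - m for v, m in zip(variances, means)]   # sigma^2 - mu
    denom = sum(xi * xi for xi in x)
    if denom == 0:
        return 0.0
    d = sum(xi * yi for xi, yi in zip(x, y)) / denom
    # Assumption: a negative estimate is clamped to zero.
    return max(d, 0.0)


def model_variance(mu, d):
    """Model variance of one sgRNA given its control mean and D."""
    return mu + d * mu * mu
```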

#### sgRNA enrichment/depletion analysis

sgRNA counts from each treatment sample are evaluated on the basis of the control read count distribution, estimated in the previous step. In particular, for each normalized read count $$c_i$$ from a treatment sample, the fold-change to the control mean is taken, and a p-value is computed from $$\mathrm{NB}(\mu_i, \hat{\sigma}_i^2)$$ to test either for $$c_i > \mu_i$$ (in an enrichment screen) or $$c_i < \mu_i$$ (in a depletion screen). Finally, these p-values are adjusted for multiple hypothesis tests using the Benjamini-Hochberg FDR procedure. Note that for the computation of p-values, PinAPL-Py requires the user to specify at least two control samples (only fold-change is reported otherwise). The results are returned to the user as sortable tables on the results page, and as spreadsheets after download. In addition, PinAPL-Py produces scatterplots of normalized counts for each treatment sample versus the control averages on which sgRNAs of individual genes can be interactively highlighted on the results page. Results of the sgRNA enrichment/depletion analysis are reported for each treatment sample separately, and correlation analysis is carried out for samples representing replicates of the same treatment.
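
The test and the multiple-testing correction can be sketched in Python 3 using only the standard library. This is an illustration under stated assumptions rather than the PinAPL-Py implementation: normalized counts are rounded to integers, the one-sided p-values include the observed count (P(X ≥ c) for enrichment, P(X ≤ c) for depletion), and the function names are hypothetical. The negative binomial is parametrized by its mean and variance, which requires the variance to exceed the mean.

```python
import math


def nb_pmf(k, mu, var):
    """Negative binomial probability of count k, given mean and variance."""
    if var <= mu or mu <= 0:
        raise ValueError("requires var > mu > 0")
    r = mu * mu / (var - mu)   # size parameter
    p = mu / var               # success probability
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


def nb_pvalue(count, mu, var, screen="enrichment"):
    """One-sided p-value of a treatment count under the control model."""
    k = int(round(count))  # assumption: counts are rounded to integers
    lower = sum(nb_pmf(j, mu, var) for j in range(k + 1))  # P(X <= k)
    if screen == "depletion":
        return min(lower, 1.0)
    upper = 1.0 - lower + nb_pmf(k, mu, var)               # P(X >= k)
    return min(max(upper, 0.0), 1.0)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Summing the probability mass function directly is adequate for a sketch but loses precision for very small upper-tail p-values; a production implementation would typically rely on a dedicated survival function.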

#### Sample clustering

Unsupervised clustering analysis of all samples is done based on either the most variable or most abundant/depleted (depending on screen type) sgRNAs throughout the dataset. For most variable sgRNAs, the n sgRNAs with the highest variance across all samples are extracted (n = 25 by default, adjustable on the configuration page). For highest abundance/depletion, the n sgRNAs with the highest/lowest read counts are extracted for each sample, and the union set of all these sgRNAs is used for the clustering. Clustering is done through the gplots package in R.
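
The two sgRNA selection strategies that precede the clustering can be sketched in Python 3 as follows. The clustering itself (done with gplots in R) is not reproduced. The function names and the input layout (a dictionary mapping each sgRNA to its normalized counts, one value per sample) are assumptions made for illustration.

```python
import statistics


def most_variable(counts, n=25):
    """Return the n sgRNAs with the highest variance across all samples."""
    ranked = sorted(counts,
                    key=lambda sg: statistics.pvariance(counts[sg]),
                    reverse=True)
    return ranked[:n]


def top_per_sample_union(counts, n=25, screen="enrichment"):
    """Union over samples of the n most abundant (or most depleted) sgRNAs."""
    n_samples = len(next(iter(counts.values())))
    selected = set()
    for j in range(n_samples):
        # Highest counts first for enrichment, lowest first for depletion.
        ranked = sorted(counts,
                        key=lambda sg: counts[sg][j],
                        reverse=(screen == "enrichment"))
        selected.update(ranked[:n])
    return selected
```

Because the second strategy takes a union across samples, the resulting set can contain more than n sgRNAs.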

#### Gene ranking

From the sgRNA enrichment/depletion analysis described above, a gene ranking is derived by combining the fold-change data from all sgRNAs targeting a single gene through one of two possible methods:

• Adjusted Robust Rank Aggregation (αRRA): This option follows the αRRA procedure described previously4. Individual sgRNAs are required to show enrichment/depletion at a significance level of $$P_0$$ = 0.05−1 (by default, adjustable on the configuration page) in order to be taken into account for computation of the gene ranking score. Genes having no significant sgRNAs are given a score of 1.0; p-values for the gene ranking score are computed based on randomly assigning sgRNAs to target genes and running permutations ($$N_p$$ = 1,000 by default) to estimate the empirical distribution function.

• STARS: This option calls the STARS function to compute gene ranking scores and p-values based on a binomial model (http://portals.broadinstitute.org/gpp/public/software/index). Technical details are provided in the original publication17.
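
To give an intuition for the αRRA option, the following Python 3 sketch (requires Python 3.8 or later for `math.comb`) computes a gene score from the rank percentiles of its sgRNAs and estimates a p-value by permutation. It is a simplified illustration, not the PinAPL-Py or MAGeCK code. It assumes that each sgRNA has a percentile (rank divided by library size, derived from its p-value) and a flag stating whether it passed the significance level, so that significant sgRNAs have the lowest percentiles. The function names and the add-one correction in the permutation p-value are also assumptions.

```python
import math
import random


def rho_score(percentiles, n_significant):
    """αRRA-style score of one gene (lower is more significant).

    percentiles: rank percentiles (0..1) of all sgRNAs targeting the gene.
    n_significant: number of those sgRNAs passing the significance level.
    """
    if n_significant == 0:
        return 1.0  # genes without significant sgRNAs get a score of 1.0
    u = sorted(percentiles)
    n = len(u)
    best = 1.0
    for k in range(1, n_significant + 1):
        # Probability that at least k of n uniform values fall below u[k-1].
        tail = sum(math.comb(n, j) * u[k - 1] ** j * (1 - u[k - 1]) ** (n - j)
                   for j in range(k, n + 1))
        best = min(best, tail)
    return best


def permutation_pvalue(observed, n_guides, library, n_perm=1000, seed=0):
    """Empirical p-value from randomly assigning sgRNAs to a gene.

    library: list of (percentile, is_significant) tuples, one per sgRNA.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        draw = rng.sample(library, n_guides)
        rho = rho_score([u for u, _ in draw],
                        sum(1 for _, sig in draw if sig))
        if rho <= observed:
            hits += 1
    # Assumption: add-one correction to avoid p-values of exactly zero.
    return (hits + 1.0) / (n_perm + 1.0)
```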

### Benchmarking & Quality control

#### Sequencing data

Datasets for comparisons with MAGeCK and CRISPRcloud were taken from CRISPR/Cas9 screening experiments in our laboratory (unpublished). In these, an A375 cell population was transduced with the full human GeCKOv2 library at ~100x coverage and split into treatment and control samples. Treatment samples were incubated with a toxin at ~90% lethal dose. Control samples were incubated with PBS. After 4 days, cells were harvested, sgRNA sequences were extracted with nested PCR10, and sequenced at a minimum depth of 4 million reads per sample.

#### Sequencing data analysis

PinAPL-Py was run with gene ranking set to “αRRA” with 1,000 permutations. Matching threshold was set to 40 (perfect match) with ambiguity threshold set to 2 (Fig. 2). Read count normalization was set to “total”. All other parameters were left at default. To generate the ranked list of top 25 genes (Fig. 3E), the gene list was sorted on two levels: First by αRRA FDR (low to high), and second by −log αRRA score (high to low).
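
The two-level sorting used to obtain the top 25 genes can be expressed compactly in Python 3. The record layout and key names (`fdr`, `neg_log_score`) in this sketch are hypothetical and do not correspond to the column names of the PinAPL-Py output files.

```python
def rank_genes(genes, top=25):
    """Sort by FDR (low to high), then by -log score (high to low).

    genes: list of dicts with the keys 'gene', 'fdr' and 'neg_log_score'.
    """
    ordered = sorted(genes, key=lambda g: (g["fdr"], -g["neg_log_score"]))
    return ordered[:top]
```

Sorting on a tuple key applies the second criterion only among genes that tie on the first, which matters here because many top-ranked genes can share the same FDR value.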

MAGeCK v0.5.6 was run following the instructions on sourceforge (https://sourceforge.net/p/mageck/wiki/Home/#usage). Normalization was set to “total”. To generate the ranked list of top 25 genes (Fig. 3E), the list was sorted on two levels: first by FDR (“pos|fdr”, low to high), and second by −log αRRA score (“pos|score”, high to low).

CRISPRcloud (http://crispr.nrihub.org/) was run in the “Survival and Drop-out” mode, with analysis algorithm set to “DESeq2” (since this option is most comparable to the negative binomial model implemented in PinAPL-Py and MAGeCK). To generate the ranked list of top 25 genes (Fig. 3E), the gene list was sorted on three levels: 1: FDR (low to high), 2: hit ratio (HR, high to low), and 3: average sgRNA log10 fold-change (AVGFC, high to low). Venn diagrams were generated at http://bioinformatics.psb.ugent.be/webtools/Venn/. For the comparisons between tools in Fig. 3, a dataset without staggers in the reads was used.

### Availability of data and materials

The raw data read files are available in the NCBI Sequence Read Archive (SRA: SRP123360; BioProject: PRJNA416669). The source code is available at https://github.com/LewisLabUCSD/PinAPL-Py.

## Results and Discussion

### Comparison to other tools

We compared PinAPL-Py to MAGeCK and CRISPRcloud. A dataset from an unpublished CRISPR/Cas9 toxin resistance screen was used. Knock-out of the heparan-binding epidermal growth factor (HBEGF) is expected to confer resistance to the toxin used in our dataset, making this gene a positive control for the comparison. A comparison to caRpools was not possible since the tool could not be run successfully and support was no longer available.