BATCH-GE: Batch analysis of Next-Generation Sequencing data for genome editing assessment

Targeted mutagenesis by the CRISPR/Cas9 system is currently revolutionizing genetics. The ease of this technique has enabled genome engineering in vitro and in a range of model organisms and has pushed experimental dimensions to unprecedented proportions. Owing to its tremendous progress in speed, read length, throughput and cost, Next-Generation Sequencing (NGS) is increasingly used for the analysis of CRISPR/Cas9 genome editing experiments. However, the current tools for genome editing assessment lack flexibility and fall short in the analysis of large amounts of NGS data. Therefore, we designed BATCH-GE, an easy-to-use bioinformatics tool for batch analysis of NGS-generated genome editing data, available from https://github.com/WouterSteyaert/BATCH-GE.git. BATCH-GE detects and reports indel mutations and other precise genome editing events and calculates the corresponding mutagenesis efficiencies for a large number of samples in parallel. Furthermore, the tool provides flexibility by allowing the user to adapt a number of input variables. The performance of BATCH-GE was evaluated in two genome editing experiments, aiming to generate knock-out and knock-in zebrafish mutants. This tool will not only contribute to the evaluation of CRISPR/Cas9-based experiments, but will be of use in any genome editing experiment and can analyze data from any organism with a sequenced genome.


Validation performance of BATCH-GE
To demonstrate the potential of BATCH-GE, two genome editing experiments were conducted in zebrafish. First, we aimed to determine optimal experimental conditions for sgRNA efficiency testing in zebrafish embryos. For this purpose, sgRNAs targeting five different zebrafish genes (slc2a10, pls3, tapt1a, myt1la, tprkb) were designed and produced according to an in-house developed workflow using synthetic dsDNA templates (Supplementary Fig. S3). For these assays, both the optimal ratio of sgRNA to Cas9 and the most relevant developmental time point for indel efficiency analysis were determined. This was achieved by injecting four different combinations of sgRNA (10 and 25 pg) and Cas9 (100 and 250 pg) quantities into one-cell stage zebrafish embryos, followed by DNA extraction from a pool of injected embryos at 1, 2, 3 and 4 days post fertilization (dpf). Genome editing assessment using BATCH-GE revealed that the combinations of 10 pg sgRNA + 250 pg Cas9 and 25 pg sgRNA + 250 pg Cas9 resulted in the highest indel frequencies, suggesting that the Cas9 quantity is the determining factor when aiming to achieve high indel rates (Supplementary Fig. S2). These results correspond to earlier findings in other organisms, showing a positive correlation between Cas9 quantities and indel efficiency 1-4. Furthermore, the data show that genome editing analysis of DNA extracted at 1 dpf gives a reliable estimate of the indel efficiency at later stages of zebrafish development 5,6.
In a second experiment, we aimed to introduce specific base pair alterations in the zebrafish tprkb gene. First, we validated and further optimized a protocol by Irion et al. (2014) 7, describing a strategy for CRISPR/Cas9-mediated precise genome editing by HDR in zebrafish. This strategy involves a circular HDR template, which was produced according to an in-house developed workflow using synthetic dsDNA templates (Supplementary Fig. S3) and was co-injected with Cas9 nuclease and an sgRNA that was previously shown to be highly efficient in targeting the zebrafish tprkb gene (Supplementary Table S3, design 2). After injection into the embryo, the circular template is cut at two sgRNA target sequences (+PAM) that flank a sequence homologous to the genomic target site, thereby providing a linearized HDR template 7. In a second approach, short linear single-stranded oligodeoxynucleotides (ssODN) were screened for their suitability as HDR template 8,9. Four ssODNs, either sense or antisense relative to the sgRNA sequence and with 30 or 60 bp homology arms flanking the theoretical CRISPR/Cas9 cut site, were designed (Supplementary Fig. S3). For both types of HDR templates, the highest amount that did not cause any toxic effects (plasmid: 100 pg, ssODN: 50-100 pg) was injected into one-cell stage zebrafish embryos together with 25 pg sgRNA and 250 pg Cas9. DNA was extracted at 1 dpf and analysed using NGS, followed by genome editing assessment using BATCH-GE. For precise genome editing analysis, BATCH-GE requires the specification of a repair template in the Experiment.csv file (Supplementary Fig. S1). Brackets were placed around the 5 or 6 intended base pair alterations. Depending on whether square or round brackets are placed around these base pair substitutions, BATCH-GE is able to distinguish between the occurrence of a 'full' or a 'partial' repair.
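As an illustration only, the bracket convention can be sketched in a few lines of Python. The sketch assumes that square brackets mark alterations that must be present and round brackets mark optional ones, that each read has already been aligned and trimmed to the template span, and that only substitutions occur. The function names, return labels and example sequences are hypothetical; they do not reproduce the BATCH-GE implementation or the exact Experiment.csv syntax.

```python
def parse_template(template):
    """Strip brackets from a repair template and record bracketed positions.

    Assumption: '[...]' marks required alterations, '(...)' optional ones.
    Returns (plain_sequence, required_positions, optional_positions).
    """
    bases = []
    required, optional = [], []
    current = None  # list that receives positions while inside a bracket
    for ch in template:
        if ch == "[":
            current = required
        elif ch == "(":
            current = optional
        elif ch in "])":
            current = None
        else:
            if current is not None:
                current.append(len(bases))
            bases.append(ch.upper())
    return "".join(bases), required, optional


def classify_read(read, template):
    """Classify a pre-aligned read as 'full', 'partial' or 'no HDR'.

    Assumption: the read covers exactly the template span (same length).
    Only the bracketed positions are inspected.
    """
    seq, required, optional = parse_template(template)
    read = read.upper()
    if len(read) != len(seq):
        return "unclassified"
    has_required = all(read[i] == seq[i] for i in required)
    has_optional = all(read[i] == seq[i] for i in optional)
    if not required or not has_required:
        return "no HDR"
    return "full" if has_optional else "partial"


# Hypothetical 9-bp template: one required and one optional substitution.
template = "ACG[T]AC(G)TA"
for read in ("ACGTACGTA", "ACGTACATA", "ACGCACGTA"):
    print(read, classify_read(read, template))
```

Tallying the returned labels over all reads of a sample and dividing by the total read count would give per-sample full and partial repair efficiencies under these assumptions.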
Square brackets indicate the base pair alterations that need to be introduced in the zebrafish genome, while round brackets indicate alterations that do not necessarily need to be introduced in the genome, for instance base pair alterations that are used for codon optimization of the template. Reads that only contain the necessary alterations and reads that contain all the indicated base pair alterations are classified and counted as partial and full HDR events, respectively. In general, three conclusions can be drawn from the BATCH-GE output (Supplementary Table S1). First, as already shown by Irion et al. (2014) 7, HDR efficiencies are relatively low when using the described circular templates. Secondly, the use of ssODN repair templates leads to improved total repair efficiencies, similar to those described earlier 8,9, especially when using templates with 60 bp homology arms.
Thirdly, no difference in HDR efficiency could be detected between sense and antisense ssODN HDR template molecules.

Tables

Supplementary Table 1

Guideline: Designs with unique 20-mer and 12-mer seed sequence.
Rationale: Will aid in avoiding off-target effects. 1,2,10-14

Guideline: Select designs at the 5' end of the gene.
Rationale: The aim is to terminate translation as early as possible and to induce the nonsense-mediated mRNA decay pathway. 15

Guideline: Avoid selecting designs in exon 1.
Rationale: A possible alternative first exon use will be avoided. In addition, first exons generally have a higher methylation rate, possibly interfering with sgRNA binding. 16,17

Guideline: Select the designs with the highest GC percentage, but not higher than 80%.
Rationale: High GC percentages were shown to result in higher indel rates. 15,18

Guideline: Preferably select designs with G but not A directly upstream of PAM.
Rationale: Accounting for the base pair directly upstream of PAM was shown to lead to higher indel rates. 18

Guideline: Preferably select target sites in different exons and protein domains.
Rationale: By selecting target sites in different regions of the gene, the chance of introducing indel mutations in crucial regions of the gene is increased (in case this is not completely known).
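
Two of the guidelines above, the GC percentage ceiling and the preference for the base directly upstream of the PAM, lend themselves to a simple automated filter. The following Python sketch is a hypothetical illustration, not part of BATCH-GE or of the design workflow used here. It assumes that each candidate is given as a protospacer written 5' to 3' without the PAM, so that its last base lies directly upstream of the PAM, and that candidates are ranked by GC percentage first with the PAM-proximal base as a tie-breaker; this ranking order is an assumption. Uniqueness of the seed sequence and exon position are not covered.

```python
def gc_percent(seq):
    """Return the GC percentage of a nucleotide sequence."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def pam_proximal_rank(seq):
    """Rank the base directly upstream of the PAM: G best, A worst.

    Assumption: seq is the protospacer 5'->3' without the PAM,
    so its last base is the PAM-proximal base.
    """
    return {"G": 0, "A": 2}.get(seq.upper()[-1], 1)


def rank_designs(designs, max_gc=80.0):
    """Drop designs above max_gc, then sort by GC (high first) and PAM-proximal base."""
    kept = [d for d in designs if gc_percent(d) <= max_gc]
    return sorted(kept, key=lambda d: (-gc_percent(d), pam_proximal_rank(d)))


# Hypothetical candidate protospacers (not real zebrafish target sites).
candidates = [
    "ATGCATGCATGCATGCATGA",  # 45% GC, A upstream of PAM
    "ATGCATGCATGCATGCGCTA",  # 50% GC, A upstream of PAM
    "ATGCATGCATGCATGCATCG",  # 50% GC, G upstream of PAM
    "GGCCGGCCGGCCGGCCGGCC",  # 100% GC, exceeds the 80% ceiling
]
for design in rank_designs(candidates):
    print(design, gc_percent(design))
```

The 80% ceiling is exposed as a parameter so that a stricter or looser threshold can be tried without changing the ranking logic.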