Targeted and genome-wide sequencing reveal single nucleotide variations impacting specificity of Cas9 in human stem cells

CRISPR/Cas9 has demonstrated a high-efficiency in site-specific gene targeting. However, potential off-target effects of the Cas9 nuclease represent a major safety concern for any therapeutic application. Here, we knock out the Tafazzin gene by CRISPR/Cas9 in human-induced pluripotent stem cells with 54% efficiency. We combine whole-genome sequencing and deep-targeted sequencing to characterise the off-target effects of Cas9 editing. Whole-genome sequencing of Cas9-modified hiPSC clones detects neither gross genomic alterations nor elevated mutation rates. Deep sequencing of in silico predicted off-target sites in a population of Cas9-treated cells further confirms high specificity of Cas9. However, we identify a single high-efficiency off-target site that is generated by a common germline single-nucleotide variant (SNV) in our experiment. Based on in silico analysis, we estimate a likelihood of SNVs creating off-target sites in a human genome to be ~1.5–8.5%, depending on the genome and site-selection method, but also note that mutations might be generated at these sites only at low rates and may not have functional consequences. Our study demonstrates the feasibility of highly specific clonal ex vivo gene editing using CRISPR/Cas9 and highlights the value of whole-genome sequencing before personalised CRISPR design.


Supplementary Figure 1| Experimental design
Cas9 was inserted in the genome of hiPSCs through PiggyBac transposon. Cells were transfected with TAZ-targeting gRNA and Cas9 induced by doxycycline. Single cells were sorted and expanded to colonies. Single-cell derived colonies were genotyped for on-target effect in TAZ gene. Whole-genome Illumina libraries were generated from cloned cells and sequenced. The control sample was transfected with a gRNA that has a minimum of 4 mismatches against the human genome reference. Samples were compared to the control clone to filter out preexisting mutations. We do not observe any sequence similarity to TAZ on-target site in the flanking regions (±100bp) of 2bp deletion detected in clone TAZ4 (chr2: 24260743; CGAC). Strict linear relations govern reduction of Cas9 specificity due to genome variation. All Cas9 target sites with no off-targets closer than three mismatches were identified in the human exome. Shown is the fraction of target sites for which SNVs, identified from whole-genome sequencing, converted a three-mismatch off-target to a two-mismatch off-target site, where target sites are binned by number of three-mismatch off-target sites. TAZ-knockout experiments described in the main text were conducted using the PGP1 cell line, which is of European ancestry. For comparison, this analysis was repeated using an SNV list obtained from an individual of African ancestry (NA19240). As expected from the elevated variation rate in African genomes, the NA19240 genome exhibits a higher rate of loss of Cas9 specificity as seen in the larger regression slope. The sequences of off-target site are listed in Table S1. The negative control data was obtained from deep sequencing of same cell line without Cas9 induction and gRNA transfection. The frequency of 0.17% observed in the negative control seems to be largely contributed by sequencing errors 2 . Thus, the indel frequencies observed for all off-target are adjusted by subtracting the background indel rate of 0.17%. Based on this, we estimated the average offtarget effect at the 32 three-mismatch sites to be ~ 0.15%. Of note, it was repeatedly demonstrated that the level of Cas9 activity at potential off-target sites is reversely correlated to the number of mismatches and Cas9 has been shown to not cut sites with >3 mismatches 3,4 . Therefore we do not expect significant Cas9 effects at off-target sites with >3 mismatches.

Site
Total reads # Indel reads # Indel frequency TAZ-target site (negative control)

Supplementary Table 3| Primer sequences with Illumina adaptors (lower case) used for deep sequencing
We followed the protocol as described before by Yang et al. 2 .

Supplementary Note 1 | Impact of SNVs on targets chosen by a Cas9 site specificity algorithm
Cas9 targets are frequently selected using software that attempts to enforce specificity constraints, frequently by filtering out candidate targets for which there are similar sequences in a genome that have potential to function as Cas9 off-targets [5][6][7][8][9] . There is reason to expect that sites chosen by these algorithms will be less susceptible to acquisition of active off-targets though the presence of SNVs. As seen in Figure 3a of our article, if a target has n similar sequences in the genome at a given Hamming distance, the chance that a SNV will convert one of these sequences to one with a smaller Hamming distance (and therefore to a more likely active Cas9 off-target) is proportional to n. By their filtering, Cas9 specificity algorithms will likely identify targets with smaller values of n than targets chosen without the algorithms, and thus smaller probabilities of a site acquiring a closer off-target through a SNV. However, it is also possible that Cas9 site selection algorithms bias sites towards sequences with base compositions that have different propensities to harbor SNVs. We explored these hypotheses using CasFinder 9 , an algorithm in use in our laboratory, as a representative Cas9 target selection algorithm.
We started with a whole human exome database of 927104 Cas9 target sites (896658 unique) previously generated using CasFinder 9 . Because CasFinder's site evaluation differs from the specificity conditions of the TAZ gene Cas9 target, these sites were re-evaluated using bowtie 10 to find a subset meeting comparable conditions and to identify potential Cas9 off-targets of sites in this subset. (We note that the TAZ site tested in the article was not chosen by CasFinder and, indeed, failed its specificity tests.) Specifically: (i) In a version of the method applied in ref. 11 , we ran bowtie with the -a and -v 3 options to query for all 22bp sequences in the genome that differed from any of CasFinder sites by up to three places excluding consideration of the N position of the NGG PAM. This was effected by issuing four 22bp bowtie queries per CasFinder site comprising the first 19bp of the CasFinder 20bp targeting sequences plus each of the four PAM sequences AGG, CGG, GGG, and TGG.
For brevity we refer to the sequences returned by bowtie for each of the four queries per site as 'hits' in the genome for that site. Nineteen bp targeting sequences were used in the queries because the TAZ gene gRNA only contained a 19bp vs. 20bp targeting sequence. The bowtie options and multiple queries ensured that all sequences in the genome with up to 3 mismatches from the 19bp targeting sequence and followed by either an AGG, CGG, GGG, or TGG would be returned as hits for each CasFinder site. Some hits could be returned by multiple queries, and such duplicates were ignored. The set of hits for each site thus includes the space of 3bp mismatch off-targets that could be changed to <3bp mismatch off-targets by a SNV, matching the conditions observed for the TAZ gene gRNA, but additional relevant hits could also be returned such as those in which mismatches were found in the GG positions of the PAM. (ii) We then identified those CasFinder sites that lacked 2 bp mismatch off-targets in the genome, matching the specificity characteristics of the TAZ gRNA site. To do this, we used custom Perl code to classify each hit for a site as either an "excluded off-target" or a "non-excluded off-target", excepting the hit corresponding to the exact match with the site, which was ignored in all follow-on processing. Hits were considered 'excluded' off-targets if they contained a mismatch from the GG of the NGG PAM, or if the total number of mismatches to the 19bp targeting sequence in the query was 3. All other hits were considered 'nonexcluded'. The subset of 611589 sites for which all off-targets were "excluded" was considered the "accepted subset" that matched the specificity conditions of the TAZ gRNA, and this was used in subsequent processing. The hg19 human genome reference sequence downloaded from the UCSC Genome Browser web site 12 was used for both CasFinder and bowtie analysis.
Whole-genome genome sequence variation files in GFF format were prepared for the PGP1 and NA19240 genomes from data available from ref. 13 using data and software resources from ref. 14 . SNV and SUB variations were extracted from these files for further use, and indels and other entries were ignored. A second custom Perl program was then used to alter the sequences of all the hits for the accepted subset of CasFinder sites to incorporate the derived alleles of each SNV or SUB variation that had a non-empty intersection with the hit. For each such alteration, the program re-evaluated the altered hit sequence according to the criteria for exclusion described above to see if the adjusted hit sequences remained excluded off-targets.
Sites for which any excluded off-target became non-excluded were counted as instances where a genome variation could acquire a Cas9 off-target in the manner observed for the TAZ gRNA. These sites are denoted "converted" sites, and their rates are indicated in Figure 3a of the article. We found that 1.558% of all accepted CasFinder sites were converted by the PGP1 genome. For consistency with our analysis of exonic Cas9 sites that were not chosen by a Cas9 target selection algorithm (see Supplementary Note 2), we recalculated the conversion rate using only SNV variations, and only using excluded off-targets that had a total of 3 mismatches to the original site sequence (including PAM GG mismatches). This yielded a slightly reduced conversion rate of 1.541%. The same analyses were conducted with the NA19240 genome and yielded higher conversion rates of 1.981% and 1.961%, respectively, consistent with the elevated variation rate in genomes of African ancestry (see Supplementary Figure 7).
We then constructed a frequency distribution of counts of 3bp mismatch off-targets per sites for all accepted CasFinder sites, using the criteria of Supplementary Note 2 as noted above. This is depicted in Figure 3b in the article. Consistent with the hypothesis above, CasFinder-selected sites exhibit a distribution that is much more concentrated at lower values than the corresponding distribution constructed from the data of Supplementary Note 2. The mean  st. dev. of the complete distribution is 16.29  13.84; however, to be consistent with the analysis of the unbiased target set from Supplementary Note 2, we focused on the 611051 sites (99.9% of the full distribution) with 3bp mismatch off-target counts in the range 0-100, for which the mean  st. dev. was 16.17  12.56. A two-sided Z-test comparison of the mean number of off-targets for sites in in this range between the CasFinder and the unbiased sets yields Z = 1141.4 (P = 0, 2-sided, computed in Excel).
Finally, we computed the actual fractions of sites in each of off-target count bin that acquired 2bp mismatch off-targets for the PGP1 genome, and computed the regression in Figure 3a in Excel. The regression slope is .08%, much lower than the .1451% of the Cas9 sites of Supplementary Note 2. The R 2 of this regression is 0.4795, considerably lower than the .9975 of the unbiased Cas9 sites. Statistical noise due to the low numbers of sites with high numbers of off-targets likely accounts for much of this effect.
The fact that CasFinder dramatically reduces the number of potential off-target sequences per site compared to choosing sites without a selection algorithm is consistent with the hypothesis stated at in the introduction to this Supplementary Note that site selection algorithms can lower the probability of a site acquiring an off-target through a SNV simply by reducing the number similar sequences that could be converted by a SNV. However, the fact that CasFinder sites exhibit a lower regression slope compared to unbiased sites suggests that additional mechanisms may play a role in the reduction of SNV-mediated off-target acquisition, for off-target acquisition rates are lower even for sites with off-target counts in the range 0-30, in which the CasFinder set has a much larger fraction of sites than the unbiased set ( Figure  3b). This is consistent with the second hypothesis mentioned in the introduction, that site selection algorithms such as CasFinder may chose sites with a sequence composition that has a lower propensity to harbor SNVs, for instance, by biasing site selection to sequences with lower CpG frequencies that have elevated frequencies of SNVs 15 . We have not explored these possibilities at this time.
Finally, we note that, independent of the effect of algorithm-imposed specificity checks, there are additional reasons to expect that the conversion rates found by these analyses underestimate actual offtarget conversion rates. Specifically, the whole genome sequencing of PGP1 and NA19240 did not call 100% of these genomes, so the variation files for these genomes must also be considered incomplete. Also, as previously noted, indels in these genomes were not considered in the simulation.

Supplementary Note 2 | Impact of SNVs on targets using an unbiased set of gRNA sequences
For most applications, algorithms such as CasFinder would typically be employed to identify target sites with minimal predicted off-target effects. However, there are instances for which the required target site may have unavoidable off-target effects. As such, it would be important to understand the impact of single nucleotide variations in introducing additional off-targets across an unbiased set of Cas9 sites with varying levels of specificity.
First, using the RefSeq annotation of genes, the sequences of all coding exons were obtained from the RefFlat table in the UCSC Genome Browser database. 12 Next, custom python scripts were used to identify all unique, putative N19-NGG sites within these sequences which resulted in the identification of 1922668 unique sites. Subsequently, these sequences were aligned to the hg19 version of the human genome using the SeqMap algorithm 16 allowing for up to four mismatches and obtaining all possible alignments. Four mismatches were used in the alignment in order to filter the alignments for the wildcard base at position 20 (N before the GG). That is, if an off-target was called with four mismatches with one of the mismatches at position 20, it was considered an off-target with three mismatches. Thus, for each of the unique sites, the number of three mismatch off-target sites was determined. The distribution of numbers of such off-targets has a heavy right tail, and we focused mainly on the 1431909 sites with between 0 and 100 three mismatch off-target sites (74.5% of the full set). The distribution of sites with numbers of off-targets in this range is depicted in Figure 3b. The mean  std. dev. of the number of three mismatch off-targets per site in this set was 46.84  25.77 (compared to 124.31  2399.57 for the full set). Using only single nucleotide variations from either PGP1 or NA19240, we then calculated the percentage of Cas9 sites that had at least one three mismatch off-target become a two mismatch off-target due to a single nucleotide variation in the respective genome. Using the 101 data points, a linear regression was performed using Microsoft Excel where a slope and R 2 value were determined. A very strong linear relationship was observed between the number of three mismatch off-targets and the percentage of Cas9 sites in a respective bin being affected by a single nucleotide variation. The slope of the linear relationship varies depending on the genome, likely due to the different rates of SNVs in individuals.