Development and performance of a targeted whole exome sequencing enrichment kit for the dog (Canis Familiaris Build 3.1)

Whole exome sequencing is a technique that aims to selectively sequence all exons of protein-coding genes. A canine whole exome sequencing enrichment kit was designed based on the latest canine reference genome (build 3.1.72). Its performance was tested by sequencing 2 exome captures, each consisting of 4 pre-capture pooled, barcoded Illumina libraries on an Illumina HiSeq 2500. At an average sequencing depth of 102x, 83 to 86% of the target regions were completely sequenced with a minimum coverage of five and 90% of the reads mapped on the target regions. Additionally, it is shown that the reproducibility within and between captures is high and that pooling four samples per capture is a valid option. Overall, we have demonstrated the strong performance of this WES enrichment kit and are confident it will be a valuable tool in future disease association studies.

Whole exome sequencing is a technique that aims to selectively sequence all exons of protein-coding genes. A canine whole exome sequencing enrichment kit was designed based on the latest canine reference genome (build 3.1.72). Its performance was tested by sequencing 2 exome captures, each consisting of 4 pre-capture pooled, barcoded Illumina libraries on an Illumina HiSeq 2500. At an average sequencing depth of 102x, 83 to 86% of the target regions were completely sequenced with a minimum coverage of five and 90% of the reads mapped on the target regions. Additionally, it is shown that the reproducibility within and between captures is high and that pooling four samples per capture is a valid option. Overall, we have demonstrated the strong performance of this WES enrichment kit and are confident it will be a valuable tool in future disease association studies.
S ince the first reported study on Whole Exome Sequencing (WES) in 2007 1 , well over 2000 papers have been published applying this technique (PubMed search: ''exome sequencing''). WES is a cost-efficient approach to selectively sequence the coding regions of the genome. This approach allows to identify most functional variation without the high costs associated with whole-genome sequencing (WGS). Unfortunately, predesigned and validated kits are only commercially available for human and mouse. Scientific reports of WES on other animals are scarce [2][3][4] . This is unfortunate, as for example the dog is an excellent animal model for comparative disease genetics 5 . To fill the gap, we designed a canine exome sequencing kit (based on build 3.1.72) and tested its performance using Illumina Sequencing.

Results
Design. We designed a canine whole exome sequencing enrichment kit based on the latest canine reference genome (Broad CanFam 3.1.72) 6 . This design was based on the combination of the Ensembl Genes, the RefSeq Genes and the mRNA annotation. Additionally, known microRNAs were added from miRBase. After merging overlapping regions, the total size of the design was 52,876,195 bp (<2% of the genome) divided over 203,059 regions. Based on our design, capturing baits were developed by Roche Nimblegen to target the specific regions. To avoid too much off-target sequencing, the most stringent setting was chosen for the baits design, allowing only unique matches from each bait to the reference genome.
Performance: coverage and specificity. To assess the performance of the WES enrichtment kit, two exome captures were done, each consisting of 4 pooled samples. Sequencing depth, coverage of targeted regions and targeted bases and specificity were assessed for every sample. The results are reported for a minimum coverage of one and five (as five was the threshold used for variant calling were retained ( Table 1). The average sequencing depth overall was 102x and ranged from 82.6x to 125.1x (Table 1).
Overall, an average of 92% of the regions were covered by at least one read and 90% by at least five reads. At a minimum coverage of one, 89 to 90% of the regions in our design, were completely covered. 83 to 86% of the regions were completely covered when a minimum coverage of five was applied ( Figure 1). A clear relationship exists between the percentage of each region being sequenced and the proportion of total regions being sequenced ( Figure 1). On average, a minimum coverage of 5 was not consistently reached throughout the entire region for 15% of the regions (Table 2). However, for only 8% of the regions on average, the maximum coverage never reached 5 ( Table 2).
When looking at the coverage of targeted bases instead of regions, 93 to 94% (<49 Mb) of the targeted bases (<53 Mb) were covered at least once and 89 to 91% were covered at least five times (Table 3). We also assessed the specificity (reads on target/total number of reads). With an average overall specificity of 90%, off-target sequen-cing is rather small and comparable with earlier reports 7 . The specificity was also assessed in every single sample per chromosome. For all eight samples, results were similar, with the highest specificity on average found on chromosome nine (94.05%) and the lowest specificity found on average on chromosome 22 (84.09%). Per chromosome specificity is available in Supplementary Table S1.
Performance: reproducibility. The reproducibility within and between captures was checked by comparing the amount of targeted bases and regions that are sequenced at least once and five times in every single sample. Overall, from the <53 Mb target base pairs, 48,141,464 base pairs (91.0%) were sequenced at least once in all eight samples. We also assessed how many base pairs were never sequenced. Overall, 2,313,892 base pairs (4.4%) were never covered. Overall, the remaining 4.6% of the total target base pairs are being sequenced variably. Comparing the 4 samples within each capture, we found that 48,333,432 (91.4%) or 48,663,244 (92.0%) base pairs were common and 2,816,548 (5.3%) or 2,553,759 (4.8%) base pairs Sample pooling. Pooling several samples together prior to capturing is common practice, mainly to reduce cost. Of course, pre-capture pooling should only be done when it does not significantly decrease the enrichment performance. To check the effect of pre-capture pooling, we created subsets containing 25% randomly chosen reads out of the total number of reads in the combined output of the 4 samples per capture. The rationale is to simulate samples as if they were not barcoded and as if the DNA strands presented to the capture baits are from one sample. A random subset of 25% needs to be taken to reduce the total number of reads to a number comparable to the number of reads in the individual barcoded samples. Per pool, ten subsets were created, resulting in a total of twenty new samples. Comparing the number of regions that are completely covered at least once in the subsets and the original samples, an average of 687 and 684 additional regions (<0.3%) were covered in the subsets of pool one and two, respectively. An average of 119,097 and 131,774 additional target base pairs were covered at least once in the respective subsets, which represents 0.2% of the <53 Mb design. At a minimum sequencing depth of 5, an average of 1,705 (0.8%) and 1,468 (0.7%) additional completely covered regions and an additional 174,070 (0.3%) and 164,522 (0.3%) base pairs were covered for pool one and two, respectively. The average specificity increased from 90.85% to 91.48% and from 89.86% to 90.73% in both pools, respectively. Based on these results, we conclude that precapture pooling of samples is a valid option as the effect on the different performance parameters is minimal. The exact number of samples that can be pooled, depends on a cost-benefit assessment.
Variant calling. Finally, we also called variants using a probabilistic variant caller. The number of non-reference variants called per sample, ranged from 55,683 to 60,576 and from 62,117 to 67,890 with the ''require presence in both forward and reverse reads'' setting being applied or not, respectively (Supplementary Table  S2). Applying this setting might exclude variants at the boundaries of the targeted regions, however it decreases the amount of erroneous variants 8 .

Discussion
This study is the first to report on the performance of an exome kit designed for the dog on the latest annotation (CanFam 3.1.72). With on average 90% of the regions and 90% of the bases covered five times, a high amount of the targeted regions is captured, without too much off-target sequencing as the specificity is 90%. The reproducibility within and between captures is high. Additionally, the results indicate that pooling four samples per capture is a valid option as it has only a very limited effect on the performance, but substantially reduces the costs. This makes WES even more affordable (compared with WGS). Finally, it is demonstrated that WES is capable to detect variants within the coding regions. Overall, we have demonstrated the strong performance of this WES enrichment kit and are confident it will be a valuable tool in future disease association studies.

Methods
Sample collection. Eight blood samples were obtained from a canine blood bank available at Ghent University to study genetic disorders 9 . Approval was granted by the local ethical (Faculty of Veterinary Medicine, Ghent University, Belgium) and deontological (Federal Public Service Health, Food Chain Safety and Environment, Brussels, Belgium) committees (EC2013_193).
Design. The data needed to design the exome kit was downloaded from the University Of California Santa Cruz (UCSC) (http://genome.ucsc.edu/) table browser (Dog, CanFam3.1) 10 . From the Genes and Gene prediction tracks, RefSeq Genes and Ensembl Genes were selected. The output format was a BED file with the setting ''exons (plus 0 bases at each end)''. From the mRNA and EST Tracks, Dog mRNAs and all_mrna were selected respectively. The output format was also a BED file with the ''blocks plus 0 bases at each end'' setting. Micro RNA sequence positions were downloaded from miRBase 11 . Regions were merged using bedtools version v2.17.0. The total size of the design was 52,876,195 Mb (<2% of the genome) divided over 203,059 regions. The BED file is available on request.
Roche Nimblegen WES enrichment kit. Our design was processed by the Roche Nimblegen custom design group (Madison, USA). Using an SSAHA algorithm, capturing baits were developed based on our design and the reference genome of the dog (Canis Familiaris 3.1). Design settings for the baits allowed five or fewer singlebase insertions, deletions or substitutions between the baits and the genome. Each bait itself was only allowed to match one location in the genome to avoid too much off target sequencing. Regions under 100 bp were padded to 100 bp to increase capturing efficiency. After approval, the baits were generated and provided as SeqCap Developer Library.   Sample preparation and sequencing. Extracted DNA was fragmented on a Covaris S2 System in a 50 ml volume (aim: 300 bp fragments, settings: duty cycle: 10%, intensity: 5, cycles per burst: 200, time: 50 s). After shearing, another picogreen assay was performed. Around one mg of the fragmented DNA was used as input for the library preparation. Samples were end repaired, A-tailed and ligated with TruSeq adapters using the reagents from the NEBNext DNA Library Prep Master mix set for Illumina (New England Biolabs) according to the manufacturer's protocol. AMPure XP beads (Beckman Coulter) were used for selection of fragments with an insert size around 300 bp. One ml of the ligated product was subsequently amplified in an enrichment PCR (10 cycles) for library quality assessment as recommended in the 'SeqCap EZ Library SR User's Guide' (Nimblegen, Roche). Thereafter, the precapture LM-PCR was performed on the samples for 8 cycles as prescribed in the SeqCap EZ library protocol. The concentration of each PCR product was determined using Quant-iT TM PicogreenH dsDNA Assay (Life Technologies). Two times four samples were equimolarly pooled to obtain a total DNA input of 1250 ng. The pooled library was hybridized for 67-68 hours with the baits (SeqCap Developer Library). The hybridized library was washed and the captured and pooled DNA was recovered. After a final amplification (LM-PCR, 18 cycles), the quality of the library was checked using the High Sensitivity DNA chip (Agilent).

QPCR.
To check the fold enrichment after capturing, a qPCR is performed as a final quality control step before sequencing. We chose to test five loci. Primer one is the standard primer provided by Roche Nimblegen (NSC-0237). The other four primers were designed using NCBI Primer-BLAST (http://www.ncbi.nlm.nih.gov/tools/ primer-blast/) 12 . Sequences are available in Supplementary Table S3. The amplification efficiency of each primer was determined by qPCR. One ml of the following template DNA quantities were added to each reaction: 20 ng, 10 ng, 5 ng, 2.5 ng and 1.25 ng. Each reaction was performed in triplicate. Efficiencies E were calculated with the following formula: E 5 10 (-1/slope of standard curve) and are mentioned in Supplementary Table S3. To assess the fold enrichment for both pools of four samples, a qPCR was performed according to the instructions from Roche Nimblegen. Fold enrichment was calculated using the following formula: (E delta-Ct ) with delta-Ct being the difference in threshold cycle between the library prior and post capturing. The average fold enrichment was well over the tenfold threshold suggested by Roche Nimblegen.
Sequencing. The two pools were sequenced on two different lanes in two different runs on a HiSeq 2500 PE 100 bp.
Data-analysis. Data-analysis was performed using the CLC Genomics Workbench (Version 6.5.1, CLC Bio, Aarhus, Denmark). Data was trimmed with the following settings: ambiguous trim 5 no, quality trim 5 yes, quality limit 5 0.05, use colorspace 5 no, create report 5 yes, also search on reversed sequence 5 yes, save discarded sequences 5 yes, remove 59 terminal nucleotides 5 no, discard short reads 5 no, remove 39 terminal nucleotides 5 no, trim adapter list 5 adapter list Illumina, discard long reads 5 no, save broken pairs 5 yes. The reference genome was downloaded from the UCSC genome browser 6 . For read mapping, the following parameters were used: mismatch cost 5 2, insertion and deletion cost 5 3, length fraction: 0.5, similarity fraction 5 0.8, global alignment 5 no, autodetect paired distances 5 yes, non-specific match handling 5 ignore, output mode 5 create reads track, create report 5 yes, collect un-mapped reads 5 yes. Duplicated reads were removed with the Duplicate Mapped Reads Removal (Version 1.0 beta 5) plugin (setting: maximum representation of minority sequence (percent) to 20.0). Reads were locally realigned with the following settings: realign unaligned ends 5 yes, multi-pass realignment 5 3, guidancevariant track 5 not set, output mode 5 create reads track, output track of realigned regions 5 yes. Variants were called twice using probability variant detection with the following settings: ignore non-specific matches 5 yes, ignore broken pairs 5 yes, minimum coverage 5 5, variant probability 5 90.0, required variant count 5 2, ignore variants in non-specific regions 5 yes, filter 454/Ion homopolymer indels 5 no, maximum expected variants 5 2, genetic code 5 1 standard, create track 5 yes, create annotated table 5 yes. The first variants were called with the ''require presence in both forward and reverse reads 5 yes'', the second call was run without this setting.
Effect of pooling. Forward and reverse reads from each pool were combined in two large pools. From each pool, ten random subsets were created using seqtk version 1.0-r31. Data was analysed using the same settings as the real samples.