Shedding light on dark genes: enhanced targeted resequencing by optimizing the combination of enrichment technology and DNA fragment length

The exome contains many obscure regions difficult to explore with current short-read sequencing methods. Repetitious genomic regions prevent the unique alignment of reads, which is essential for the identification of clinically-relevant genetic variants. Long-read technologies attempt to resolve multiple-mapping regions, but they still produce many sequencing errors. Thus, a new approach is required to enlighten the obscure regions of the genome and rescue variants that would be otherwise neglected. This work aims to improve the alignment of multiple-mapping reads through the extension of the standard DNA fragment size. As Illumina can sequence fragments up to 550 bp, we tested different DNA fragment lengths using four major commercial WES platforms and found that longer DNA fragments achieved a higher genotypability. This metric, which indicates base calling calculated by combining depth of coverage with the confidence of read alignment, increased from hundreds to thousands of genes, including several associated with clinical phenotypes. While depth of coverage has been considered crucial for the assessment of WES performance, we demonstrated that genotypability has a greater impact in revealing obscure regions, with ~1% increase in variant calling in respect to shorter DNA fragments. Results confirmed that this approach enlightened many regions previously not explored.

timescales, and if the corresponding genomic regions are large enough to prevent a unique read alignment it becomes impossible to determine the source of each read 18 . Sequence aligners assign quality scores to read pairs according to the uniqueness of the alignment, so reads mapping to duplicated regions may gain a high quality score if one of the two read mates can be mapped unambiguously 18,19 .
WES library preparation protocols set the DNA fragment size to the average exon length, which is 170 bp in the human genome [20][21][22] . Short (<100 bp) paired-end reads are generated to avoid the overlap of read pairs, but this fragment length is often shorter than duplicated regions. Furthermore, library preparation protocols often start from very low quantities of material (nanograms to picograms) 23 , limiting the amount of DNA and consequently the number of unique fragments that can be produced. For this reason, 2 × 75 sequencing requires double the number of fragments to produce the expected depth of coverage that can be achieved by 2 × 150 sequencing. More amplification is therefore necessary, producing more PCR duplicates that must be removed during downstream data analysis, thus limiting the depth of coverage at target regions 24 .
Here we describe a new approach that increases the standard DNA fragment size, allowing the longer fragments to extend beyond exonic regions to reach introns, which are under less selection pressure than protein coding sequences but still retain conserved polymorphisms 22 . We anticipated that such an approach would improve the mapping quality of DNA fragments in repetitious genomic regions.

Results
Influence of DNA fragment size on duplicate and off-target rates. We assessed the performance of short (~200 bp), medium (~350 bp) and long (~500 bp) DNA fragments on four major commercial exome enrichment platforms produced by IDT, Roche, Agilent and Twist (Table 1). For each platform, libraries were generated from the genomic DNA of three unrelated individuals and were enriched according to the manufacturers' instructions, and then sequenced on an Illumina HiSeq. 3000 instrument. Short libraries were sequenced in the 2 × 75 bp format, whereas medium and long libraries were sequenced in the 2 × 150 bp format.
The entire dataset (Supplementary Table S1) was normalized to a 140 theoretical X-fold coverage on the target design and the results were aggregated by the mean value of the three replicates ( Table 1). The average insert sizes in the libraries prepared using the short and the medium fragments were 172-268 and 341-368 bp, respectively, whereas the long DNA fragments were often shorter than expected (398-480 bp). We evaluated the frequency of duplicates and the number of sequenced bases near and off the target obtained using different DNA fragment lengths. Short libraries generated the highest frequency of duplicates in three of the four platforms (12-15%), followed by the long libraries in two of the four platforms (9-12%). As expected, the number of sequenced near-target bases increased in all four platforms when the insert size was larger, whereas the off-target rate did not change in two of the four platforms (IDT and Twist) and declined in the other two (Roche and Agilent). The fold enrichment, which is strongly dependent on both the near-target and off-target rates, was lower in all four platforms with larger DNA fragment sizes. The differences in duplicate and off-target rates caused by different DNA fragment lengths resulted in high variability between the theoretical and mapped coverage values for each combination, as anticipated.
Influence of DNA fragment size and enrichment uniformity on genotypability. To assess the genotypability of the targets using different DNA fragment lengths, we compared base calling at uniform coverage levels for each platform. We therefore analyzed a set of downsampled BAM files with an average deduplicated X-fold coverage of 80 on each target design ( Given the evident influence of DNA fragment extension on both enrichment uniformity and genotypability, we evaluated the single and combined effects of DNA fragment sizes and enrichment uniformity on base calling at different coverage levels by producing downsampled BAM files (with an average deduplicated X-fold coverage of 10-80) on the corresponding target designs. To assess how enrichment uniformity influenced genotypability at different coverage levels, we compared the two platforms with the best (Twist) and worst (Agilent) enrichment uniformity for the medium DNA fragments (Table 3). This revealed that the highest uniformity at 80X coverage corresponded to the best genotypability both at the standard read depth (PASS, 96.58%) and the minimum   www.nature.com/scientificreports www.nature.com/scientificreports/ read depth of 10 (PASS10, 96.51%). The platform with the best uniformity (Twist) achieved PASS saturation at 60X coverage (96.57%), whereas the Agilent platform did not reach saturation and showed a maximum PASS value of 96.31% at 80X coverage. PASS10 values are more relevant in a clinical context, and equivalent values could be obtained for the two platforms at very different coverage levels: Twist-M = 95.77% at 40X coverage, and Agilent-M = 95.72% at 80X coverage.
Next we assessed the influence of the DNA fragment size on genotypability, focusing on the platform showing the most variable enrichment uniformity. IDT showed a sharp decrease in coverage uniformity (a greater increase in the FOLD 80 penalty) when fragment size was longer than recommended by the manufacturer, jumping from 1.6 (IDT-S) to 2.28 (IDT-L) as shown in Table 4. With regard to coverage levels, IDT-L produced a greater number of over-represented regions (higher %30X values) than IDT-S at 10X and 20X mapped coverage, whereas the proportion of the target region covered by 10 or more reads at higher coverage levels (40X to 80X) was much lower for IDT-L, indicating an uneven enrichment of longer fragments. In terms of genotypability, we observed overall better PASS values with longer DNA fragments even at coverages as low as 40×. However, the beneficial effect of longer fragments on genotypability did not overcome the negative effect on enrichment uniformity, given that IDT-L did not achieve higher PASS10 values than IDT-S.
Finally, we evaluated the combined effect of DNA fragment extension and improved enrichment uniformity on genotypability in the Twist platform, which demonstrated the most beneficial effect of longer DNA fragment sizes on the uniformity of enrichment. The comparison of Twist-S and Twist-M (Table 5) showed that genotypability improved by more than 1% when both the DNA fragment size and the enrichment uniformity increased, and this was the case for both PASS (96.58%) and PASS10 (96.51%) values. DNA fragment extension did not affect PASS saturation, which was reached at 60X for both Twist-M and Twist-S, but resulted in ~1% more genotyped bases at 80X coverage (96.58% and 95.52%, respectively). The improvement was even higher for the PASS10 values (Twist-M = 96.51%, Twist-S = 95.36%). Therefore, greater enrichment uniformity and medium-length DNA fragments produce a synergistic effect in terms of better genotypability of the target region, especially for clinically-relevant thresholds (PASS10).

Genotypability of RefSeq and OMIM genes.
We also evaluated the genotypability of RefSeq genes using the downsampled BAM files at 80X mapped coverage on the target designs, focusing on the influence of DNA fragment size on the genotypability of each gene. The resulting dataset (Table 6) showed similar trends to those described above. The medium and long DNA fragments achieved higher genotypability (96.54-97.14%) in all platforms compared to the short fragments (95.72-96.28%).
We then calculated the number of genes that could reach (i) 100% genotypability and (ii) any increase in genotypability as a consequence of the increase of the DNA fragment size. The medium-size DNA fragments performed best in three of the four platforms, with long libraries performing best in the Roche platform. There was a difference of 1656 genes between the best (Twist-M) and worst (Agilent-S) performing platforms (Table 7). In the first calculation, the platform with the best enrichment uniformity (Twist) reached 100% genotypability for 1107 genes by extending the DNA fragment length (     www.nature.com/scientificreports www.nature.com/scientificreports/ than 25% in both the short-to-medium and short-to-long fragment extensions ( Supplementary Fig. S1A). In the second calculation, 1993 genes showed an increase in genotypability (Table 8) due to DNA fragment extension (short-to-medium or short-to-long) including almost 200 genes that improved by more than 30% ( Supplementary  Fig. S1A). Overall, more than 800 RefSeq genes reached 100% genotypability in all four platforms and more than 1800 genes showed some increase in genotypability (Table 8 and Fig. 1). Only for a minimal number of the genes which could reach 100% genotypability with the short-size DNA fragments in each platform (99-312), the extension of the DNA fragment length caused a decrease in genotypability (Fig. 1).
The 3873 OMIM genes associated with a clinical phenotype were analyzed as above to determine the improvement in genotypability achieved with longer DNA fragments (Table 8, Fig. 1 and Supplementary Fig. S1B). More than 150 OMIM genes reached 100% genotypability in all four platforms, and more than 280 showed some increase in genotypability. The top 20 OMIM genes ranked by improvement in genotypability as a consequence of extended DNA fragment size showed that, at equal coverage levels, the genotypability of the target region increased with DNA fragment length, from 18% to 52% (Table 9). As seen for the RefSeq dataset, a subset of the OMIM genes which could reach 100% genotypability with the short-size DNA fragments (20-80) decreased in genotypability extending the DNA fragment length (Fig. 1).
Finally, for each sample we determined the number of variants present in the Twist target design, the platform showing the most beneficial effect of longer DNA fragments on the uniformity of enrichment. We observed an aggregate mean increase of >1% in both the short-to-medium and short-to-long fragment extensions (Table 10). The same >1% increase with longer DNA fragments was observed for the number of variants identified in the RefSeq and OMIM genes included in the target design. These results reflect the 1% increase in genotypability achieved by increasing the length of the DNA fragments.
Zooming into the 200-400 bp window. In    www.nature.com/scientificreports www.nature.com/scientificreports/ by three different individuals. All these libraries, processed with Twist kit, were sequenced in the 2 × 150 bp format, except for the 200 bp library, sequenced in the 2 × 75 bp format.
The initial dataset was normalized to a 200 theoretical X-fold coverage on the Twist target design (Supplementary Table S2) and then downsampled to an average deduplicated X-fold coverage of 80 (Table 11). It is interesting to observe the improvement of enrichment uniformity when the DNA fragment size increased from 230 to 260 bp, with the FOLD 80 dropping from 1.58 to 1.34. This improvement was maintained up to 340 bp and then slightly reduced at higher DNA fragment sizes, as also confirmed by the % of bases covered 30X (%30X) that increased from 94-95% at 200-230 bp to 99% at 260-290 and then slowly decreased at higher DNA fragment sizes. Genotypability followed the same trend, with an evident increase of PASS

Discussion
Depth of coverage is the parameter used most often to evaluate the performance of WES enrichment technologies, which are applied during the NGS of selected target regions in the genome 6 . However, the genotypability (base calling) of the target provides more comprehensive information, taking into account not only the depth of coverage, but also the quality of the read alignments. We evaluated changes in the genotypability of target regions caused by increasing the DNA fragment size beyond the typical length of the average exon (aimed to reduce the near-target rate) to improve the alignment of reads derived from repetitious genomic regions.
We found that longer DNA inserts increased the mapping quality of reads and thus the mappability of the target region in all four enrichment platforms, suggesting that improvements in base calling can be achieved in   Table 10. Variants in the Twist target design. Total number of variants identified in the Twist target design, and in the corresponding RefSeq and OMIM genes, for each DNA fragment size (S = short, M = medium, L = long).
these platforms by introducing a measure which is more informative than the standard depth of coverage value. This was particularly evident when evaluating the genotypability of the coding sequence of RefSeq genes at fixed coverage levels (80X mapped). We observed substantial improvements in base calling for many genes, including those of clinical interest in the OMIM dataset. For example, the genotypability of genes RPL15 and RPS26 (associated with the bone marrow disorder Diamond-Blackfan anemia -OMIM 615550,613309 -) improved to 100%, from 49.98% and 47.13%, respectively. Similarly, the genotypability of RPSA (associated with the immunodeficiency disease Isolated congenital asplenia -OMIM 271400 -) improved from 63.29% to 100%, and that of the tumor suppressor gene PTEN improved from 78.55% to 100% (Table 9 and Supplementary Table S3). Oddly, a few genes showed a slight decrease in genotypability from the medium to the long fragment size, probably due to the capture probes' placement which requires optimization for longer fragments (Supplementary Fig. S3). In most cases, longer DNA fragments overcame some of the challenges posed by repetitious genome segments, as recognized by the American College of Medical Genetics and Genomics (ACMG) in their guidelines, which recommend the development of "a strategy for detecting pathogenic variants within regions with known homology" 25 . Expensive long-read sequencing solutions could be adopted for genes that cannot be characterized by short-read sequencing 18 , but such methods have yet to be implemented in diagnostic laboratories 26 . Therefore, our new approach offers an alternative solution for the analysis of genes whose read mapping quality is low, although putative pathogenic variants may be present, especially in genes with the highest medical relevance 27 . The number of variants identified among the RefSeq and OMIM genes included in the Twist design showed that it is possible to improve variant calling by increasing the DNA fragment length (Table 10). The 1% increase in genotypability reported above corresponded to an increase of 1% in variant calling, leading to the identification of variants in regions previously considered uncallable because of low mapping quality (Fig. 2). The greater number of variants confirms that genotypability is a better parameter for the assessment of WES and that greater mappability corresponds to a higher number of detected variants in repetitious genome regions. Such findings could be clinically significant, especially when analyzing affected patients, and hence they should be included in subsequent variant annotations to prioritize their characterization and assessment.
Longer DNA inserts do not always improve the uniformity of coverage in the target region, as previously reported 12 , and this is another important parameter for the evaluation of enrichment efficiency during targeted NGS. Indeed, the four platforms responded differently in terms of enrichment uniformity: whereas the Roche and Agilent platforms were largely insensitive to the extension of DNA fragment length, the IDT platform showed a dramatic increase in the FOLD 80 penalty (corresponding to low enrichment uniformity), indicating that it has already been optimized for very short fragments. Interestingly, the opposite trend was apparent in the Twist platform, indicating that the already highly uniform enrichment can be improved even further. It would be interesting to determine whether this reflects the internal calibration procedure of the Twist platform or a favorable effect of the double-stranded capture probes. Generally, with high FOLD 80 penalty scores, the genotypability of the target region was directly related to the mapped coverage (higher coverage = higher genotypability), whereas higher enrichment uniformity resulted in the genotypability reaching a plateau at ~60X coverage, making deeper coverage unnecessary.
The advantages of higher enrichment uniformity include the reduction of WES costs and the need for less starting material, given that reducing the depth of sequencing also reduces the number of duplicates. Moreover, increasing the DNA fragment length also helps to overcome the problem of duplicates because for short DNA inserts the use of 2 × 75 bp reads requires the sequencing of twice as many fragments from the same amount of DNA compared to 2 × 150 bp reads. In contrast, the fold enrichment value (another important measure of enrichment efficiency based on the on-target and off-target rates) was often misleading for two reasons. First, fold enrichment depends on the definition of the target region, which in some cases is delineated by the exon boundaries but in others corresponds to a much broader area (Fig. 3). Second, fold enrichment does not provide comprehensive information about the real efficiency of the enrichment platforms, given that lower values did not correlate with a reduction in genotypability (Tables 2 and 6). Taken together, our data show that WES performance should be based on the genotypability of the target region, which strictly depends on a combination of the DNA fragment size and the uniformity of the enrichment platform. This parameter will help clinicians to select the optimal combination of DNA insert length and enrichment platform during the design of the target region, allowing the correct interpretation of truly positive, but especially truly negative, findings. Our new approach will help to overcome current challenges caused by the presence of repetitious regions in the human genome.

Methods
Library preparation and exome capture. Genomic DNA for NA12891 and NA12892 was purchased from the Coriell Institute for Medical Research. Sample VR00 was obtained from the whole blood of an unrelated third individual, who signed an informed consent form. Samples and clinical information were made de-identified immediately after collection. All the investigations have been conducted according to the principles expressed in the Declaration of Helsinki. The analysis performed on VR00 has been approved by the "Comitato Etico per la Sperimentazione Clinica (CESC) delle province di Verona e Rovigo" ethic board. Samples NA12891, NA12892 and VR00 were processed using four different enrichment platforms: xGen Exome Research Panel V1 (IDT), SeqCap EZ MedExome (Roche), SureSelect Human All Exon V6 (Agilent), and the Human Core Exome Kit + RefSeq V1 (Twist). We produced three different DNA fragment lengths for each sample: short fragments based on the manufacturers' recommendations (IDT = 150 bp, Roche, Agilent and Twist = 200 bp), medium fragments (expected length ~350 bp), and long fragments (expected length ~500 bp). The 27 samples provided by our internal database were processed using the Human Core Exome Kit + RefSeq V1 (Twist).
Libraries were prepared according to the manufacturers' protocols. NA12891, NA12892 and VR00 samples (IDT = 100 ng, Roche = 500 ng, Agilent = 1500 ng, and Twist = 50 ng) apart from Twist-200 bp, which were sheared enzymatically, were sheared using a Covaris M220 ultrasonicator, adjusting the treatment time to   Table S4). Given the low quantity of starting material for the Twist platform, the preparation of long DNA fragments was carried out twice for each replicate and the samples were combined before size selection to produce enough DNA to ensure sufficient library complexity. The size selection was performed with Agencourt AMPure XP (Beckman Coulter) before the pre-capture PCR. The DNA fragments and libraries were characterized using a Labchip GX Touch HS Kit (Perkin Elmer) or TapeStation (Agilent) to determine the size distribution and to check for adapter contamination. The 27 DNA samples selected from our internal database were sheared enzymatically and processed adjusting the Twist protocol (Supplementary Table S5).
For samples NA12891, NA12892 and VR00, exome capture was performed independently for each combination of DNA fragment size and enrichment platform. Generally we followed the manufacturers' protocols, but exceptions were made for the Agilent platform (single sample capture was performed) and for the preparation of long DNA fragments (the number of PCR cycles was increased by two for all platforms except Twist).
Sequencing and bioinformatics. The samples were sequenced on a HiSeq3000 instrument (Illumina) in 75 bp paired-end mode for the short libraries and in 150 bp paired-end mode for all the other libraries. An in-house bioinformatics pipeline was developed for data analysis, integrating different software as described below.
For samples NA12891, NA12892 and VR00, from the initial dataset representing each sample we produced downsampled BAM files with a 140 theoretical X-fold coverage on the target design, subsampling the required number of fragments (calculated as: (140 * design length)/(read length * 2)) using seqtk (https://github.com/ lh3/seqtk). We then produced downsampled BAM files with a 10-80X-fold mapped coverage (the maximum mapped coverage value obtained by all the platforms, generated by sub-sampling the full dataset using sambamba v0.6.7 -https://github.com/biod/sambamba -). The 27 individuals from our internal database were selected on the basis of their average DNA fragment length (200, 230, 260, 270, 280, 290, 340, 360 and 400 bp), considering 3 individuals for each size. We produced downsampled BAM files with a 200 theoretical X-fold coverage on the target design and with a 80X-fold mapped coverage (the maximum theoretical and mapped coverage values obtained by all the samples).
We then used CallableLoci in GATK v3.8 to identify callable regions of the target (genotypability), with minimum read depths of 3 and 10. These values were integrated as additional WES performance parameters for the evaluation of variant detection. CollectHsMetrics by Picard v2.17.10 was used to calculate fold enrichment and FOLD 80 penalty values to determine enrichment quality. All WES performance parameters were calculated both on the design of each platform and on the standard dataset of RefSeq genes. For each sample, near target was defined as the distance from the region of interest corresponding to the average length of the DNA fragments. Variant calling was performed using GATK v4.1.2.

Data availability
Deposited data generated during the current study are available for download at our public repository at the link: http://ddlab.sci.univr.it/files/iadarola_et_al/iadarola_et_al.tar.gz (VCF files with associated BED files of callable regions) and at the Sequence Read Archive (SRA) repository under study accession number: SRP253353 (FASTQ files).