Introduction

Next-generation sequencing (NGS) of targeted sets of regions of interest is one of the most widely used methods for genetic diagnostic testing1,2,3. Whole-exome sequencing (WES) platforms allow the enrichment of the entire set of human genes, offering diversity in terms of target region selection, bait length and density, the capture molecule, and the genomic fragmentation method4,5. Despite some differences in the design of target regions, all current platforms perform well6,7,8,9,10 thanks to improvements made over the past few years to enrich poorly-covered regions5,11. The performance of WES is usually evaluated according to the depth and uniformity of coverage12 because minimum site coverage of more than 10-fold13,14,15 is generally required to identify germline variants6. However, current bioinformatics pipelines for the identification of variants generate a standard variant call format (VCF) file, which reports variant sites in genomes filtered by both site coverage and mapping quality16. In order to rescue variants in well-covered regions with a low mapping quality, base calling (genotypability) calculated by combining the confidence of read alignment with the depth of coverage should be considered as a more informative parameter for the assessment of WES performance17.

There are many regions of low mapping quality in the human genome, often arising from repetitious sequences that prevent the unique alignment of short read pairs. Many genes have been duplicated over evolutionary timescales, and if the corresponding genomic regions are large enough to prevent a unique read alignment it becomes impossible to determine the source of each read18. Sequence aligners assign quality scores to read pairs according to the uniqueness of the alignment, so reads mapping to duplicated regions may gain a high quality score if one of the two read mates can be mapped unambiguously18,19.

WES library preparation protocols set the DNA fragment size to the average exon length, which is 170 bp in the human genome20,21,22. Short (<100 bp) paired-end reads are generated to avoid the overlap of read pairs, but this fragment length is often shorter than duplicated regions. Furthermore, library preparation protocols often start from very low quantities of material (nanograms to picograms)23, limiting the amount of DNA and consequently the number of unique fragments that can be produced. For this reason, 2 × 75 sequencing requires double the number of fragments to produce the expected depth of coverage that can be achieved by 2 × 150 sequencing. More amplification is therefore necessary, producing more PCR duplicates that must be removed during downstream data analysis, thus limiting the depth of coverage at target regions24.

Here we describe a new approach that increases the standard DNA fragment size, allowing the longer fragments to extend beyond exonic regions to reach introns, which are under less selection pressure than protein coding sequences but still retain conserved polymorphisms22. We anticipated that such an approach would improve the mapping quality of DNA fragments in repetitious genomic regions.

Results

Influence of DNA fragment size on duplicate and off-target rates

We assessed the performance of short (~200 bp), medium (~350 bp) and long (~500 bp) DNA fragments on four major commercial exome enrichment platforms produced by IDT, Roche, Agilent and Twist (Table 1). For each platform, libraries were generated from the genomic DNA of three unrelated individuals and were enriched according to the manufacturers’ instructions, and then sequenced on an Illumina HiSeq. 3000 instrument. Short libraries were sequenced in the 2 × 75 bp format, whereas medium and long libraries were sequenced in the 2 × 150 bp format.

Table 1 The 140X dataset.

The entire dataset (Supplementary Table S1) was normalized to a 140 theoretical X-fold coverage on the target design and the results were aggregated by the mean value of the three replicates (Table 1). The average insert sizes in the libraries prepared using the short and the medium fragments were 172–268 and 341–368 bp, respectively, whereas the long DNA fragments were often shorter than expected (398–480 bp). We evaluated the frequency of duplicates and the number of sequenced bases near and off the target obtained using different DNA fragment lengths. Short libraries generated the highest frequency of duplicates in three of the four platforms (12–15%), followed by the long libraries in two of the four platforms (9–12%). As expected, the number of sequenced near-target bases increased in all four platforms when the insert size was larger, whereas the off-target rate did not change in two of the four platforms (IDT and Twist) and declined in the other two (Roche and Agilent). The fold enrichment, which is strongly dependent on both the near-target and off-target rates, was lower in all four platforms with larger DNA fragment sizes. The differences in duplicate and off-target rates caused by different DNA fragment lengths resulted in high variability between the theoretical and mapped coverage values for each combination, as anticipated.

Influence of DNA fragment size and enrichment uniformity on genotypability

To assess the genotypability of the targets using different DNA fragment lengths, we compared base calling at uniform coverage levels for each platform. We therefore analyzed a set of downsampled BAM files with an average deduplicated X-fold coverage of 80 on each target design (Table 2). The enrichment uniformity, evaluated by applying the FOLD 80 penalty value (the fold over-coverage necessary to raise 80% of bases to the mean coverage level in those targets – https://broadinstitute.github.io/picard/picard-metric-definitions.html –), was influenced by the increase in DNA fragment size in three out of four platforms. Compared to the short DNA fragments, longer ones increased the uniformity in one platform (Twist) but reduced it in two others (IDT and Roche), suggesting DNA fragment extension had a platform-specific effect. However, the medium and long DNA fragments achieved higher base calling values in all platforms (96.58–98.23%) compared to the short DNA fragments (95.52–97.58%).

Table 2 The 80X mapped dataset.

Given the evident influence of DNA fragment extension on both enrichment uniformity and genotypability, we evaluated the single and combined effects of DNA fragment sizes and enrichment uniformity on base calling at different coverage levels by producing downsampled BAM files (with an average deduplicated X-fold coverage of 10–80) on the corresponding target designs. To assess how enrichment uniformity influenced genotypability at different coverage levels, we compared the two platforms with the best (Twist) and worst (Agilent) enrichment uniformity for the medium DNA fragments (Table 3). This revealed that the highest uniformity at 80X coverage corresponded to the best genotypability both at the standard read depth (PASS, 96.58%) and the minimum read depth of 10 (PASS10, 96.51%). The platform with the best uniformity (Twist) achieved PASS saturation at 60X coverage (96.57%), whereas the Agilent platform did not reach saturation and showed a maximum PASS value of 96.31% at 80X coverage. PASS10 values are more relevant in a clinical context, and equivalent values could be obtained for the two platforms at very different coverage levels: Twist-M = 95.77% at 40X coverage, and Agilent-M = 95.72% at 80X coverage.

Table 3 Downsampled mapped coverage for the platforms showing the best and worst FOLD 80 values at fixed DNA fragment lengths.

Next we assessed the influence of the DNA fragment size on genotypability, focusing on the platform showing the most variable enrichment uniformity. IDT showed a sharp decrease in coverage uniformity (a greater increase in the FOLD 80 penalty) when fragment size was longer than recommended by the manufacturer, jumping from 1.6 (IDT-S) to 2.28 (IDT-L) as shown in Table 4. With regard to coverage levels, IDT-L produced a greater number of over-represented regions (higher %30X values) than IDT-S at 10X and 20X mapped coverage, whereas the proportion of the target region covered by 10 or more reads at higher coverage levels (40X to 80X) was much lower for IDT-L, indicating an uneven enrichment of longer fragments. In terms of genotypability, we observed overall better PASS values with longer DNA fragments even at coverages as low as 40×. However, the beneficial effect of longer fragments on genotypability did not overcome the negative effect on enrichment uniformity, given that IDT-L did not achieve higher PASS10 values than IDT-S.

Table 4 Downsampled mapped coverage for the platform showing the highest variation of FOLD 80 values using different DNA fragment lengths.

Finally, we evaluated the combined effect of DNA fragment extension and improved enrichment uniformity on genotypability in the Twist platform, which demonstrated the most beneficial effect of longer DNA fragment sizes on the uniformity of enrichment. The comparison of Twist-S and Twist-M (Table 5) showed that genotypability improved by more than 1% when both the DNA fragment size and the enrichment uniformity increased, and this was the case for both PASS (96.58%) and PASS10 (96.51%) values. DNA fragment extension did not affect PASS saturation, which was reached at 60X for both Twist-M and Twist-S, but resulted in ~1% more genotyped bases at 80X coverage (96.58% and 95.52%, respectively). The improvement was even higher for the PASS10 values (Twist-M = 96.51%, Twist-S = 95.36%). Therefore, greater enrichment uniformity and medium-length DNA fragments produce a synergistic effect in terms of better genotypability of the target region, especially for clinically-relevant thresholds (PASS10).

Table 5 Downsampled mapped coverage for the platform showing the most beneficial effect of longer DNA fragments on FOLD 80 values.

Genotypability of RefSeq and OMIM genes

We also evaluated the genotypability of RefSeq genes using the downsampled BAM files at 80X mapped coverage on the target designs, focusing on the influence of DNA fragment size on the genotypability of each gene. The resulting dataset (Table 6) showed similar trends to those described above. The medium and long DNA fragments achieved higher genotypability (96.54–97.14%) in all platforms compared to the short fragments (95.72–96.28%).

Table 6 The 80X mapped dataset (RefSeq genes).

We then calculated the number of genes that could reach (i) 100% genotypability and (ii) any increase in genotypability as a consequence of the increase of the DNA fragment size. The medium-size DNA fragments performed best in three of the four platforms, with long libraries performing best in the Roche platform. There was a difference of 1656 genes between the best (Twist-M) and worst (Agilent-S) performing platforms (Table 7). In the first calculation, the platform with the best enrichment uniformity (Twist) reached 100% genotypability for 1107 genes by extending the DNA fragment length (Table 8), including 100 genes that improved by more than 25% in both the short-to-medium and short-to-long fragment extensions (Supplementary Fig. S1A). In the second calculation, 1993 genes showed an increase in genotypability (Table 8) due to DNA fragment extension (short-to-medium or short-to-long) including almost 200 genes that improved by more than 30% (Supplementary Fig. S1A). Overall, more than 800 RefSeq genes reached 100% genotypability in all four platforms and more than 1800 genes showed some increase in genotypability (Table 8 and Fig. 1). Only for a minimal number of the genes which could reach 100% genotypability with the short-size DNA fragments in each platform (99–312), the extension of the DNA fragment length caused a decrease in genotypability (Fig. 1).

Table 7 Number of RefSeq genes reaching 100% genotypability.
Table 8 Number of genes showing increased genotypability.
Figure 1
figure 1

RefSeq/OMIM genes reaching 100% genotypability. Number of RefSeq (a) and OMIM (b) genes reaching 100% genotypability at 80X mapped coverage on each target design using different DNA fragment lengths.

The 3873 OMIM genes associated with a clinical phenotype were analyzed as above to determine the improvement in genotypability achieved with longer DNA fragments (Table 8, Fig. 1 and Supplementary Fig. S1B). More than 150 OMIM genes reached 100% genotypability in all four platforms, and more than 280 showed some increase in genotypability. The top 20 OMIM genes ranked by improvement in genotypability as a consequence of extended DNA fragment size showed that, at equal coverage levels, the genotypability of the target region increased with DNA fragment length, from 18% to 52% (Table 9). As seen for the RefSeq dataset, a subset of the OMIM genes which could reach 100% genotypability with the short-size DNA fragments (20–80) decreased in genotypability extending the DNA fragment length (Fig. 1).

Table 9 Top 20 OMIM genes showing the best improvement in genotypability.

Finally, for each sample we determined the number of variants present in the Twist target design, the platform showing the most beneficial effect of longer DNA fragments on the uniformity of enrichment. We observed an aggregate mean increase of >1% in both the short-to-medium and short-to-long fragment extensions (Table 10). The same >1% increase with longer DNA fragments was observed for the number of variants identified in the RefSeq and OMIM genes included in the target design. These results reflect the 1% increase in genotypability achieved by increasing the length of the DNA fragments.

Table 10 Variants in the Twist target design.

Zooming into the 200–400 bp window

In order to better define the DNA fragment size that determines the highest genotypability, we selected 27 samples from our internal database of processed individuals showing DNA fragment lengths of 200, 230, 260, 270, 280, 290, 340, 360 and 400 bp. Each DNA fragment size was represented by three different individuals. All these libraries, processed with Twist kit, were sequenced in the 2 × 150 bp format, except for the 200 bp library, sequenced in the 2 × 75 bp format.

The initial dataset was normalized to a 200 theoretical X-fold coverage on the Twist target design (Supplementary Table S2) and then downsampled to an average deduplicated X-fold coverage of 80 (Table 11). It is interesting to observe the improvement of enrichment uniformity when the DNA fragment size increased from 230 to 260 bp, with the FOLD 80 dropping from 1.58 to 1.34. This improvement was maintained up to 340 bp and then slightly reduced at higher DNA fragment sizes, as also confirmed by the % of bases covered 30X (%30X) that increased from 94–95% at 200–230 bp to 99% at 260–290 and then slowly decreased at higher DNA fragment sizes. Genotypability followed the same trend, with an evident increase of PASS and PASS10 values between the 230 and 260 bp length (from 95.55% to 96.37% for PASS, and from 95.38% to 96.33% for PASS10, respectively). Moreover, at stationary values of FOLD 80 (1.34–1.37 from 260 to 340 bp), genotypability could still increase as a result of the solely DNA fragment extension: from 96.37% to 96.54% for PASS and from 96.33% to 96.49% for PASS10. It is worth noting how the improvement of enrichment uniformity reduced the gap between PASS and PASS10 values, thus improving variant calling in the clinical setting.

Table 11 The 80X mapped dataset.

Discussion

Depth of coverage is the parameter used most often to evaluate the performance of WES enrichment technologies, which are applied during the NGS of selected target regions in the genome6. However, the genotypability (base calling) of the target provides more comprehensive information, taking into account not only the depth of coverage, but also the quality of the read alignments. We evaluated changes in the genotypability of target regions caused by increasing the DNA fragment size beyond the typical length of the average exon (aimed to reduce the near-target rate) to improve the alignment of reads derived from repetitious genomic regions.

We found that longer DNA inserts increased the mapping quality of reads and thus the mappability of the target region in all four enrichment platforms, suggesting that improvements in base calling can be achieved in these platforms by introducing a measure which is more informative than the standard depth of coverage value. This was particularly evident when evaluating the genotypability of the coding sequence of RefSeq genes at fixed coverage levels (80X mapped). We observed substantial improvements in base calling for many genes, including those of clinical interest in the OMIM dataset. For example, the genotypability of genes RPL15 and RPS26 (associated with the bone marrow disorder Diamond-Blackfan anemia – OMIM 615550,613309 –) improved to 100%, from 49.98% and 47.13%, respectively. Similarly, the genotypability of RPSA (associated with the immunodeficiency disease Isolated congenital asplenia – OMIM 271400 –) improved from 63.29% to 100%, and that of the tumor suppressor gene PTEN improved from 78.55% to 100% (Table 9 and Supplementary Table S3). Oddly, a few genes showed a slight decrease in genotypability from the medium to the long fragment size, probably due to the capture probes’ placement which requires optimization for longer fragments (Supplementary Fig. S3). In most cases, longer DNA fragments overcame some of the challenges posed by repetitious genome segments, as recognized by the American College of Medical Genetics and Genomics (ACMG) in their guidelines, which recommend the development of “a strategy for detecting pathogenic variants within regions with known homology”25. Expensive long-read sequencing solutions could be adopted for genes that cannot be characterized by short-read sequencing18, but such methods have yet to be implemented in diagnostic laboratories26. Therefore, our new approach offers an alternative solution for the analysis of genes whose read mapping quality is low, although putative pathogenic variants may be present, especially in genes with the highest medical relevance27.

The number of variants identified among the RefSeq and OMIM genes included in the Twist design showed that it is possible to improve variant calling by increasing the DNA fragment length (Table 10). The 1% increase in genotypability reported above corresponded to an increase of 1% in variant calling, leading to the identification of variants in regions previously considered uncallable because of low mapping quality (Fig. 2). The greater number of variants confirms that genotypability is a better parameter for the assessment of WES and that greater mappability corresponds to a higher number of detected variants in repetitious genome regions. Such findings could be clinically significant, especially when analyzing affected patients, and hence they should be included in subsequent variant annotations to prioritize their characterization and assessment.

Figure 2
figure 2

Variant “chr13:110457271” in COL4A2 gene (RefSeq dataset). Variant called in the NA12891 sample using short-size DNA fragment (above) and medium-size DNA fragment (below). The BAM files of the samples are shown on the Twist design at 80X mapped coverage. The colour of the bar indicates the mapping quality of the read: grey = high quality mapping; white = low quality mapping.

Longer DNA inserts do not always improve the uniformity of coverage in the target region, as previously reported12, and this is another important parameter for the evaluation of enrichment efficiency during targeted NGS. Indeed, the four platforms responded differently in terms of enrichment uniformity: whereas the Roche and Agilent platforms were largely insensitive to the extension of DNA fragment length, the IDT platform showed a dramatic increase in the FOLD 80 penalty (corresponding to low enrichment uniformity), indicating that it has already been optimized for very short fragments. Interestingly, the opposite trend was apparent in the Twist platform, indicating that the already highly uniform enrichment can be improved even further. It would be interesting to determine whether this reflects the internal calibration procedure of the Twist platform or a favorable effect of the double-stranded capture probes. Generally, with high FOLD 80 penalty scores, the genotypability of the target region was directly related to the mapped coverage (higher coverage = higher genotypability), whereas higher enrichment uniformity resulted in the genotypability reaching a plateau at ~60X coverage, making deeper coverage unnecessary.

The advantages of higher enrichment uniformity include the reduction of WES costs and the need for less starting material, given that reducing the depth of sequencing also reduces the number of duplicates. Moreover, increasing the DNA fragment length also helps to overcome the problem of duplicates because for short DNA inserts the use of 2 × 75 bp reads requires the sequencing of twice as many fragments from the same amount of DNA compared to 2 × 150 bp reads. In contrast, the fold enrichment value (another important measure of enrichment efficiency based on the on-target and off-target rates) was often misleading for two reasons. First, fold enrichment depends on the definition of the target region, which in some cases is delineated by the exon boundaries but in others corresponds to a much broader area (Fig. 3). Second, fold enrichment does not provide comprehensive information about the real efficiency of the enrichment platforms, given that lower values did not correlate with a reduction in genotypability (Tables 2 and 6).

Figure 3
figure 3

Differences in BED coordinates. Genomic coordinates of exon 2 of the GZMB gene reported in the BED files provided by different enrichment platforms suppliers.

Taken together, our data show that WES performance should be based on the genotypability of the target region, which strictly depends on a combination of the DNA fragment size and the uniformity of the enrichment platform. This parameter will help clinicians to select the optimal combination of DNA insert length and enrichment platform during the design of the target region, allowing the correct interpretation of truly positive, but especially truly negative, findings. Our new approach will help to overcome current challenges caused by the presence of repetitious regions in the human genome.

Methods

Library preparation and exome capture

Genomic DNA for NA12891 and NA12892 was purchased from the Coriell Institute for Medical Research. Sample VR00 was obtained from the whole blood of an unrelated third individual, who signed an informed consent form. Samples and clinical information were made de-identified immediately after collection. All the investigations have been conducted according to the principles expressed in the Declaration of Helsinki. The analysis performed on VR00 has been approved by the “Comitato Etico per la Sperimentazione Clinica (CESC) delle province di Verona e Rovigo” ethic board. Samples NA12891, NA12892 and VR00 were processed using four different enrichment platforms: xGen Exome Research Panel V1 (IDT), SeqCap EZ MedExome (Roche), SureSelect Human All Exon V6 (Agilent), and the Human Core Exome Kit + RefSeq V1 (Twist). We produced three different DNA fragment lengths for each sample: short fragments based on the manufacturers’ recommendations (IDT = 150 bp, Roche, Agilent and Twist = 200 bp), medium fragments (expected length ~350 bp), and long fragments (expected length ~500 bp). The 27 samples provided by our internal database were processed using the Human Core Exome Kit + RefSeq V1 (Twist).

Libraries were prepared according to the manufacturers’ protocols. NA12891, NA12892 and VR00 samples (IDT = 100 ng, Roche = 500 ng, Agilent = 1500 ng, and Twist = 50 ng) apart from Twist-200 bp, which were sheared enzymatically, were sheared using a Covaris M220 ultrasonicator, adjusting the treatment time to obtain the desired DNA fragment length (Supplementary Table S4). Given the low quantity of starting material for the Twist platform, the preparation of long DNA fragments was carried out twice for each replicate and the samples were combined before size selection to produce enough DNA to ensure sufficient library complexity. The size selection was performed with Agencourt AMPure XP (Beckman Coulter) before the pre-capture PCR. The DNA fragments and libraries were characterized using a Labchip GX Touch HS Kit (Perkin Elmer) or TapeStation (Agilent) to determine the size distribution and to check for adapter contamination. The 27 DNA samples selected from our internal database were sheared enzymatically and processed adjusting the Twist protocol (Supplementary Table S5).

For samples NA12891, NA12892 and VR00, exome capture was performed independently for each combination of DNA fragment size and enrichment platform. Generally we followed the manufacturers’ protocols, but exceptions were made for the Agilent platform (single sample capture was performed) and for the preparation of long DNA fragments (the number of PCR cycles was increased by two for all platforms except Twist).

Sequencing and bioinformatics

The samples were sequenced on a HiSeq3000 instrument (Illumina) in 75 bp paired-end mode for the short libraries and in 150 bp paired-end mode for all the other libraries. An in-house bioinformatics pipeline was developed for data analysis, integrating different software as described below.

Initial FASTQ files were quality controlled using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Low quality nucleotides have been trimmed using sickle v1.33 (https://github.com/najoshi/sickle) and adaptors were removed using scythe v0.991 (https://github.com/vsbuffalo/scythe). Reads were then aligned to the reference human genome sequence (GRCh38/hg38) using BWA-MEM v0.7.15 (https://arxiv.org/abs/1303.3997). The SAM output file was converted into a sorted BAM file using SAMtools, and the BAM files were processed by local realignment around insertion–deletion sites, duplicate marking and recalibration using Genome Analysis Toolkit v3.816. Overlapping regions of the BAM file were clipped using BamUtil v1.4.14 to avoid counting multiple reads representing the same fragment. Insert sizes were calculated after read alignment, measuring the distance of the two mates mapped on the genome using CollectInsertSize by Picard v2.17.10 (http://broadinstitute.github.io/picard/).

For samples NA12891, NA12892 and VR00, from the initial dataset representing each sample we produced downsampled BAM files with a 140 theoretical X-fold coverage on the target design, subsampling the required number of fragments (calculated as: (140 * design length)/(read length * 2)) using seqtk (https://github.com/lh3/seqtk). We then produced downsampled BAM files with a 10–80X-fold mapped coverage (the maximum mapped coverage value obtained by all the platforms, generated by sub-sampling the full dataset using sambamba v0.6.7 – https://github.com/biod/sambamba –). The 27 individuals from our internal database were selected on the basis of their average DNA fragment length (200, 230, 260, 270, 280, 290, 340, 360 and 400 bp), considering 3 individuals for each size. We produced downsampled BAM files with a 200 theoretical X-fold coverage on the target design and with a 80X-fold mapped coverage (the maximum theoretical and mapped coverage values obtained by all the samples).

We then used CallableLoci in GATK v3.8 to identify callable regions of the target (genotypability), with minimum read depths of 3 and 10. These values were integrated as additional WES performance parameters for the evaluation of variant detection. CollectHsMetrics by Picard v2.17.10 was used to calculate fold enrichment and FOLD 80 penalty values to determine enrichment quality. All WES performance parameters were calculated both on the design of each platform and on the standard dataset of RefSeq genes. For each sample, near target was defined as the distance from the region of interest corresponding to the average length of the DNA fragments. Variant calling was performed using GATK v4.1.2.

Datasets

The RefSeq database (release 82) was downloaded from the UCSC Genome Table Browser (http://genome.ucsc.edu/). Online Mendelian Inheritance in Man (OMIM) genes associated with a clinical phenotype were downloaded from the OMIM website (https://www.omim.org/, release 15-05-2018).