Shedding light on dark genes: enhanced targeted resequencing by optimizing the combination of enrichment technology and DNA fragment length

Iadarola, Barbara; Xumerle, Luciano; Lavezzari, Denise; Paterno, Marta; Marcolungo, Luca; Beltrami, Cristina; Fortunati, Elisabetta; Mei, Davide; Vetro, Annalisa; Guerrini, Renzo; Parrini, Elena; Rossato, Marzia; Delledonne, Massimo

doi:10.1038/s41598-020-66331-z

Download PDF

Article
Open access
Published: 10 June 2020

Shedding light on dark genes: enhanced targeted resequencing by optimizing the combination of enrichment technology and DNA fragment length

Barbara Iadarola^1,2^na1,
Luciano Xumerle^1,2^na1,
Denise Lavezzari¹,
Marta Paterno¹,
Luca Marcolungo¹,
Cristina Beltrami¹,
Elisabetta Fortunati¹,
Davide Mei³,
Annalisa Vetro³,
Renzo Guerrini³,
Elena Parrini³,
Marzia Rossato¹ &
…
Massimo Delledonne¹

Scientific Reports volume 10, Article number: 9424 (2020) Cite this article

4126 Accesses
7 Citations
6 Altmetric
Metrics details

Subjects

Abstract

The exome contains many obscure regions difficult to explore with current short-read sequencing methods. Repetitious genomic regions prevent the unique alignment of reads, which is essential for the identification of clinically-relevant genetic variants. Long-read technologies attempt to resolve multiple-mapping regions, but they still produce many sequencing errors. Thus, a new approach is required to enlighten the obscure regions of the genome and rescue variants that would be otherwise neglected. This work aims to improve the alignment of multiple-mapping reads through the extension of the standard DNA fragment size. As Illumina can sequence fragments up to 550 bp, we tested different DNA fragment lengths using four major commercial WES platforms and found that longer DNA fragments achieved a higher genotypability. This metric, which indicates base calling calculated by combining depth of coverage with the confidence of read alignment, increased from hundreds to thousands of genes, including several associated with clinical phenotypes. While depth of coverage has been considered crucial for the assessment of WES performance, we demonstrated that genotypability has a greater impact in revealing obscure regions, with ~1% increase in variant calling in respect to shorter DNA fragments. Results confirmed that this approach enlightened many regions previously not explored.

Inferring gene regulatory networks from single-cell multiome data using atlas-scale external data

Article Open access 12 April 2024

Tissue-specific enhancer–gene maps from multimodal single-cell data identify causal disease alleles

Article 09 April 2024

Nanopore sequencing technology, bioinformatics and applications

Article 08 November 2021

Introduction

Next-generation sequencing (NGS) of targeted sets of regions of interest is one of the most widely used methods for genetic diagnostic testing^1,2,3. Whole-exome sequencing (WES) platforms allow the enrichment of the entire set of human genes, offering diversity in terms of target region selection, bait length and density, the capture molecule, and the genomic fragmentation method^4,5. Despite some differences in the design of target regions, all current platforms perform well^6,7,8,9,10 thanks to improvements made over the past few years to enrich poorly-covered regions^5,11. The performance of WES is usually evaluated according to the depth and uniformity of coverage¹² because minimum site coverage of more than 10-fold^13,14,15 is generally required to identify germline variants⁶. However, current bioinformatics pipelines for the identification of variants generate a standard variant call format (VCF) file, which reports variant sites in genomes filtered by both site coverage and mapping quality¹⁶. In order to rescue variants in well-covered regions with a low mapping quality, base calling (genotypability) calculated by combining the confidence of read alignment with the depth of coverage should be considered as a more informative parameter for the assessment of WES performance¹⁷.

There are many regions of low mapping quality in the human genome, often arising from repetitious sequences that prevent the unique alignment of short read pairs. Many genes have been duplicated over evolutionary timescales, and if the corresponding genomic regions are large enough to prevent a unique read alignment it becomes impossible to determine the source of each read¹⁸. Sequence aligners assign quality scores to read pairs according to the uniqueness of the alignment, so reads mapping to duplicated regions may gain a high quality score if one of the two read mates can be mapped unambiguously^18,19.

WES library preparation protocols set the DNA fragment size to the average exon length, which is 170 bp in the human genome^20,21,22. Short (<100 bp) paired-end reads are generated to avoid the overlap of read pairs, but this fragment length is often shorter than duplicated regions. Furthermore, library preparation protocols often start from very low quantities of material (nanograms to picograms)²³, limiting the amount of DNA and consequently the number of unique fragments that can be produced. For this reason, 2 × 75 sequencing requires double the number of fragments to produce the expected depth of coverage that can be achieved by 2 × 150 sequencing. More amplification is therefore necessary, producing more PCR duplicates that must be removed during downstream data analysis, thus limiting the depth of coverage at target regions²⁴.

Here we describe a new approach that increases the standard DNA fragment size, allowing the longer fragments to extend beyond exonic regions to reach introns, which are under less selection pressure than protein coding sequences but still retain conserved polymorphisms²². We anticipated that such an approach would improve the mapping quality of DNA fragments in repetitious genomic regions.

Results

Influence of DNA fragment size on duplicate and off-target rates

We assessed the performance of short (~200 bp), medium (~350 bp) and long (~500 bp) DNA fragments on four major commercial exome enrichment platforms produced by IDT, Roche, Agilent and Twist (Table 1). For each platform, libraries were generated from the genomic DNA of three unrelated individuals and were enriched according to the manufacturers’ instructions, and then sequenced on an Illumina HiSeq. 3000 instrument. Short libraries were sequenced in the 2 × 75 bp format, whereas medium and long libraries were sequenced in the 2 × 150 bp format.

Table 1 The 140X dataset.

Full size table

The entire dataset (Supplementary Table S1) was normalized to a 140 theoretical X-fold coverage on the target design and the results were aggregated by the mean value of the three replicates (Table 1). The average insert sizes in the libraries prepared using the short and the medium fragments were 172–268 and 341–368 bp, respectively, whereas the long DNA fragments were often shorter than expected (398–480 bp). We evaluated the frequency of duplicates and the number of sequenced bases near and off the target obtained using different DNA fragment lengths. Short libraries generated the highest frequency of duplicates in three of the four platforms (12–15%), followed by the long libraries in two of the four platforms (9–12%). As expected, the number of sequenced near-target bases increased in all four platforms when the insert size was larger, whereas the off-target rate did not change in two of the four platforms (IDT and Twist) and declined in the other two (Roche and Agilent). The fold enrichment, which is strongly dependent on both the near-target and off-target rates, was lower in all four platforms with larger DNA fragment sizes. The differences in duplicate and off-target rates caused by different DNA fragment lengths resulted in high variability between the theoretical and mapped coverage values for each combination, as anticipated.

Influence of DNA fragment size and enrichment uniformity on genotypability

To assess the genotypability of the targets using different DNA fragment lengths, we compared base calling at uniform coverage levels for each platform. We therefore analyzed a set of downsampled BAM files with an average deduplicated X-fold coverage of 80 on each target design (Table 2). The enrichment uniformity, evaluated by applying the FOLD 80 penalty value (the fold over-coverage necessary to raise 80% of bases to the mean coverage level in those targets – https://broadinstitute.github.io/picard/picard-metric-definitions.html –), was influenced by the increase in DNA fragment size in three out of four platforms. Compared to the short DNA fragments, longer ones increased the uniformity in one platform (Twist) but reduced it in two others (IDT and Roche), suggesting DNA fragment extension had a platform-specific effect. However, the medium and long DNA fragments achieved higher base calling values in all platforms (96.58–98.23%) compared to the short DNA fragments (95.52–97.58%).

Table 2 The 80X mapped dataset.

Full size table

Given the evident influence of DNA fragment extension on both enrichment uniformity and genotypability, we evaluated the single and combined effects of DNA fragment sizes and enrichment uniformity on base calling at different coverage levels by producing downsampled BAM files (with an average deduplicated X-fold coverage of 10–80) on the corresponding target designs. To assess how enrichment uniformity influenced genotypability at different coverage levels, we compared the two platforms with the best (Twist) and worst (Agilent) enrichment uniformity for the medium DNA fragments (Table 3). This revealed that the highest uniformity at 80X coverage corresponded to the best genotypability both at the standard read depth (PASS, 96.58%) and the minimum read depth of 10 (PASS10, 96.51%). The platform with the best uniformity (Twist) achieved PASS saturation at 60X coverage (96.57%), whereas the Agilent platform did not reach saturation and showed a maximum PASS value of 96.31% at 80X coverage. PASS10 values are more relevant in a clinical context, and equivalent values could be obtained for the two platforms at very different coverage levels: Twist-M = 95.77% at 40X coverage, and Agilent-M = 95.72% at 80X coverage.

Table 3 Downsampled mapped coverage for the platforms showing the best and worst FOLD 80 values at fixed DNA fragment lengths.

Full size table

Next we assessed the influence of the DNA fragment size on genotypability, focusing on the platform showing the most variable enrichment uniformity. IDT showed a sharp decrease in coverage uniformity (a greater increase in the FOLD 80 penalty) when fragment size was longer than recommended by the manufacturer, jumping from 1.6 (IDT-S) to 2.28 (IDT-L) as shown in Table 4. With regard to coverage levels, IDT-L produced a greater number of over-represented regions (higher %30X values) than IDT-S at 10X and 20X mapped coverage, whereas the proportion of the target region covered by 10 or more reads at higher coverage levels (40X to 80X) was much lower for IDT-L, indicating an uneven enrichment of longer fragments. In terms of genotypability, we observed overall better PASS values with longer DNA fragments even at coverages as low as 40×. However, the beneficial effect of longer fragments on genotypability did not overcome the negative effect on enrichment uniformity, given that IDT-L did not achieve higher PASS10 values than IDT-S.

Table 4 Downsampled mapped coverage for the platform showing the highest variation of FOLD 80 values using different DNA fragment lengths.

Full size table

Finally, we evaluated the combined effect of DNA fragment extension and improved enrichment uniformity on genotypability in the Twist platform, which demonstrated the most beneficial effect of longer DNA fragment sizes on the uniformity of enrichment. The comparison of Twist-S and Twist-M (Table 5) showed that genotypability improved by more than 1% when both the DNA fragment size and the enrichment uniformity increased, and this was the case for both PASS (96.58%) and PASS10 (96.51%) values. DNA fragment extension did not affect PASS saturation, which was reached at 60X for both Twist-M and Twist-S, but resulted in ~1% more genotyped bases at 80X coverage (96.58% and 95.52%, respectively). The improvement was even higher for the PASS10 values (Twist-M = 96.51%, Twist-S = 95.36%). Therefore, greater enrichment uniformity and medium-length DNA fragments produce a synergistic effect in terms of better genotypability of the target region, especially for clinically-relevant thresholds (PASS10).

Table 5 Downsampled mapped coverage for the platform showing the most beneficial effect of longer DNA fragments on FOLD 80 values.

Full size table

Genotypability of RefSeq and OMIM genes

We also evaluated the genotypability of RefSeq genes using the downsampled BAM files at 80X mapped coverage on the target designs, focusing on the influence of DNA fragment size on the genotypability of each gene. The resulting dataset (Table 6) showed similar trends to those described above. The medium and long DNA fragments achieved higher genotypability (96.54–97.14%) in all platforms compared to the short fragments (95.72–96.28%).

Table 6 The 80X mapped dataset (RefSeq genes).

Full size table

We then calculated the number of genes that could reach (i) 100% genotypability and (ii) any increase in genotypability as a consequence of the increase of the DNA fragment size. The medium-size DNA fragments performed best in three of the four platforms, with long libraries performing best in the Roche platform. There was a difference of 1656 genes between the best (Twist-M) and worst (Agilent-S) performing platforms (Table 7). In the first calculation, the platform with the best enrichment uniformity (Twist) reached 100% genotypability for 1107 genes by extending the DNA fragment length (Table 8), including 100 genes that improved by more than 25% in both the short-to-medium and short-to-long fragment extensions (Supplementary Fig. S1A). In the second calculation, 1993 genes showed an increase in genotypability (Table 8) due to DNA fragment extension (short-to-medium or short-to-long) including almost 200 genes that improved by more than 30% (Supplementary Fig. S1A). Overall, more than 800 RefSeq genes reached 100% genotypability in all four platforms and more than 1800 genes showed some increase in genotypability (Table 8 and Fig. 1). Only for a minimal number of the genes which could reach 100% genotypability with the short-size DNA fragments in each platform (99–312), the extension of the DNA fragment length caused a decrease in genotypability (Fig. 1).

Table 7 Number of RefSeq genes reaching 100% genotypability.

Full size table

Table 8 Number of genes showing increased genotypability.

Full size table

The 3873 OMIM genes associated with a clinical phenotype were analyzed as above to determine the improvement in genotypability achieved with longer DNA fragments (Table 8, Fig. 1 and Supplementary Fig. S1B). More than 150 OMIM genes reached 100% genotypability in all four platforms, and more than 280 showed some increase in genotypability. The top 20 OMIM genes ranked by improvement in genotypability as a consequence of extended DNA fragment size showed that, at equal coverage levels, the genotypability of the target region increased with DNA fragment length, from 18% to 52% (Table 9). As seen for the RefSeq dataset, a subset of the OMIM genes which could reach 100% genotypability with the short-size DNA fragments (20–80) decreased in genotypability extending the DNA fragment length (Fig. 1).

Table 9 Top 20 OMIM genes showing the best improvement in genotypability.

Full size table

Finally, for each sample we determined the number of variants present in the Twist target design, the platform showing the most beneficial effect of longer DNA fragments on the uniformity of enrichment. We observed an aggregate mean increase of >1% in both the short-to-medium and short-to-long fragment extensions (Table 10). The same >1% increase with longer DNA fragments was observed for the number of variants identified in the RefSeq and OMIM genes included in the target design. These results reflect the 1% increase in genotypability achieved by increasing the length of the DNA fragments.

Table 10 Variants in the Twist target design.

Full size table

Zooming into the 200–400 bp window

In order to better define the DNA fragment size that determines the highest genotypability, we selected 27 samples from our internal database of processed individuals showing DNA fragment lengths of 200, 230, 260, 270, 280, 290, 340, 360 and 400 bp. Each DNA fragment size was represented by three different individuals. All these libraries, processed with Twist kit, were sequenced in the 2 × 150 bp format, except for the 200 bp library, sequenced in the 2 × 75 bp format.

The initial dataset was normalized to a 200 theoretical X-fold coverage on the Twist target design (Supplementary Table S2) and then downsampled to an average deduplicated X-fold coverage of 80 (Table 11). It is interesting to observe the improvement of enrichment uniformity when the DNA fragment size increased from 230 to 260 bp, with the FOLD 80 dropping from 1.58 to 1.34. This improvement was maintained up to 340 bp and then slightly reduced at higher DNA fragment sizes, as also confirmed by the % of bases covered 30X (%30X) that increased from 94–95% at 200–230 bp to 99% at 260–290 and then slowly decreased at higher DNA fragment sizes. Genotypability followed the same trend, with an evident increase of PASS and PASS10 values between the 230 and 260 bp length (from 95.55% to 96.37% for PASS, and from 95.38% to 96.33% for PASS10, respectively). Moreover, at stationary values of FOLD 80 (1.34–1.37 from 260 to 340 bp), genotypability could still increase as a result of the solely DNA fragment extension: from 96.37% to 96.54% for PASS and from 96.33% to 96.49% for PASS10. It is worth noting how the improvement of enrichment uniformity reduced the gap between PASS and PASS10 values, thus improving variant calling in the clinical setting.

Table 11 The 80X mapped dataset.

Full size table

Discussion

Depth of coverage is the parameter used most often to evaluate the performance of WES enrichment technologies, which are applied during the NGS of selected target regions in the genome⁶. However, the genotypability (base calling) of the target provides more comprehensive information, taking into account not only the depth of coverage, but also the quality of the read alignments. We evaluated changes in the genotypability of target regions caused by increasing the DNA fragment size beyond the typical length of the average exon (aimed to reduce the near-target rate) to improve the alignment of reads derived from repetitious genomic regions.

We found that longer DNA inserts increased the mapping quality of reads and thus the mappability of the target region in all four enrichment platforms, suggesting that improvements in base calling can be achieved in these platforms by introducing a measure which is more informative than the standard depth of coverage value. This was particularly evident when evaluating the genotypability of the coding sequence of RefSeq genes at fixed coverage levels (80X mapped). We observed substantial improvements in base calling for many genes, including those of clinical interest in the OMIM dataset. For example, the genotypability of genes RPL15 and RPS26 (associated with the bone marrow disorder Diamond-Blackfan anemia – OMIM 615550,613309 –) improved to 100%, from 49.98% and 47.13%, respectively. Similarly, the genotypability of RPSA (associated with the immunodeficiency disease Isolated congenital asplenia – OMIM 271400 –) improved from 63.29% to 100%, and that of the tumor suppressor gene PTEN improved from 78.55% to 100% (Table 9 and Supplementary Table S3). Oddly, a few genes showed a slight decrease in genotypability from the medium to the long fragment size, probably due to the capture probes’ placement which requires optimization for longer fragments (Supplementary Fig. S3). In most cases, longer DNA fragments overcame some of the challenges posed by repetitious genome segments, as recognized by the American College of Medical Genetics and Genomics (ACMG) in their guidelines, which recommend the development of “a strategy for detecting pathogenic variants within regions with known homology”²⁵. Expensive long-read sequencing solutions could be adopted for genes that cannot be characterized by short-read sequencing¹⁸, but such methods have yet to be implemented in diagnostic laboratories²⁶. Therefore, our new approach offers an alternative solution for the analysis of genes whose read mapping quality is low, although putative pathogenic variants may be present, especially in genes with the highest medical relevance²⁷.

The number of variants identified among the RefSeq and OMIM genes included in the Twist design showed that it is possible to improve variant calling by increasing the DNA fragment length (Table 10). The 1% increase in genotypability reported above corresponded to an increase of 1% in variant calling, leading to the identification of variants in regions previously considered uncallable because of low mapping quality (Fig. 2). The greater number of variants confirms that genotypability is a better parameter for the assessment of WES and that greater mappability corresponds to a higher number of detected variants in repetitious genome regions. Such findings could be clinically significant, especially when analyzing affected patients, and hence they should be included in subsequent variant annotations to prioritize their characterization and assessment.

Longer DNA inserts do not always improve the uniformity of coverage in the target region, as previously reported¹², and this is another important parameter for the evaluation of enrichment efficiency during targeted NGS. Indeed, the four platforms responded differently in terms of enrichment uniformity: whereas the Roche and Agilent platforms were largely insensitive to the extension of DNA fragment length, the IDT platform showed a dramatic increase in the FOLD 80 penalty (corresponding to low enrichment uniformity), indicating that it has already been optimized for very short fragments. Interestingly, the opposite trend was apparent in the Twist platform, indicating that the already highly uniform enrichment can be improved even further. It would be interesting to determine whether this reflects the internal calibration procedure of the Twist platform or a favorable effect of the double-stranded capture probes. Generally, with high FOLD 80 penalty scores, the genotypability of the target region was directly related to the mapped coverage (higher coverage = higher genotypability), whereas higher enrichment uniformity resulted in the genotypability reaching a plateau at ~60X coverage, making deeper coverage unnecessary.

The advantages of higher enrichment uniformity include the reduction of WES costs and the need for less starting material, given that reducing the depth of sequencing also reduces the number of duplicates. Moreover, increasing the DNA fragment length also helps to overcome the problem of duplicates because for short DNA inserts the use of 2 × 75 bp reads requires the sequencing of twice as many fragments from the same amount of DNA compared to 2 × 150 bp reads. In contrast, the fold enrichment value (another important measure of enrichment efficiency based on the on-target and off-target rates) was often misleading for two reasons. First, fold enrichment depends on the definition of the target region, which in some cases is delineated by the exon boundaries but in others corresponds to a much broader area (Fig. 3). Second, fold enrichment does not provide comprehensive information about the real efficiency of the enrichment platforms, given that lower values did not correlate with a reduction in genotypability (Tables 2 and 6).

Taken together, our data show that WES performance should be based on the genotypability of the target region, which strictly depends on a combination of the DNA fragment size and the uniformity of the enrichment platform. This parameter will help clinicians to select the optimal combination of DNA insert length and enrichment platform during the design of the target region, allowing the correct interpretation of truly positive, but especially truly negative, findings. Our new approach will help to overcome current challenges caused by the presence of repetitious regions in the human genome.

Methods

Library preparation and exome capture

Genomic DNA for NA12891 and NA12892 was purchased from the Coriell Institute for Medical Research. Sample VR00 was obtained from the whole blood of an unrelated third individual, who signed an informed consent form. Samples and clinical information were made de-identified immediately after collection. All the investigations have been conducted according to the principles expressed in the Declaration of Helsinki. The analysis performed on VR00 has been approved by the “Comitato Etico per la Sperimentazione Clinica (CESC) delle province di Verona e Rovigo” ethic board. Samples NA12891, NA12892 and VR00 were processed using four different enrichment platforms: xGen Exome Research Panel V1 (IDT), SeqCap EZ MedExome (Roche), SureSelect Human All Exon V6 (Agilent), and the Human Core Exome Kit + RefSeq V1 (Twist). We produced three different DNA fragment lengths for each sample: short fragments based on the manufacturers’ recommendations (IDT = 150 bp, Roche, Agilent and Twist = 200 bp), medium fragments (expected length ~350 bp), and long fragments (expected length ~500 bp). The 27 samples provided by our internal database were processed using the Human Core Exome Kit + RefSeq V1 (Twist).

Libraries were prepared according to the manufacturers’ protocols. NA12891, NA12892 and VR00 samples (IDT = 100 ng, Roche = 500 ng, Agilent = 1500 ng, and Twist = 50 ng) apart from Twist-200 bp, which were sheared enzymatically, were sheared using a Covaris M220 ultrasonicator, adjusting the treatment time to obtain the desired DNA fragment length (Supplementary Table S4). Given the low quantity of starting material for the Twist platform, the preparation of long DNA fragments was carried out twice for each replicate and the samples were combined before size selection to produce enough DNA to ensure sufficient library complexity. The size selection was performed with Agencourt AMPure XP (Beckman Coulter) before the pre-capture PCR. The DNA fragments and libraries were characterized using a Labchip GX Touch HS Kit (Perkin Elmer) or TapeStation (Agilent) to determine the size distribution and to check for adapter contamination. The 27 DNA samples selected from our internal database were sheared enzymatically and processed adjusting the Twist protocol (Supplementary Table S5).

For samples NA12891, NA12892 and VR00, exome capture was performed independently for each combination of DNA fragment size and enrichment platform. Generally we followed the manufacturers’ protocols, but exceptions were made for the Agilent platform (single sample capture was performed) and for the preparation of long DNA fragments (the number of PCR cycles was increased by two for all platforms except Twist).

Sequencing and bioinformatics

The samples were sequenced on a HiSeq3000 instrument (Illumina) in 75 bp paired-end mode for the short libraries and in 150 bp paired-end mode for all the other libraries. An in-house bioinformatics pipeline was developed for data analysis, integrating different software as described below.

Initial FASTQ files were quality controlled using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Low quality nucleotides have been trimmed using sickle v1.33 (https://github.com/najoshi/sickle) and adaptors were removed using scythe v0.991 (https://github.com/vsbuffalo/scythe). Reads were then aligned to the reference human genome sequence (GRCh38/hg38) using BWA-MEM v0.7.15 (https://arxiv.org/abs/1303.3997). The SAM output file was converted into a sorted BAM file using SAMtools, and the BAM files were processed by local realignment around insertion–deletion sites, duplicate marking and recalibration using Genome Analysis Toolkit v3.8¹⁶. Overlapping regions of the BAM file were clipped using BamUtil v1.4.14 to avoid counting multiple reads representing the same fragment. Insert sizes were calculated after read alignment, measuring the distance of the two mates mapped on the genome using CollectInsertSize by Picard v2.17.10 (http://broadinstitute.github.io/picard/).

For samples NA12891, NA12892 and VR00, from the initial dataset representing each sample we produced downsampled BAM files with a 140 theoretical X-fold coverage on the target design, subsampling the required number of fragments (calculated as: (140 * design length)/(read length * 2)) using seqtk (https://github.com/lh3/seqtk). We then produced downsampled BAM files with a 10–80X-fold mapped coverage (the maximum mapped coverage value obtained by all the platforms, generated by sub-sampling the full dataset using sambamba v0.6.7 – https://github.com/biod/sambamba –). The 27 individuals from our internal database were selected on the basis of their average DNA fragment length (200, 230, 260, 270, 280, 290, 340, 360 and 400 bp), considering 3 individuals for each size. We produced downsampled BAM files with a 200 theoretical X-fold coverage on the target design and with a 80X-fold mapped coverage (the maximum theoretical and mapped coverage values obtained by all the samples).

We then used CallableLoci in GATK v3.8 to identify callable regions of the target (genotypability), with minimum read depths of 3 and 10. These values were integrated as additional WES performance parameters for the evaluation of variant detection. CollectHsMetrics by Picard v2.17.10 was used to calculate fold enrichment and FOLD 80 penalty values to determine enrichment quality. All WES performance parameters were calculated both on the design of each platform and on the standard dataset of RefSeq genes. For each sample, near target was defined as the distance from the region of interest corresponding to the average length of the DNA fragments. Variant calling was performed using GATK v4.1.2.

Datasets

The RefSeq database (release 82) was downloaded from the UCSC Genome Table Browser (http://genome.ucsc.edu/). Online Mendelian Inheritance in Man (OMIM) genes associated with a clinical phenotype were downloaded from the OMIM website (https://www.omim.org/, release 15-05-2018).

Data availability

Deposited data generated during the current study are available for download at our public repository at the link: http://ddlab.sci.univr.it/files/iadarola_et_al/iadarola_et_al.tar.gz (VCF files with associated BED files of callable regions) and at the Sequence Read Archive (SRA) repository under study accession number: SRP253353 (FASTQ files).

References

Rabbani, B., Tekin, M. & Mahdieh, N. The promise of whole-exome sequencing in medical genetics. J. Hum. Genet. 59, 5–15 (2014).
Article CAS Google Scholar
Sun, Y. et al. Next-Generation Diagnostics: Gene Panel, Exome, or Whole Genome? Hum. Mutat. 36, 648–655 (2015).
Article CAS Google Scholar
Metzker, M. L. Sequencing technologies the next generation. Nat. Rev. Genet. 11, 31–46 (2010).
Article CAS Google Scholar
Ku, C. S., Cooper, D. N. & Patrinos, G. P. The Rise and Rise of Exome Sequencing. Public Health Genomics 19, 315–324 (2017).
Article Google Scholar
Shigemizu, D. et al. Performance comparison of four commercial human whole-exome capture platforms. Sci. Rep. 5, 1–8 (2015).
Article Google Scholar
Sims, D., Sudbery, I., Ilott, N. E., Heger, A. & Ponting, C. P. Sequencing depth and coverage: Key considerations in genomic analyses. Nat. Rev. Genet. 15, 121–132 (2014).
Article CAS Google Scholar
Clark, M. J. et al. Performance comparison of exome DNA sequencing technologies. Nat. Biotechnol. 29, 908–916 (2011).
Article CAS Google Scholar
García-García, G. et al. Assessment of the latest NGS enrichment capture methods in clinical context. Sci. Rep. 6, 1–8 (2016).
Article Google Scholar
Bodi, K. et al. Comparison of commercially available target enrichment methods for next-generation sequencing. J. Biomol. Tech. 24, 73–86 (2013).
Article CAS Google Scholar
Mertes, F. et al. Targeted enrichment of genomic DNA regions for next-generation sequencing. Brief. Funct. Genomics 10, 374–386 (2011).
Article CAS Google Scholar
Meienberg, J. et al. New insights into the performance of human whole-exome capture platforms. Nucleic Acids Res. 43 (2015).
Pommerenke, C. et al. Enhanced whole exome sequencing by higher DNA insert lengths. BMC Genomics 17, 1–8 (2016).
Article Google Scholar
Choi, M. et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc. Natl. Acad. Sci. USA 106, 19096–19101 (2009).
Article ADS CAS Google Scholar
Wang, Q., Shashikant, C. S., Jensen, M., Altman, N. S. & Girirajan, S. Novel metrics to measure coverage in whole exome sequencing datasets reveal local and global non-uniformity. Sci. Rep. 7, 1–11 (2017).
Article Google Scholar
Ng, S. B. et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461, 272–276 (2009).
Article ADS CAS Google Scholar
Van der Auwera, G. A. et al. From fastQ data to high-confidence variant calls: The genome analysis toolkit best practices pipeline. Current Protocols in Bioinformatics, https://doi.org/10.1002/0471250953.bi1110s43 (2013).
Ferrarini, A. et al. The use of non-variant sites to improve the clinical assessment of whole-genome sequence data. PLoS One 10, 1–15 (2015).
Article Google Scholar
Ebbert, M. T. W. et al. Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight. Genome Biol. 20, 1–23 (2019).
Article Google Scholar
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Article CAS Google Scholar
Ballester, L. Y., Luthra, R., Kanagal-Shamanna, R. & Singh, R. R. Advances in clinical next-generation sequencing: Target enrichment and sequencing technologies. Expert Rev. Mol. Diagn. 16, 357–372 (2016).
Article CAS Google Scholar
Sakharkar, M. K., Chow, V. T. K. & Kangueane, P. Distributions of exons and introns in the human genome. In Silico Biol. 4, 387–393 (2004).
CAS PubMed Google Scholar
Gudlaugsdottir, S., Boswell, D. R., Wood, G. R. & Ma, J. Exon size distribution and the origin of introns. Genetica 131, 299–306 (2007).
Article CAS Google Scholar
Head, S. R. et al. Library construction for next-generation sequencing: Overviews and challenges. Biotechniques 56, 61–77 (2014).
Article CAS Google Scholar
Ebbert, M. T. W. et al. Evaluating the necessity of PCR duplicate removal from next-generation sequencing data and a comparison of approaches. BMC Bioinformatics 17, (2016).
Rehm, H. L. et al. ACMG clinical laboratory standards for next-generation sequencing. Genet. Med. 15, 733–747 (2013).
Article Google Scholar
Mandelker, D. et al. Navigating highly homologous genes in a molecular diagnostic setting: A resource for clinical next-generation sequencing. Genet. Med. 18, 1282–1289 (2016).
Article CAS Google Scholar
Kalia, S. S. et al. Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2016 update (ACMG SF v2.0): A policy statement of the American College of Medical Genetics and Genomics. Genet. Med. 19, 249–255 (2017).
Article Google Scholar

Download references

Acknowledgements

This study was supported by the European Commission Horizon2020 (Marie Curie-Skłodowska Innovative Training Network “PANINI”, Grant agreement 675003. We gratefully acknowledge the Centro Piattaforme Tecnologiche (CPT) for granting access to the genomic facility of University of Verona, and MSc student Andrei Florea for contributing to the bioinformatics pipeline implemented for sequence data analysis.

Author information

These authors contributed equally: Barbara Iadarola and Luciano Xumerle.

Authors and Affiliations

Department of Biotechnology, University of Verona, Strada Le Grazie 15, 37134, Verona, Italy
Barbara Iadarola, Luciano Xumerle, Denise Lavezzari, Marta Paterno, Luca Marcolungo, Cristina Beltrami, Elisabetta Fortunati, Marzia Rossato & Massimo Delledonne
Personal Genomics s.r.l, Via Roveggia 43B, 37136, Verona, Italy
Barbara Iadarola & Luciano Xumerle
Pediatric Neurology, Neurogenetics and Neurobiology Unit and Laboratories, Department of Neuroscience, A. Meyer Children’s Hospital, University of Florence, viale Pieraccini 24, 50139, Florence, Italy
Davide Mei, Annalisa Vetro, Renzo Guerrini & Elena Parrini

Authors

Barbara Iadarola
View author publications
You can also search for this author in PubMed Google Scholar
Luciano Xumerle
View author publications
You can also search for this author in PubMed Google Scholar
Denise Lavezzari
View author publications
You can also search for this author in PubMed Google Scholar
Marta Paterno
View author publications
You can also search for this author in PubMed Google Scholar
Luca Marcolungo
View author publications
You can also search for this author in PubMed Google Scholar
Cristina Beltrami
View author publications
You can also search for this author in PubMed Google Scholar
Elisabetta Fortunati
View author publications
You can also search for this author in PubMed Google Scholar
Davide Mei
View author publications
You can also search for this author in PubMed Google Scholar
Annalisa Vetro
View author publications
You can also search for this author in PubMed Google Scholar
Renzo Guerrini
View author publications
You can also search for this author in PubMed Google Scholar
Elena Parrini
View author publications
You can also search for this author in PubMed Google Scholar
Marzia Rossato
View author publications
You can also search for this author in PubMed Google Scholar
Massimo Delledonne
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Conceptualization, M.D.; methodology, B.I., L.X., L.M., M.P., C.B. and M.D.; resources, E.F., E.P. and R.G.; software, B.I., L.X., D.L. and L.M.; validation, A.V. and D.M.; formal analysis, B.I., L.X. and D.L; investigation, M.P., C.B. and E.F.; writing—original draft, B.I. and M.D.; writing—review and editing, L.X., D.L., L.M. and M.D.; supervision, M.R. and M.D.; funding acquisition, M.R. and M.D.

Corresponding author

Correspondence to Massimo Delledonne.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Iadarola, B., Xumerle, L., Lavezzari, D. et al. Shedding light on dark genes: enhanced targeted resequencing by optimizing the combination of enrichment technology and DNA fragment length. Sci Rep 10, 9424 (2020). https://doi.org/10.1038/s41598-020-66331-z

Download citation

Received: 13 December 2019
Accepted: 23 April 2020
Published: 10 June 2020
DOI: https://doi.org/10.1038/s41598-020-66331-z

This article is cited by

Twist exome capture allows for lower average sequence coverage in clinical exome sequencing
- Burcu Yaldiz
- Erdi Kucuk
- Patrick May
Human Genomics (2023)
‘Nebbiolo’ genome assembly allows surveying the occurrence and functional implications of genomic structural variations in grapevines (Vitis vinifera L.)
- Simone Maestri
- Giorgio Gambino
- Luciano Calderón
BMC Genomics (2022)
Performance comparison of four types of target enrichment baits for exome DNA sequencing
- Juan Zhou
- Mancang Zhang
- Yongyong Shi
Hereditas (2021)
Whole-exome sequencing of the mummified remains of Cangrande della Scala (1291–1329 CE) indicates the first known case of late-onset Pompe disease
- Barbara Iadarola
- Denise Lavezzari
- Massimo Delledonne
Scientific Reports (2021)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Introduction

Results

Influence of DNA fragment size on duplicate and off-target rates

Influence of DNA fragment size and enrichment uniformity on genotypability

Genotypability of RefSeq and OMIM genes

Zooming into the 200–400 bp window

Discussion

Methods

Library preparation and exome capture

Sequencing and bioinformatics

Datasets

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Comments

Search

Quick links