Introduction

Structural variants (SVs) are one of the major classes of DNA mutation and are typically defined as genomic alterations larger than 50 bp, including insertions, deletions, duplications, inversions, and translocations1,2. Copy number variations (CNVs) are a subset of SVs, such as insertions, duplications, and deletions, that add or remove genetic material and can therefore directly affect gene products. In humans, SVs account for most of the nucleotide differences between individuals3,4. SVs have a significant influence on genome architecture and are linked to several conditions, including inherited diseases5, neurological disorders6, and cancer, as well as to evolution and gene regulation7. Understanding the genomic architecture and genetic elements associated with various disorders requires a thorough understanding of SVs and their functional implications. Single nucleotide variants (SNVs), by contrast, were long believed to account for the bulk of genomic variation in humans8,9. Despite their importance, SVs have received far less attention than SNVs, particularly in low-complexity regions recognized as SV hotspots4,10. Indeed, it has been demonstrated that repeats cause ambiguities in short-read mapping, introducing errors when calling chromosomal or DNA variants11,12.

In recent years, DNA sequencing has emerged as one of the primary methods for identifying SVs1,10,12. Before that, array comparative genomic hybridization (aCGH) was also used to detect structural variants. aCGH relies on microarray technology, where probes are designed to cover the entire genome, and unbalanced SVs can be detected by measuring the relative copy number differences between two samples. Since 2005, next-generation sequencing (NGS) has been widely used in genomic research10,12. More recently, third-generation sequencing technologies have enabled the generation of significantly longer reads, propelling advances in variant calling and genome assembly13,14,15. In particular, Pacific Biosciences (PacBio) long reads and Oxford Nanopore Technologies (ONT) reads have demonstrated their value in resolving previously intractable DNA sequences15,16,17.

The long-range spanning information allows for more comprehensive detection of SVs at higher resolution10. Numerous short-read-based SV calling approaches have nevertheless been established18,19. Most of them employ discordant read-pairs20, local assembly21, split-read alignments22, read depth18,23, or combinations of these methods24,25, and they have been applied in large-scale genomics studies26. However, because these tools were designed for short reads, their ability to detect SVs efficiently is limited, leading to many false positive results27,28. Two broad strategies exist for structural variant calling: de novo assembly-based and read alignment-based2,29. Assembly-based approaches are much more computationally expensive than alignment-based approaches and face several problems in reconstructing large genome haplotypes19,27. Several long-read alignment-based SV callers for reads generated from PacBio and ONT have been proposed, including SMRT-SV https://github.com/EichlerLab/smrtsv2 (accessed on 3 September 2023), PBSV, SVIM, Sniffles, and CuteSV, as well as newly developed SV calling tools such as SVDSS and SVcnn; they employ various analysis methods to detect SVs29,30. Furthermore, cutting-edge long-read aligners such as the long-read aligner (LRA)31, NGMLR, Minimap2, and Pbmm2 are typically used for read alignment. Because ONT sequencers were only recently released, SV detection on this platform has not yet been deeply established, leaving room for improvement; our findings provide an up-to-date estimate of the variation content of a human genome, serve as a valuable resource of SVs for other studies, and emphasize the importance of employing multiple strategies for SV discovery.

Therefore, in this study, five common SV callers (CuteSV, Sniffles, PBSV, SVDSS, and SVIM) and four common long-read aligners (Minimap2, LRA, NGMLR, and pbmm2) were evaluated using the human reference datasets HG002 (NA24385) and HG001 (NA12878) and a simulated dataset, SI00001, to enable accurate evaluation of the SV callers' output. The five SV callers were assessed in terms of the precision, recall, and F1-score statistical metrics, and the four aligners were assessed according to depth of coverage, speed of analysis, and efficiency of detection and data assessment.

Materials and methods

The selection of the validation datasets for SV calling

For benchmarking the existing structural variant calling methods, it is preferable to use multiple datasets; accordingly, three datasets were used in this evaluation workflow. The first dataset was a real ONT dataset, in FASTQ format, sequenced on a PromethION and released by the Genome in a Bottle (GIAB) consortium for the NA24385 Ashkenazim individual (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/guppy-V3.4.5/) (accessed on 3 September 2023). The GIAB consortium also created benchmark SV calls and benchmark regions (https://ftp.ncbi.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz) (accessed on 3 September 2023). This "Truth set" is considered a resource of highly curated, high-quality variants, was published for the research community, and was released in hg19 coordinates. The second dataset was a real ONT dataset, in FASTQ format, sequenced on a MinION using a 1D ligation kit and obtained from the Nanopore repository (https://github.com/nanopore-wgs-consortium/NA12878/blob/master/nanopore-human-genome/rel34.md) (accessed on 3 September 2023). The SV truth set for this dataset was generated by the Genome in a Bottle Consortium using the Pacific Biosciences (PacBio) platform and was used in this manuscript as the corresponding SV truth set for the NA12878 dataset. The analysis only included SV calls with a "PASS" flag in the "FILTER" field (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NA12878_PacBio_MtSinai/NA12878.sorted.vcf.gz).

The last dataset was a synthetic ONT dataset, referred to as SI00001, generated with the SV simulator VarIant SimulatOR (VISOR) (https://github.com/davidebolo1993/VISOR) (accessed on 3 September 2023), following the simulation instructions for generating ONT long reads, and was simulated to 50X coverage32. VISOR was also used to generate an SV truth set of variants harboring deletions, insertions, duplications, and translocations. Only calls with PASS in the FILTER field and an SV length >= 50 bp were reported.

Read mapping and structural variant calling for datasets

Reads from the three datasets were aligned to the public human genome build GRCh37/UCSC hg19 using four long-read aligners: "Minimap2"33 (v2.26), "NGMLR"34 (v0.2.7), "LRA"31 (v1.3.7.2), and "pbmm2" https://github.com/PacificBiosciences/pbmm2 (v1.7.0) (Table 1). The reads were aligned to this earlier version of the human reference genome because the "Benchmark set" for NA12878 and the "Truth set" for NA24385, later used as the benchmark references for this evaluation, were based on hg19; likewise, the SV benchmark set simulated with VISOR was generated against the hg19 build, so a single reference build was used throughout. After alignment, the resulting Sequence Alignment Map (SAM) file was converted to Binary Alignment Map (BAM) format, then sorted and indexed with Samtools35 to prepare it for variant calling. Mosdepth was used to calculate coverage after sorting and indexing the generated alignments36.
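As an illustration of this mapping step, the sketch below aligns one ONT FASTQ with Minimap2, sorts and indexes the BAM with Samtools, and computes coverage with Mosdepth; the file names, thread counts, and output prefix are placeholders rather than the exact commands used in this study, and the equivalent NGMLR, LRA, and pbmm2 invocations follow their respective documentation.

```python
import subprocess

ref, reads, threads = "hg19.fa", "HG002.fastq.gz", "32"   # placeholder inputs

# Map ONT reads (map-ont preset); --MD adds the MD tag used by some SV callers.
# The SAM stream is sorted straight into a BAM to avoid an intermediate file.
subprocess.run(
    f"minimap2 -t {threads} --MD -ax map-ont {ref} {reads} | "
    f"samtools sort -@ {threads} -o HG002.minimap2.sorted.bam -",
    shell=True, check=True,
)

# Index the sorted BAM so it is ready for the SV callers.
subprocess.run(["samtools", "index", "HG002.minimap2.sorted.bam"], check=True)

# Estimate depth of coverage with Mosdepth (-n skips per-base output).
subprocess.run(
    ["mosdepth", "-t", "4", "-n", "HG002_minimap2", "HG002.minimap2.sorted.bam"],
    check=True,
)
```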

Table 1 Summary of the tools used for SV calling, annotation, and benchmarking.

In addition, the impact of different coverage depths on the SV callers' ability to identify both genomic breakpoints and genotypes was investigated. Samtools was used to down-sample the BAM files to 30X, 20X, and 10X coverage for a more thorough evaluation of the SV callers, as sketched below. To perform the evaluation, four SV callers were tested in parallel on each dataset to provide insight into their performance (Table 1): (1) CuteSV (v1.0.10), (2) SVIM (v2.0.0), (3) PBSV (v2.3.0), and (4) Sniffles2 (v2.0.7). Sniffles2 is fully integrated into a Nextflow-based workflow, "epi2me-labs/wf-human-variation", provided by Nanopore, which uses Minimap2 as its aligner (https://github.com/epi2me-labs/wf-human-variation) (accessed on 3 September 2023). Two newly developed SV callers, SVDSS37 and SVcnn38, were also included to explore their potential as candidates for state-of-the-art SV calling (Table 1).
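The down-sampling can be reproduced with `samtools view -s`, whose subsampling fraction is the ratio of the target coverage to the observed coverage reported by Mosdepth. The sketch below is a minimal illustration with placeholder file names and an assumed observed coverage of 45X; the seed digits before the decimal point keep the subsets reproducible.

```python
import subprocess

observed_coverage = 45.0                 # e.g. Mosdepth estimate for the full BAM
seed = 42                                # fixed seed for reproducible subsampling
in_bam = "HG002.minimap2.sorted.bam"     # placeholder input BAM

for target in (30, 20, 10):
    fraction = target / observed_coverage          # proportion of reads to keep
    out_bam = f"HG002.minimap2.{target}x.bam"
    # samtools view -s takes <seed>.<fraction>, e.g. 42.667 keeps ~66.7% of reads.
    subprocess.run(
        ["samtools", "view", "-@", "8", "-b", "-s", f"{seed + fraction:.3f}",
         "-o", out_bam, in_bam],
        check=True,
    )
    subprocess.run(["samtools", "index", out_bam], check=True)
```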

Enhancing the SV calling accuracy

To enhance SV calling accuracy, a tandem repeat Browser Extensible Data (BED) file corresponding to the hg19 reference (https://raw.githubusercontent.com/PacificBiosciences/pbsv/master/annotations/human_hs37d5.trf.bed) (accessed on 3 September 2023) was downloaded and used during the variant calling process. Although Sniffles, SVIM, CuteSV, and PBSV can detect all kinds of SVs, NpInv was specifically designed to detect inversions accurately. Inversion (INV) detection was not within the scope of the current evaluation, but it was still performed to lay the groundwork for future assessment of SV callers' inversion-calling accuracy.
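As a sketch of how the tandem repeat annotation can be supplied to the callers that accept it, the commands below pass the BED file to Sniffles2 (`--tandem-repeats`) and to the `pbsv discover` step; the file names are placeholders, and the options used for the remaining callers in this study follow their respective documentation.

```python
import subprocess

bam, ref = "HG002.minimap2.30x.bam", "hg19.fa"      # placeholder inputs
trf_bed = "human_hs37d5.trf.bed"                    # tandem repeat BED for hg19/hs37d5

# Sniffles2 consumes the tandem repeat BED directly during calling.
subprocess.run(
    ["sniffles", "--input", bam, "--vcf", "HG002.sniffles.vcf",
     "--reference", ref, "--tandem-repeats", trf_bed],
    check=True,
)

# PBSV uses the same annotation at the signature-discovery step,
# followed by the separate calling step.
subprocess.run(
    ["pbsv", "discover", "--tandem-repeats", trf_bed, bam, "HG002.svsig.gz"],
    check=True,
)
subprocess.run(["pbsv", "call", ref, "HG002.svsig.gz", "HG002.pbsv.vcf"], check=True)
```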

Length-based binning of reported SVs

The SV calls in the VCF generated by each variant caller were divided into seven groups based on SV length: (A) < 50 bp, (B) 50–250 bp, (C) 251–500 bp, (D) 501–750 bp, (E) 751–1000 bp, (F) 1001–5000 bp, and (G) > 5000 bp, to gain insight into the performance of the variant callers across different SV sizes.
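A minimal sketch of this binning, assuming the SV length can be read from the `SVLEN` INFO field (falling back to the REF/ALT allele lengths when it is absent); the input VCF name is a placeholder.

```python
from collections import Counter
import pysam

BINS = [("A: <50 bp", 0, 49), ("B: 50-250 bp", 50, 250), ("C: 251-500 bp", 251, 500),
        ("D: 501-750 bp", 501, 750), ("E: 751-1000 bp", 751, 1000),
        ("F: 1001-5000 bp", 1001, 5000), ("G: >5000 bp", 5001, float("inf"))]

def sv_length(record):
    """Absolute SV length from SVLEN if present, otherwise from the allele lengths."""
    svlen = record.info.get("SVLEN")
    if svlen is not None:
        # SVLEN may be a tuple for multi-allelic records; take the first value.
        value = svlen[0] if isinstance(svlen, (tuple, list)) else svlen
        return abs(int(value))
    return abs(len(record.alts[0]) - len(record.ref))

counts = Counter()
for record in pysam.VariantFile("HG002.cutesv.vcf"):    # placeholder caller output
    length = sv_length(record)
    for label, low, high in BINS:
        if low <= length <= high:
            counts[label] += 1
            break

for label, _, _ in BINS:
    print(f"{label}: {counts[label]}")
```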

Filtering for the SV callset

Several filtering steps were applied to generate comparable datasets. SV calls on unplaced contigs and the mitochondrial genome were filtered out, leaving only insertions, duplications, and deletions in each call set. For comparison, insertion and duplication calls were combined into one category ("insertions"). The SVs were then filtered for length >= 50 bp, and only SV calls with a "PASS" flag in the "FILTER" column were retained for the next step of the analysis. The performance of SV detection tools is challenging to evaluate because there is no standard technique for precisely identifying SVs in the human genome. To address this limitation, the "Truth set"/"Benchmark set" Variant Call Format (VCF) files corresponding to the three datasets from GIAB and VISOR were used. The output VCFs of the five SV callers were then compared against these "Truth set"/"Benchmark set" VCFs in terms of precision, recall, and F1-score using the toolkit "Truvari" (Table 1), to assess how the sequencing settings affect the SVs generated by each tool and how closely each callset matches the truth callset; candidate SVs absent from the truth set were counted as false positives, and truth-set SVs absent from the callset as false negatives.
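The filtering and comparison steps can be sketched as follows: a `bcftools` expression keeps PASS insertions, duplications, and deletions of at least 50 bp, and `truvari bench` then reports precision, recall, and F1-score against the truth set. File names are placeholders, the sketch assumes the caller populates the `SVTYPE` and `SVLEN` INFO fields, and the restriction to chromosomes 1–22, X, and Y plus the merging of duplications into the insertion category are handled analogously.

```python
import subprocess

calls = "HG002.cutesv.vcf.gz"                  # placeholder caller output (bgzipped)
truth = "HG002_SVs_Tier1_v0.6.vcf.gz"          # GIAB Tier 1 truth set

# Keep PASS calls of INS/DUP/DEL with |SVLEN| >= 50 bp.
expr = ('FILTER="PASS" && (INFO/SVTYPE="INS" || INFO/SVTYPE="DUP" || INFO/SVTYPE="DEL") '
        '&& (INFO/SVLEN>=50 || INFO/SVLEN<=-50)')
subprocess.run(
    f"bcftools view -i '{expr}' {calls} -Oz -o HG002.cutesv.filtered.vcf.gz",
    shell=True, check=True,
)
subprocess.run(["tabix", "-p", "vcf", "HG002.cutesv.filtered.vcf.gz"], check=True)

# Benchmark against the truth set; --passonly and --sizemin mirror the filtering,
# and the output directory contains the precision/recall/F1 summary.
subprocess.run(
    ["truvari", "bench", "-b", truth, "-c", "HG002.cutesv.filtered.vcf.gz",
     "-o", "truvari_cutesv", "--passonly", "--sizemin", "50"],
    check=True,
)
```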

Results

Alignment of ONT datasets using long-read aligners and corresponding truth SV call sets

For the NA24385 dataset, the GIAB consortium's ultra-long ONT FASTQ was used for the evaluation process after its retrieval from the NCBI repository. The initial total coverage was 45X and was down-sampled to 30X, 20X, and 10X. The truth callset contains a large number of deletions and insertions of various lengths for this individual on the GRCh37 genome; it comprises 9641 SVs (with FILTER "PASS"), with 5260 insertions and 4381 deletions (Fig. 1). For the NA12878 dataset, the FASTQ file generated by the nanopore whole-genome sequencing consortium was used for the alignment process. The reported and calculated depth of coverage was ~ 30X, and the data were down-sampled to 20X and 10X coverage only. The corresponding SV truth set was created by the Genome in a Bottle Consortium using the Pacific Biosciences (PacBio) platform. The NA12878 benchmark callset contains 10,135 SVs (with FILTER "PASS"), with 5783 insertions and 4352 deletions (Fig. 1). The synthetic ONT dataset SI00001 was simulated using the SV simulator VISOR at 50X coverage. The SV "Benchmark set" for this dataset included 10,676 randomly generated SVs, comprising 5,027 deletions, 5,027 insertions, and 300 inversions, among other structural variant types such as duplications and translocations (Fig. 1). The SI00001 aligned BAM file was down-sampled to 30X, 20X, and 10X coverage. Generally, each aligner performed comparably across the three datasets. In terms of time consumed, Minimap2 was the fastest of the four aligners (8 h), followed by LRA (14 h) and Pbmm2 (15 h), whereas NGMLR was the slowest (59 h). The alignment was done on a machine with 128 GB of RAM and 64 threads. The performance of the four aligners is reported in terms of the time taken to finish the alignment, the CPU time in hours, the wall clock time, and the memory usage in gigabytes (Table 2). The metrics for the BAMs generated by the four aligners were deposited in the GitHub repository (https://github.com/AnkhBioinformatics/SVcallers_Comparisons).

Figure 1

The distribution of deletion (DEL) and insertion (INS) counts for the NA24385 truth set and the NA12878 and SI00001 benchmark sets.

Table 2 Performance and resource consumption of aligners in terms of running time and memory usage.

Evaluation of the different SV callers’ performance in terms of precision, recall, and F-score values for SV calling of the NA24385, NA12878 and simulated SI00001 human genome datasets

The four commonly used long-read SV callers chosen (CuteSV, SVIM, Sniffles, and PBSV) were first tested against the publicly available ultra-long nanopore reads of the NA24385 truth-set sample at varying coverages. In addition to that dataset, the NA12878 and SI00001 datasets were added to strengthen the evaluation of the SV callers' performance. It is worth mentioning that the SVcnn caller was initially considered for this evaluation but was excluded because it was extremely time-consuming (80 h and 27.8 GB memory) and crashed repeatedly.

All SV callers were configured to detect SVs of 50 bp and above, to unify the parameters across callers. When filtering the output VCF generated by each tool, only SVs with "PASS" in the FILTER field and located on chromosomes 1–22, X, or Y were regarded as candidates for evaluation. Calls not matching any true variants were regarded as false positives, whereas truth-set variants not recovered in a callset were counted as false negatives. For combinations of the mentioned aligners and SV callers, we assessed the precision, recall, and F1-score of the detected SVs. Each tool's SV calls were marked "true" or "false" according to whether they matched the corresponding truth/benchmark callset. The output of the comparison process was a report containing the precision, recall, and F1-score of the obtained high-quality SV callsets. This helped us evaluate the quality of the SV calls for each tool as well as the performance of each tool in terms of CPU time in hours, wall clock time, and memory usage in gigabytes, which is presented in Table 3.
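For reference, the three reported metrics follow directly from the true positive (TP), false positive (FP), and false negative (FN) counts produced by the comparison; the helper below is a minimal illustration with made-up counts, not values from this study.

```python
def sv_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1-score from TP/FP/FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # fraction of calls that are true
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # fraction of truth SVs recovered
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Illustrative numbers only:
print(sv_metrics(tp=9000, fp=500, fn=1100))
```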

Table 3 SV callers’ resource consumption and performance in terms of CPU time, wall clock, and memory usage.

The precision, recall, and F-score values for SV calling (Sniffles, SVIM, CuteSV, PBSV, and SVDSS) following Minimap2, LRA, NGMLR, and Pbmm2 alignments at different depths of coverage are displayed in Tables 4, 5 and 6 for the NA12878 (Figs. 2, 3, 4, 5), NA24385 (Figs. 6, 7, 8, 9) and simulated SI00001 (Figs. 10, 11, 12, 13) human genome datasets, respectively. The benchmarking results for the three reference datasets, combined with the four long-read aligners (Minimap2, LRA, pbmm2, and NGMLR) and the structural variant callers, revealed that SV caller performance varies depending on the dataset and the specific SV types being analyzed. They also revealed that the average F1-score increased with sequencing coverage, and that Sniffles and CuteSV tend to perform well across different aligners and coverage levels, followed by SVIM, PBSV, and SVDSS in last place. The CuteSV caller has the highest average F1-score (82.51%) and recall (78.50%) of the five SV callers. CuteSV also scored the second-highest average precision value (78.50%), showing that it can recognize a high percentage of actual SVs while minimizing false positives. The Sniffles caller closely follows the average scores of CuteSV; it has the highest average precision (94.33%) and the second-highest average F1-score (78.88%) and recall (72.47%) of the five SV callers. The Sniffles caller may overlook actual SVs because its average recall is lower than that of the CuteSV caller. In third place, after CuteSV and Sniffles, comes SVIM, which has the third-highest average F1-score (75.02%) and precision (93.52%) among the five SV callers; however, its average recall (68.10%) is lower than that of the CuteSV and Sniffles callers. The SVIM caller may overlook certain SVs but has a low false positive rate. Furthermore, PBSV has a lower average F1-score (73.55%), precision (88.30%), and recall (68.42%) than the top three SV callers, suggesting that it may generate more false positives and overlook some actual SVs. The SVDSS caller's average F1-score (55.49%) and recall (42.28%) are the lowest of the five SV callers, suggesting it may miss many actual SVs, although its high precision (82.33%) indicates few false positives.

Table 4 The precision, recall, and F-score values for SV calling for the NA12878 sample with Sniffles, SVIM, CuteSV, PBSV and SVDSS following alignment with the four evaluated aligners Minimap2, LRA, NGMLR and pbmm2 at different depths of coverage.
Table 5 The precision, recall, and F-score values for SV calling for the NA24385 sample with Sniffles, SVIM, CuteSV, PBSV and SVDSS following alignment with the four evaluated aligners Minimap2, LRA, NGMLR and pbmm2 at different depths of coverage.
Table 6 The precision, recall, and F-score values for SV calling for the SI00001 sample with Sniffles, SVIM, CuteSV, PBSV and SVDSS following alignment with the four evaluated aligners Minimap2, LRA, NGMLR and pbmm2 at different depths of coverage.
Figure 2

The F1-score, precision and recall (Y-axis) of different variant callers (Sniffles, CuteSV, SVIM, PBSV and SVDSS) with the Minimap2 aligner for the NA12878 dataset at different sequencing coverages.

Figure 3

The F1-score, precision and recall (Y-axis) of different variant callers (Sniffles, CuteSV, SVIM, PBSV and SVDSS) with the LRA aligner for the NA12878 dataset at different sequencing coverages.

Figure 4

The F1-score, precision and recall (Y-axis) of different variant callers (Sniffles, CuteSV, SVIM, PBSV and SVDSS) with the NGMLR aligner for the NA12878 dataset at different sequencing coverages.

Figure 5

The F1-score, precision and recall (Y-axis) of different variant callers (Sniffles, CuteSV, SVIM, PBSV and SVDSS) with the Pbmm2 aligner for the NA12878 dataset at different sequencing coverages.

Figure 6

The F1-score, precision and recall (Y-axis) of different variant callers (Sniffles, CuteSV, SVIM, PBSV and SVDSS) with the Minimap2 aligner for the NA24385 dataset at different sequencing coverages.

Figure 7

The F1-score, precision and recall (Y-axis) of different variant callers (Sniffles, CuteSV, SVIM, PBSV and SVDSS) with LRA aligner for the NA24385 dataset at different sequencing coverages.

Figure 8

The F1-score, precision and recall (Y-axis) of different variant callers (Sniffles, CuteSV, SVIM, PBSV and SVDSS) with NGMLR aligner for the NA24385 dataset at different sequencing coverages.

Figure 9

The F1-score, precision and recall (Y-axis) of different variant callers (Sniffles, CuteSV, SVIM, PBSV and SVDSS) with Pbmm2 aligner for the NA24385 dataset at different sequencing coverages.

Figure 10

The F1-score, precision and recall (Y-axis) of different variant callers (Sniffles, CuteSV, SVIM, PBSV and SVDSS) with Minimap2 aligner for the SI00001 dataset at different sequencing coverages.

Figure 11

The F1-score, precision and recall (Y-axis) of different variant callers (Sniffles, CuteSV, SVIM, PBSV and SVDSS) with LRA aligner for the SI00001 dataset at different sequencing coverages.

Figure 12

The F1-score, precision and recall (Y-axis) of different variant callers (Sniffles, CuteSV, SVIM, PBSV and SVDSS) with NGMLR aligner for the SI00001 dataset at different sequencing coverages.

Figure 13

The F1-score, precision and recall (Y-axis) of different variant callers (Sniffles, CuteSV, SVIM, PBSV and SVDSS) with Pbmm2 aligner for the SI00001 dataset at different sequencing coverages.

On average, the CuteSV caller required 4.044 h of CPU time, a wall clock time of 102.3 min, and 3.4 GB of memory across all aligners. The CuteSV caller relies on high-quality alignments to reliably call structural variants, which may affect its performance; it performs well across aligners and uses little CPU and memory. Sniffles required 4.227 h of CPU time, a wall clock time of 121.3 min, and 5.1 GB of memory across all aligners and, like CuteSV, tends to perform relatively well across all aligners. SVIM's CPU time was 3.445 h, its wall clock time 463.4 min, and its memory use 3.405 GB. The two-step PBSV variant calling process has an average CPU time of 11.81 h and a wall clock time of 336.1 min, with a memory usage of 56.91 GB across all aligners; it is explicitly designed for PacBio long-read data and can be computationally intensive. The three-step SVDSS variant calling process takes an average of 16.183 h of CPU time and 4 h 1 min 15 s (≈241 min) of wall clock time, with a memory usage of 70.723 GB across all aligners (Table 3).

Evaluation of the different SV callers’ performance against the three datasets in terms of deletions and insertions

Each SV caller called different kinds of SVs in different numbers, the most common types being deletions and insertions. Because only a small number of SV types other than insertions and deletions were called, and some SV truth sets contain only insertions and deletions, the resulting SV calls from all callers were placed into two main groups: deletions (DEL) and insertions (INS). Other SV types in the call sets, such as inversions and translocations, were not used in the current evaluation. Two callers, SVDSS and SVIM, consistently called a higher number of SVs than the other callers and tended to have a higher proportion of both deletions and insertions, which may explain the F1-score, precision, and recall values for these two tools. Sniffles and CuteSV tended to call fewer SVs than SVDSS and SVIM. PBSV called the fewest SVs across all aligners and coverage levels, which may be because it was designed for analyzing PacBio long-read data. The results of running NpInv on the three datasets at different coverage levels revealed that the number of inversions called by NpInv increases with higher coverage, which is expected given the increased sequencing depth and information available (Supplementary Tables S1–S3). The results also suggest that the choice of aligner can affect NpInv's performance; however, the differences between aligners are relatively small, and NpInv appeared to perform well with all the aligners tested. In terms of coverage level, the highest number of inversions was called at 30X, followed by 20X and 10X. The same trend across the three datasets indicates that the degree of coverage strongly affects NpInv (Supplementary Tables S4–S6).

Evaluation of the different SV callers’ performance across SV lengths in terms of precision, recall and F1-score

To comply with the definition of a structural variant, all SVs shorter than 50 bp were disregarded and filtered out in the filtration step. The SV count in each group, together with the distribution of SVs across the different length ranges, is presented in detail in the supplementary tables (Supplementary Tables S7–S9). At total coverage: Sniffles detected the lowest total number of SVs in the < 50 bp range and the highest number of SVs in the 50–250 bp range. CuteSV detected a significant number of SVs in the 50–250 bp range but none in the < 50 bp range. SVIM detected a large number of SVs in the 50–250 bp range and also had substantial detection in the < 50 bp range. PBSV showed consistent detection in the 50–250 bp and 251–500 bp ranges. SVDSS had the highest total number of SVs detected, with a significant number in the < 50 bp and 50–250 bp ranges.

At 30X coverage: Sniffles has a high number of detected variants in the 50–250 bp range, followed by the 251–500 bp and 501–750 bp ranges. CuteSV detected most variants in the 50–250 bp range, with very few in other ranges. SVIM has a significant detection rate in the < 50 bp range, followed by the 50–250 bp range. PBSV also has most variants in the 50–250 bp range, with fewer detected as the length increases. SVDSS has a very high number in the < 50 bp range, followed by a substantial count in the 50–250 bp range. At 20X coverage: Sniffles, PBSV, CuteSV, and SVIM generally show patterns similar to those at 30X coverage, with overall lower counts; SVDSS remains notably high in the < 50 bp range and lower in the higher ranges. At 10X coverage: Sniffles detected a markedly reduced number of variants in all ranges compared to 30X coverage. CuteSV detected fewer variants across all ranges, with zero in the < 50 bp range. SVIM detected a notably high count in the < 50 bp range with a steep drop-off at larger sizes. PBSV again shows a similar pattern, with a preference for the 50–250 bp range. SVDSS still detected a substantial number in the < 50 bp range, markedly more than the other callers at this coverage (Supplementary Tables S7–S9). The distribution and count of the detected SVs by SV length group were charted as bar charts to give insight into the number of SVs detected per length range by each variant caller for the NA12878 (Supplementary Figures S1–S3), NA24385 (Supplementary Figures S4–S7), and SI00001 (Supplementary Figures S8–S11) datasets.

The accuracy metrics (precision, recall, and F1-score) across the different SV length groups were computed for the most commonly studied reference sample, NA24385, as this will be valuable for future studies and evaluations. For Minimap2 at total coverage: Sniffles showed varying performance across different SV length groups, with precision ranging from 47.01 to 72.80% and recall ranging from 38.14 to 77.21%. The F1-score ranged from 42.11 to 73.28%, indicating variability in its performance across different SV length categories.

CuteSV demonstrated consistently high precision, recall, and F1-score across all SV length groups, with values ranging from 82.98 to 94.73% for precision, 94.63–97.45% for recall, and 88.71–95.03% for F1-score. This indicates strong and consistent performance in detecting SVs across different length categories at this coverage level. SVIM showed varying performance, with precision ranging from 56.19 to 83.02%, recall ranging from 65.91 to 81.09%, and F1-score ranging from 60.66 to 76.24% across different SV length groups. PBSV demonstrated relatively high precision, recall, and F1-score across different SV length groups, indicating consistent performance in detecting SVs of varying lengths at this coverage level.

For Minimap2 at 30X Coverage: SVDSS demonstrated varying performance across different SV length groups, with precision ranging from 69.77 to 93.25%, recall ranging from 70.35 to 81.94%, and F1-score ranging from 70.06 to 87.23%. Sniffles showed varying performance, with precision ranging from 65.82 to 81.12%, recall ranging from 61.22 to 79.00%, and F1-score ranging from 58.77% to 77.52%. CuteSV demonstrated consistently high precision, recall, and F1-score across all SV length groups, with values ranging from 93.89 to 96.64% for precision, 95.61–97.67% for recall, and 94.74–97.15% for F1-score. SVIM showed varying performance, with precision ranging from 65.97 to 88.91%, recall ranging from 69.86 to 77.76%, and F1-score ranging from 67.86 to 82.96%. PBSV demonstrated relatively high precision, recall, and F1-score across different SV length groups, indicating consistent performance in detecting SVs of varying lengths at 30X coverage.

For Minimap2 at 20X Coverage: SVDSS demonstrated varying performance across different SV length groups, with precision ranging from 78.23 to 98.46%, recall ranging from 77.87 to 94.74%, and F1-score ranging from 78.05 to 96.57%. Sniffles showed varying performance, with precision ranging from 76.53 to 88.70%, recall ranging from 74.79 to 85.26%, and F1-score ranging from 75.65 to 84.94%. CuteSV demonstrated consistently high precision, recall, and F1-score across all SV length groups, with values ranging from 95.76 to 99.47% for precision, 96.75–97.67% for recall, and 96.25–99.20% for F1-score. SVIM showed varying performance, with precision ranging from 79.20 to 91.72%, recall ranging from 79.19 to 84.71%, and F1-score ranging from 79.20 to 88.08%. PBSV demonstrated relatively high precision, recall, and F1-score across different SV length groups, indicating consistent performance in detecting SVs of varying lengths at 20X coverage.

For Minimap2 at 10X Coverage: SVDSS demonstrated varying performance across different SV length groups, with precision ranging from 90.60 to 99.73%, recall ranging from 90.15 to 99.19%, and F1-score ranging from 90.37 to 99.46%. Sniffles showed varying performance, with precision ranging from 93.13 to 97.05%, recall ranging from 79.65 to 94.97%, and F1-score ranging from 86.09 to 95.63%. CuteSV demonstrated consistently high precision, recall, and F1-score across all SV length groups, with values ranging from 98.92 to 99.47% for precision, 98.94–99.12% for recall, and 99.02–99.20% for F1-score. SVIM showed varying performance, with precision ranging from 95.08 to 99.65%, recall ranging from 94.79 to 98.88%, and F1-score ranging from 94.93 to 99.26%. PBSV demonstrated relatively high precision, recall, and F1-score across different SV length groups, indicating consistent performance in detecting SVs of varying lengths at 10X coverage.

The SV callers’ performance with LRA, NGMLR, and Pbmm2 was similar to that with Minimap2: CuteSV demonstrated consistently high precision, recall, and F1-score across all SV length groups and coverage levels, indicating strong and consistent performance in detecting SVs. Sniffles showed varying performance across different SV length groups and coverage levels, with competitive precision and recall for detecting SVs of various lengths. SVIM demonstrated competitive performance in detecting SVs of various lengths at different coverage levels, while PBSV exhibited relatively high precision, recall, and F1-score across different SV length groups, indicating consistent performance at different coverage levels. SVDSS, by contrast, exhibited highly variable performance, with relatively low precision, recall, and F1-score (Supplementary Tables S10–S13).

Discussion

Most previous studies focused on single-nucleotide polymorphism (SNP) detection because SNPs are easier to track down using existing sequencing tools and algorithms39. A growing appreciation of the prevalence of SVs over the last 20 years has shifted our viewpoint on their impact on genomic disorders40. Despite all these indications of their importance, SVs have received far less attention than SNVs because they are difficult to detect. In theory, each type of SV produces a distinct read-alignment signature that can be used to deduce the underlying variation40. Multiple SVs can overlap or cluster, producing signatures more intricate than those of the individual events. Such complex patterns may impede mapping entirely, forcing investigators to reconstruct the affected genomic regions and repeat the analysis from scratch27,41.

With the introduction of long-read sequencing technology, specifically Pacific Biosciences (PacBio) and ONT, it has become possible to produce reads of thousands of base pairs19,29. Because of different DNA library preparations, the various platforms produce different kinds of information42,43. As previously reported, the primary distinctions between these types of reads are their length and error rate44. Furthermore, assembly-based methods can also be utilized for SV detection. It remains difficult to assess the performance of SV detection tools because of the absence of a reference scheme for precisely identifying such SVs.

To address this limitation, the Genome in a Bottle (GIAB) consortium recently released a sequence-resolved benchmark set for SV detection45. We used the long-read nanopore sequencing data for sample NA24385 deposited on the NCBI FTP site to establish an accurate benchmark for assessing the SV detection algorithms and to create a pipeline that supports SV detection by choosing the aligner and SV caller that best fit the results of an existing benchmark set available from GIAB44,45. The NA24385 and NA12878 FASTQ files, retrieved from the NCBI repository and the nanopore whole-genome sequencing consortium repository, as well as the simulated dataset SI00001 FASTQ generated as per the instructions provided in this repository (https://github.com/davidebolo1993/EViNCe/tree/main/SI00001) (accessed on 3 September 2023), were aligned to the GRCh37 reference genome using four of the most common long-read aligners: Minimap2, LRA, NGMLR, and Pbmm2, a SMRT wrapper for Minimap2 developed for PacBio data. To evaluate the impact of sequencing depth on SV calls, subsets were created by down-sampling each original dataset to 30X, 20X, and 10X coverage with Samtools, and the benchmarking tool Truvari was used to calculate the F1-score, precision, and recall for each of the studied SV callers at each coverage level. We put five general-purpose SV callers to the test: Sniffles39, SVIM4,19, CuteSV30, PBSV, and SVDSS37, as they can detect all SV types from long-read alignments, with the exception of SVDSS, which was developed to detect insertions and deletions only and is not yet customized to detect inversions. Currently, ONT recommends Sniffles2 as the go-to SV caller, and it is integrated as the SV caller of choice in the variant detection pipeline, along with Clair3 for SNV/indel detection.

The Sniffles2 caller detects all types of SVs and can be used with any aligner, particularly Minimap2. As per the recommendation of ONT, this combination was used as the base of the two Nextflow-based workflows that manage compute and software resources, as previously reported46,47. After mapping reads to the reference genome, the program detects split reads and read pairs that span the potential SV breakpoints. Sniffles2 clusters breakpoint-spanning reads and utilizes a probabilistic algorithm to identify the most likely SV type and breakpoints39, while the CuteSV caller collects SV signatures using customized approaches and analyzes them using a clustering-and-refinement process to detect SVs sensitively. The CuteSV caller outperformed state-of-the-art techniques in yield and scalability on PacBio and ONT datasets. Furthermore, the CuteSV caller uses split-read and read-pair information to detect SVs: after mapping reads to the reference genome, the tool groups split reads and read pairs that support SV breakpoints and then uses graphs to determine the most likely SV type and breakpoints30.

Meanwhile, SVIM calls structural variants from third-generation sequencing reads and identifies and classifies most genetic changes by integrating genome-wide data. SVIM uses de novo assembly to generate contigs spanning potential SV breakpoints and outperformed competing approaches on simulated and real PacBio and nanopore sequencing data. It combines split-read and read-pair information with de novo assembly of insertion events to identify SVs: SV breakpoints are identified by mapping reads to the reference genome, after which SVIM generates contigs spanning these breakpoints with a de novo assembler and aligns them to the reference genome to determine the most likely SV type and breakpoints19. PBSV is variant calling software developed by PacBio to detect structural variants in long-read PacBio sequencing data. It aligns long reads to a reference genome using a long-read aligner and identifies structural variants from split reads and discordant read pairs that indicate an SV. PBSV clusters discordant read pairs and finds the most likely SV type and breakpoints using a graph-based technique; it then clusters these variants and filters out false positives to identify complex and large structural variants that are hard to distinguish using short-read sequencing data (PacificBiosciences/pbsv, 2022). It is most useful for detecting insertions of 20 bp to 10 kb, deletions of 20 bp to 100 kb, inversions of 200 bp to 10 kb, and duplications of 20 bp to 10 kb44. On the other hand, SVIM employs a graph-based technique to discover signature clusters and final SVs, with each node representing an SV signature, and is known to perform best with PacBio HiFi reads13. PBSV's precision in calling SVs was much better than its recall across the different coverage datasets; still, overall, its recall and precision were much lower than those reported by the other tools. However, in other studies its performance was better than Sniffles19, which may be due to differences in the dataset and the aligner used for benchmarking.

SVDSS is designed to identify SVs in hard-to-call genomic regions using long-read sequencing data and sample-specific strings (SFS). SVDSS requires a reference genome in FASTA format. Its workflow involves building an FMD index, smoothing the input BAM file, extracting SFS, assembling SFS into superstrings, and finally calling SVs. It incorporates split-read and soft-clipping analysis, clustering, and machine learning algorithms to improve accuracy37. Regarding inversions: an inversion is a structural variation in which a segment of DNA is flipped so that its sequence is reversed relative to the reference genome. NpInv is the tool of choice for detecting inversions from long-read sequencing data and works by analyzing the alignment of long reads to a reference genome48. NpInv uses a distinctive approach to detect inversions: it first identifies regions where the long-read data span two regions of the reference genome in an orientation inconsistent with the reference, then looks for a breakpoint, i.e., a location where the sequence in the long-read data abruptly changes orientation, and finally uses a statistical model to determine whether the orientation change is consistent with an inversion48. NpInv compares favorably with other inversion-capable tools such as SVIM, Sniffles, and CuteSV in several ways. Firstly, NpInv is designed specifically for detecting inversions, whereas the other tools target a broader range of structural variations, so NpInv is optimized for inversions and may be more sensitive and specific for this type of variant4,48. Secondly, NpInv is designed to work with long-read sequencing data, which is typically more informative than short-read data; long reads allow NpInv to span the breakpoints of inversions, which can be challenging to detect with short reads48.

Based on the performance of the different SV callers with the Minimap2 aligner at different coverage depths, both Sniffles and CuteSV have the highest F1-scores across all coverage depths. The PBSV caller also has a high F1-score but with lower precision. SVIM has a lower F1-score than the other callers, especially at lower coverage depths. SVDSS has the lowest F1-score, precision, and recall at all coverage depths. All callers perform relatively well at higher coverage depths (30X and 20X), with F1-scores above 90%. However, at the lower coverage depth (10X), all callers except Sniffles have lower F1-scores, with SVDSS having the lowest F1-score of only 31.3%.

Regarding the performance of the different SV callers with the LRA aligner at different coverage depths, the CuteSV caller has the highest F1-score and recall at all coverage depths. The Sniffles caller has the highest precision but lower recall compared to the CuteSV caller. SVIM performs well, with an F1-score above 90% at all coverage depths. PBSV has a relatively low F1-score and recall compared to the other callers. SVDSS has the lowest F1-score, precision, and recall at all coverage depths. All callers perform relatively well at higher coverage depths (30X and 20X), with F1-scores above 75%. However, at the lower coverage depth (10X), all callers except CuteSV have lower F1-scores, with SVDSS having the lowest F1-score of only 30.31%.

The performance of the different SV callers with the NGMLR aligner at different coverage depths shows that the CuteSV caller has the highest F1-score and recall at all coverage depths. Sniffles has the highest precision but lower recall compared to the CuteSV caller. SVIM performs well, with an F1-score above 80% at all coverage depths. PBSV has a relatively low F1-score and recall compared to the other callers. SVDSS has the lowest F1-score, precision, and recall at all coverage depths. All callers perform relatively well at higher coverage depths (30X and 20X), with F1-scores above 70%. However, at the lower coverage depth (10X), all callers except CuteSV have lower F1-scores, with SVDSS having the lowest F1-score of only 53.78%.

The performance of the different SV callers with the Pbmm2 aligner at different coverage depths shows that SVIM has the highest F1-score, precision, and recall at all coverage depths. The CuteSV caller has a relatively low F1-score at all coverage depths but still performs better than Sniffles and PBSV. SVDSS has the lowest F1-score, precision, and recall at all coverage depths. All callers perform relatively well at higher coverage depths (30X and 20X), with F1-scores above 70%. However, at the lower coverage depth (10X), all callers have lower F1-scores, with SVDSS having the lowest F1-score of only 27.27%.

After analyzing the precision, recall, and F1-score data of the different variant callers coupled with the Minimap2, LRA, NGMLR, and Pbmm2 aligners, and with respect to SV length, several trends and patterns emerge. CuteSV consistently demonstrates high precision, recall, and F1-score across all aligners, indicating robust performance in detecting structural variants across different length groups and coverage levels. Sniffles exhibits competitive performance with varying precision and recall, especially for larger SVs, even though this particular variant caller was a top performer when tested on the unbinned reference. SVDSS consistently shows strong performance across aligners, with relatively high precision, recall, and F1-score in each SV length group, even though it showed very poor performance when tested on the unbinned reference, which lays the groundwork for future investigation of this behavior. SVIM demonstrates competitive performance in detecting SVs of various lengths at different coverage levels. PBSV exhibits relatively high precision, recall, and F1-score across different SV length groups, indicating consistent performance in detecting SVs. In conclusion, CuteSV emerges as a top performer across all aligners, demonstrating consistent and robust performance in detecting SVs. Sniffles shows competitive performance, especially for larger SVs. SVIM demonstrates competitive performance, while PBSV exhibits relatively high precision and recall. These findings suggest that the choice of aligner and variant caller can significantly impact the accuracy and sensitivity of SV detection.

The recall and precision percentages fluctuate at coverages as low as 10X, indicating that such low coverages should not be used in structural variant calling routines; 20X appears to be the minimum coverage required to maintain the tools' performance as measured by the F1-score. The comparison metrics confirmed the usual tendency for higher sequencing depth to increase recall and precision, though the gains can be disproportionate depending on the tool itself. More flexible thresholds boost recall but decrease precision, whereas stricter cut-offs do the opposite. The precision and recall rates for each form of SV were studied, and each method worked best for deletions and insertions, which comprise most SVs in the human genome. Based on the results presented in this paper, both Sniffles and CuteSV consistently perform well across different aligners and coverage depths in terms of F1-score, precision, and recall. Sniffles should be preferred if high precision is required, while the CuteSV caller and Sniffles should be selected if high recall is needed. The Minimap2 aligner and Sniffles are recommended for preliminary analysis because of their great speed and stable performance for both insertions and deletions.

In summary, the best-performing SV caller depends on the aligner and coverage depth used. The CuteSV caller consistently performs well across different aligners and coverage depths, with high F1-scores and recall. Sniffles has high precision but lower recall compared to CuteSV. SVIM performs well, with high F1-scores, precision, and recall at all coverage depths with the Pbmm2 aligner. PBSV has a relatively low F1-score and recall compared to the other callers. SVDSS consistently has the lowest F1-score, precision, and recall at all coverage depths. Researchers should select the appropriate SV caller based on their specific data and research question, taking into account the aligner and coverage depth used. Recently, combining calls from multiple pipelines, such as Sniffles, CuteSV, and SVIM, has been proposed as a possible approach to enhance the performance of the available SV callers and help reduce the overall false positive rate3. Moreover, various studies have investigated and evaluated the available variant calling tools for Oxford Nanopore sequencing in breast cancer4,49, in the metagenomic discovery of secondary metabolites from various microorganisms50,51, and in the detection of various plant pathogens52.

Conclusions

The current study highlights how different aligners and coverage levels affect the performance of various SV callers, with performance varying depending on the dataset being analyzed. The choice of aligner can significantly impact the performance of structural variant (SV) callers, with Minimap2 outperforming NGMLR and LRA in recall, precision, and F1-score, likely because of its ability to handle long reads. Lower coverage levels decrease the SV callers' performance because fewer reads are available. The Sniffles and CuteSV callers perform well across different aligners and coverage levels, accurately identifying various SV types. Both SVIM and PBSV perform well in some cases but have more variable performance, with SVIM having lower recall and F1-scores and PBSV having high recall but lower precision at lower coverage levels. SVDSS consistently has the lowest F1-score, precision, and recall at all coverage depths. Based on these findings, SV callers such as Sniffles or CuteSV are recommended for preliminary data assessment because they maintain high accuracy, particularly when evaluating low-coverage data. Minimap2 as the aligner and Sniffles as the SV caller were chosen and suggested as the basis of the SV calling pipeline because of their high speed and reliable performance for genomic mutations such as insertions and deletions. Overall, our study provides a comprehensive evaluation of popular SV callers and aligners and can serve as a reference for researchers in selecting the most suitable tools for their SV detection needs.