Introduction

The last decade represents an unprecedented era of genetic research: tremendous amounts of genomic data are being generated, paralleled by growth in computational power and advances in bioinformatics algorithms to decipher the genomic patterns in these data. The time- and cost-effectiveness of next-generation sequencing-by-synthesis technology has made it the most widely used sequencing technology in research and clinical investigations. Most of today’s bioinformatics algorithms are therefore designed to analyze short-read sequencing data.

Following successful genetic discoveries using short-read sequencing, the research community started to face new obstacles due to the inability of short-read sequencing to effectively resolve specific characteristics of the human genome. These obstacles can be summarized as: (a) the inability of short reads to accurately map onto complex parts of the genome [1, 2]; (b) the need for very complex algorithms, which in turn require expensive computational power, to accurately identify structural variants (SVs); and (c) despite all the advancements in bioinformatics, some quantitative analyses, such as the assessment of copy number variations (CNVs), remain hard to perform accurately using short reads. The fact that parts of the human genome have yet to be fully assembled in the reference genome further illustrates the need for longer sequences to understand the complexity of genomic sequences.

The emergence of next-next-generation sequencing technologies, namely PacBio’s (Pacific Biosciences Inc., Menlo Park, CA, USA) single molecule real-time (SMRT) technology [3] and Oxford Nanopore’s (Oxford Nanopore Technologies Ltd., Oxford Science Park, Oxford, UK) long-read sequencing technology [4], brought new opportunities for genetic researchers to overcome the shortcomings of short-read sequencing. However, these long-read sequencing technologies still have their own limitations, mainly inaccuracies at the base-by-base level. These errors are largely due to a low signal-to-noise ratio [5]; in addition, studies showed that the single nucleotide errors of SMRT long-read sequencing can be partially attributed to base substitution errors of the polymerase enzyme [6, 7], with a random distribution across long reads. Short insertion and deletion errors (indels) represent the majority of SMRT errors, with a tendency to occur around homopolymer regions; they can also result from polymerization slowdown around non-B-form DNA conformations, such as G-quadruplexes [8]. Nonetheless, these errors are still random in nature, and as the number of polymerization passes increases, the accuracy of the resulting consensus sequence increases; this is exploited by circular consensus sequencing (CCS) [9] and the recently developed high-fidelity (HiFi) reads [10].

The higher error rate of long-read sequencing compared to short-read sequencing [5] has led scientists to resort to long reads mainly when tackling genetic research questions that involve parts of the genome that are too complex or structurally challenging for short reads to handle efficiently. At the same time, short-read sequencing is still the technology of choice for identifying single nucleotide variants (SNVs) and short insertions and deletions (indels). Therefore, bioinformaticians started to find ways of hybridizing the results of short- and long-read data to obtain reliable genomic sequences, exploiting the lower error rate of short reads in combination with long reads, which are long enough to accurately map to complex parts of the reference genome and to identify SVs, CNVs, and repetitive regions.

Several studies have tried to measure the error rates of short-read sequencing. A study by Nakamura et al. [11] was among the first to describe specific systematic errors produced by Illumina sequencers. Despite the subsequent development of Illumina’s technologies, short-read sequences still suffer from systematic errors unequivocally associated with specific base sequences. Pfeiffer et al. [12] performed a systematic evaluation of error rates for Illumina’s short-read sequencing technology and determined the error rate to be 0.24 ± 0.06% per base, with 6.4 ± 1.24% of reads being mutated.

Nanopore sequencing errors were shown to have systematic patterns and to be less random than PacBio’s sequencing errors [5]. However, despite the higher frequency of errors in long reads, the extended length of PacBio and Nanopore reads still provides more randomness of errors per read compared to short reads.

In this study, we compare SNV detection in three cases that underwent whole-genome sequencing using both Illumina’s short-read sequencing and PacBio’s SMRT sequencing technologies. The comparison was done by genotyping the mitochondrial DNA (mtDNA) rather than the nuclear DNA, for the following reasons: (a) compared to nuclear DNA, the number of mtDNA copies inside a cell is tremendously high, exceeding the nuclear copy number by thousands of fold in some cells; mtDNA therefore naturally provides, in any whole-genome sequencing run, the high depth of coverage necessary for accurately comparing variant allele frequencies (VAFs) in reads; (b) haploid phasing of variants in nuclear DNA is necessary for obtaining a higher recall rate, as has been described in multiple studies [13,14,15,16]; therefore, being haploid, mtDNA makes variant identification more comparable between short and long reads; (c) mtDNA is a very short DNA sequence compared to any nuclear chromosome; therefore, its reassembly against the reference is more accurate than that of nuclear DNA for identifying baseline heteroplasmy fractions.

Synthetic long reads, generated by technologies such as 10X Genomics’ barcoding (Pleasanton, CA, USA) [17], can provide chromosomal reads that are far longer than mtDNA, with high confidence that they originate from the same DNA fragment. However, this study aims to make a direct comparison between the standard output of long-read and short-read sequencing technologies, without resorting to costly and sophisticated technologies to avoid haplotype mix-up.

Materials and methods

Sample selection

Three samples of unrelated individuals were selected in our laboratory, where both short- and long-read whole-genome sequencing analyses had been done for each sample. Sample-1 is from an 8-year-old female diagnosed with Krabbe disease (OMIM# 245200); she has beta-galactocerebrosidase deficiency and a heterozygous mutation. No mitochondrial variants were found to be responsible for her clinical diagnosis. Sample-2 is from a 40-year-old female diagnosed with benign adult familial myoclonus epilepsy (BAFME) (OMIM# 601068); she is referred to as individual [III 2] in a BAFME family studied by Mizuguchi et al. [18]. Sample-3 is from a 31-year-old male patient with definite hereditary hemorrhagic telangiectasia (HHT) (OMIM# 187300), based on the Curaçao diagnostic criteria [19]. His three-generation family history is suggestive of autosomal inheritance of HHT, and no mitochondrial variants were found to be possibly linked to his HHT diagnosis.

Long-read library preparation

Genomic DNA of the three samples was extracted from peripheral blood leukocytes using QuickGene (Kurabo) for samples 1 and 3, and standard phenol-chloroform extraction for sample 2. DNA size and integrity were assessed using pulsed-field agarose gel electrophoresis, followed by DNA concentration measurement using a Qubit fluorometer (Life Technologies). Fragmentation, using g-TUBE (Covaris) and centrifugation at 1500 × g, was done before purifying the fragmented DNA with AMPure PB magnetic beads (Beckman Coulter).

Five micrograms of each sample’s fragmented DNA was used for SMRTbell library preparation using the SMRTbell Template Prep Kit 1.0 SPv3, Sequel Binding Kit 2.0, SMRTbell Clean-Up Column v2 Kit, and MagBead Kit v2 (Pacific Biosciences). Briefly, the resulting SMRTbell template was enriched for DNA fragments of >10 kb via BluePippin (Sage Science) size selection. The size-selected template was purified using AMPure PB beads before performing the DNA repair reaction. The SMRTbell template DNA was annealed with Sequel Polymerase 2.0. The Clean-Up Column kit was used to purify the SMRTbell template DNA/polymerase complex before diluting the purified complex to a concentration of 20 pM. Finally, the purified complex was mixed with MagBeads to produce a MagBead-bound SMRTbell complex, which was loaded onto Sequel SMRT Cell 1M v2. A total of four cells were used for samples 2 and 3, and six cells for sample 1, with a data collection time of 6 h for each SMRT cell.

Short-read library preparation

Genomic DNA was extracted from peripheral blood lymphocytes. Using the TruSeq DNA PCR-Free library preparation kit, a genomic DNA library was constructed and sequenced on Illumina’s HiSeq X Ten using a single index. The generated sequence data averaged 32.8 million 150-nucleotide paired-end reads per sample.

Long-read mitochondrial DNA data analysis

For consistency with the short-read analysis, the long-read analysis was performed on whole-genome single-pass subreads obtained from the PacBio Sequel sequencer. PacBio’s single-pass subreads were generated by extracting the long sequences from the SMRTbell templates after removal of the adapter sequences. Each sample’s subreads BAM file contains all the subreads generated from all cells used for that sample. In addition to the nucleotide-sequence information, a full set of quality and kinetic parameters is attached to each subread; therefore, to fully utilize these technology-specific data, the mapping and analysis were done using the standard software included in PacBio’s SMRT tools v.6.0.0 (Pacific Biosciences).

Subreads produced by the cells of each sample were aligned to the mtDNA rCRS reference (NC_012920.1) using BLASR (v5.1) [20] with default mapping options.

Following alignment to the rCRS reference genome, the average number of aligned subreads per sample was 378, with an average length of 3906 bp, while the average N50 length of polymerase reads for the whole-genome data was 14,761 bp (Supplementary Table 1). The average concordance of the samples’ data with the reference was 0.8261.

Short-read data analysis

The short-read data analysis of mtDNA was done following the best-practice guidelines of the Genome Analysis Toolkit (GATK v.4.1) [21] (Broad Institute), since GATK is still regarded as the gold standard and one of the most widely used software toolkits for genotyping short-read data. In version 4.1 of GATK, the Mutect2 tool, which was primarily designed to call somatic short nuclear variants using local assembly of haplotypes, has been revised to include a “mitochondria mode,” in which the LOD threshold is set to 0 and possible nuclear mitochondrial sequences can be annotated using a Poisson distribution based on the median autosomal coverage. Utilizing Mutect2 can therefore provide robust detection of very low fractions of mitochondrial variants after statistical exclusion of “nuclear mitochondrial DNA segments” (NuMTs), which represent transposed mitochondrial sequences in the nuclear DNA. In addition, Mutect2 utilizes the original DREAM challenge-winning engine [22] together with the HaplotypeCaller machinery of local de novo reassembly. Therefore, Mutect2 can provide high sensitivity combined with specificity in calling mtDNA variants.

Following the GATK best-practice guidelines, short reads were mapped to the GRCh38 genome reference, which includes the revised Cambridge Reference Sequence of the mitochondrial genome (NC_012920.1) [23], using the Burrows-Wheeler Alignment Tool (bwa v0.7.17-r1188) [24]. Since the bwa aligner is not designed to align reads evenly across the circular mtDNA, the alignment process included two branches, where the second branch aligned reads to an mtDNA reference shifted by 8000 nucleotides. Following alignment, the two resulting BAM files for each sample were passed through a pipeline of Genome Analysis Toolkit (GATK v.4) [21] tools that included marking of duplicate reads, local indel realignment, and base quality score recalibration, before genotyping with Mutect2 in mitochondria mode.
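
As an illustration of the shifted-reference branch, the following minimal Python sketch rotates the circular mtDNA sequence by 8000 nucleotides so that reads spanning the artificial linearization break point can be aligned in the second branch; the file names are hypothetical, and the actual shifting in our pipeline may have been performed with other tooling.

```python
# Minimal sketch (hypothetical file names): build an mtDNA reference shifted
# by 8000 nt so that reads spanning the linearization break point of the
# circular genome can be mapped in a second alignment branch.
SHIFT = 8000

def read_single_fasta(path):
    """Return (header, sequence) from a single-record FASTA file."""
    with open(path) as handle:
        header = handle.readline().strip()
        sequence = "".join(line.strip() for line in handle)
    return header, sequence

header, mt_seq = read_single_fasta("rCRS.fasta")    # NC_012920.1, 16,569 bp
shifted = mt_seq[SHIFT:] + mt_seq[:SHIFT]           # rotate the circular sequence

with open("rCRS_shifted_8000.fasta", "w") as out:
    out.write(header + "_shifted_8000\n")
    for i in range(0, len(shifted), 70):
        out.write(shifted[i:i + 70] + "\n")
```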

Long-read data analysis

The genotyping of the mapped long-read data was done using the variantCaller tool (v2.2.2) of PacBio’s SMRT tools (v6.0). variantCaller is provided by PacBio in the GenomicConsensus package; when run with default settings, as was done for our samples, it utilizes the Arrow consensus model for variant calling against the reference. The Arrow algorithm is an improved version of Quiver [25], a hidden Markov model that utilizes the consensus of long reads to filter out random errors.
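
As a conceptual illustration only, and not a description of the Arrow hidden Markov model itself, the following toy Python sketch shows why consensus-based calling depends on coverage: with random, independent per-read errors, a simple per-position majority vote converges to the true base as the number of covering reads grows (the error rate and coverage values are assumed purely for illustration).

```python
import random

random.seed(0)
TRUE_BASE = "A"
ERROR_RATE = 0.12                 # assumed single-pass per-base error rate
BASES = "ACGT"

def observed_base():
    """Simulate the base observed at one position of one subread."""
    if random.random() < ERROR_RATE:
        return random.choice([b for b in BASES if b != TRUE_BASE])
    return TRUE_BASE

for coverage in (5, 20, 50, 200):
    trials, correct = 5000, 0
    for _ in range(trials):
        pile = [observed_base() for _ in range(coverage)]
        consensus = max(set(pile), key=pile.count)   # naive majority vote
        correct += consensus == TRUE_BASE
    print(f"coverage {coverage:>3}: consensus accuracy {correct / trials:.4f}")
```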

Haplogroup assignment

The haplogroup assignment for each sample was done using mitolib v.0.1.2 (https://github.com/haansi/mitolib), integrated into the contamination analysis step of the GATK 4.1 tools [21]. mitolib’s haplochecker checks for mtDNA contamination using Phylotree 17 and assigns the most probable haplogroup to each mtDNA short-read BAM file.

Sanger sequencing

A total of seven discrepant variants between the short- and long-read sequencing analyses were chosen for Sanger sequencing. All variants with a VAF below 0.1, except for one variant in sample-1 with a borderline VAF of 0.096, were excluded from Sanger confirmation. Sequences of the primers used are available upon request.

Standard PCR amplification and capillary electrophoresis on an ABI 3130xl were performed; the PCR protocol had to be modified to accommodate a GC-rich region.

Tagging variants

Several studies have concluded that Illumina short-read variants with a VAF below 1% are highly likely to be erroneous [26, 27]; in fact, most of the corresponding reads in our study carried multiple heteroplasmic variants, which were annotated by Mutect2 as “chimeric original alignment” or as “strand artifacts”. It is also important to mention that a number of studies describe specific mtDNA variants associated with Illumina’s short-read sequencing as sequencing artifacts [26, 28]. Generally speaking, there are two main sources of errors: (a) technology-specific systematic errors, as with variants flanked by low-complexity regions [29]; and (b) bioinformatics errors, as in the miscalling of variants around the “N” placeholder at position 3107 of the rCRS reference. Therefore, for an unbiased comparison between the long- and short-read sequencing technologies, we tagged all variants belonging to any of the aforementioned categories as likely erroneous. These variants lie at the following mtDNA positions, which are characterized by low-complexity sequences or the placeholder: 301, 302, 310, 316, 3107, and 16182–16192.

According to a comprehensive study by Spencer et al. [30], specialized library preparation methods are required to accurately detect variants with VAFs below 0.01; otherwise, variants called by common methods with VAFs below 1% are very likely to be erroneous. Therefore, following the short-read genotyping, variants with a VAF below 0.01 were removed before proceeding with further analyses.
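
A minimal Python sketch of the tagging and filtering steps described above, using hypothetical variant records with 1-based mtDNA positions and VAFs; the tagged positions and the VAF threshold are those stated in the text.

```python
# Tagged positions: low-complexity regions and the "N" placeholder (see text).
TAGGED_POSITIONS = {301, 302, 310, 316, 3107} | set(range(16182, 16193))
MIN_VAF = 0.01  # short-read variants below this VAF are removed

def classify(variant):
    """Classify a variant dict with keys 'pos' and 'vaf' as
    'tagged' (likely erroneous, masked), 'filtered' (VAF too low), or 'pass'."""
    if variant["pos"] in TAGGED_POSITIONS:
        return "tagged"
    if variant["vaf"] < MIN_VAF:
        return "filtered"
    return "pass"

# Hypothetical example calls
calls = [
    {"pos": 310, "vaf": 0.42},    # low-complexity region -> tagged
    {"pos": 240, "vaf": 0.096},   # heteroplasmic, above threshold -> pass
    {"pos": 5046, "vaf": 0.004},  # VAF < 0.01 -> filtered
]
for variant in calls:
    print(variant["pos"], classify(variant))
```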

Statistical analysis for interrater reliability

Due to the small number of samples analyzed, and the necessity to mask tagged variants, standard statistical analyses such as the t-test and chi-square test are inapplicable. We can assume, however, that the long- and short-read genotyping analyses are two raters of each sample, where: (a) each sample’s mtDNA result is independent of the other samples’, since the individuals are unrelated; (b) each position in a sample’s mtDNA has a mutually exclusive probability of being mutated or identical to the reference, this probability is independent of that of other positions, and neither rater has a preference for reporting a position as identical to the reference or not; and (c) the two raters, the short- and long-read technologies, operate independently of each other.

According to the conditions proposed by Jacob Cohen in 1960 [31], weighted Cohen’s kappa represents the best statistical model for comparing the agreement between the long- and short-read analyses. Since neither technology can provide accurate genotyping at the masked positions, and neither was proven superior to the other there, it was necessary to adopt a weighted kappa analysis: assigning the masked positions a third status of ‘unknown’ allows the agreement between the two raters to be evaluated safely by giving these positions a low weight, thereby reducing the effect of their obscurity on the analysis. Had we chosen a more familiar statistical measure, such as the t-test, a larger number of cases would have been needed to obtain a statistically significant assessment of the agreement between the two tests.

The formula for Cohen’s kappa [31] calculation is:

$$\kappa = \frac{P(a) - P(e)}{1 - P(e)}$$

where P(a) is the observed (actual) agreement between the two raters, and P(e) is the hypothetical probability of agreement between the two raters expected by chance.

In our study, we implemented the quadratic weighted Cohen’s kappa [32], which treats the three possibilities at each position (reference, mutated, unknown) as ordered categories. Disagreements between the two raters are therefore treated symmetrically, with the different levels of agreement contributing different weights to the value of kappa.

The formula for our quadratic weights will therefore be:

$$w_i = 1 - \frac{\left(\text{LongRead value} - \text{ShortRead value}\right)^2}{\left(\text{Total number of categories} - 1\right)^2}$$

Since we have a total of three categories:

$$w_i = 1 - \frac{\left(\text{LongRead value} - \text{ShortRead value}\right)^2}{4}$$

where $w_i$ is the weighted agreement score at position $i$, and the category values are 1 for reference, 2 for mutation, and 3 for unknown.

Additionally, since we are analyzing data for only three samples, Cohen’s weighted kappa scoring reduces bias through: (a) the consideration of each position of the mtDNA as a separate experiment for both technologies to analyze, providing stronger statistical power for the calculation of the kappa coefficient; (b) accounting for the random errors of SMRT long-read sequencing in the coefficient calculation at each mtDNA position, since the calculation is done with respect to the total number of possible categories, which is three in our case; and (c) the ability to include the masked regions’ variants in the calculation, using a different weight of disagreement.

It is important to mention that the kappa coefficient is not a directly interpretable measure of agreement [33], but rather an indication of the level of agreement. Kappa coefficient values above 0.81 represent an almost perfect level of agreement, corresponding to a data reliability (estimated as the square of the kappa coefficient) between roughly 64 and 100% (see Supplementary Table 2 for a description of all levels).

In calculating the kappa coefficient for each sample, we used the standard weighted kappa tool of the Python library scikit-learn [34], comparing the long- and short-read call sets at each of the 16,569 positions of the mtDNA reference.
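
A minimal sketch of this calculation, assuming that per-position category vectors (1 = reference, 2 = mutated, 3 = unknown/masked) have already been derived from the two call sets; the variant positions shown are purely illustrative.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

MT_LENGTH = 16569  # length of the rCRS mitochondrial reference

# Illustrative call sets: position -> category 2 (mutated); masked positions
# receive category 3 in both raters; all other positions default to 1.
long_read_calls = {73: 2, 263: 2, 750: 2, 16223: 2}
short_read_calls = {73: 2, 263: 2, 750: 2, 240: 2, 16223: 2}
masked = {301, 302, 310, 316, 3107} | set(range(16182, 16193))

def to_vector(calls):
    vec = np.ones(MT_LENGTH, dtype=int)      # 1 = identical to the reference
    for pos, category in calls.items():
        vec[pos - 1] = category              # 2 = mutated
    for pos in masked:
        vec[pos - 1] = 3                     # 3 = unknown (masked)
    return vec

kappa = cohen_kappa_score(to_vector(long_read_calls),
                          to_vector(short_read_calls),
                          weights="quadratic")
print(f"quadratic weighted kappa = {kappa:.3f}")
print(f"estimated data reliability = {kappa ** 2:.1%}")  # see interpretation above
```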

Results

Short-read mitochondrial DNA variants analysis

The average total number of variants for short reads is 39.3, with averages of 35.67 SNVs and 3.67 indel variants per sample. For the called variants in each sample, the average maximum and minimum coverage values are 4984 and 1037, respectively (Table 1).

Table 1 Summary statistics of short- and long-read sequencing genotyping for mitochondrial DNA of three samples

Long-read mitochondrial DNA variants analysis

The average total number of variants for long reads is 36.67, with averages of 34.3 SNVs and 2.3 indel variants per sample. For the called variants in each sample, the average maximum and minimum coverage values are 228.3 and 49.3, respectively (Table 1).

The average total number of variants per sample is very comparable between the two technologies (Fig. 1a, c, e), despite the disproportionate difference in coverage. However, four untagged variants genotyped by the short-read analysis, with VAFs ranging from 0.032 to 0.096, were not detected by the long-read analysis.

Fig. 1

Circular plotting of genotyped variants and downsampling genotyping. Results of sample-1 (a, b), sample-2 (c, d), and sample-3 (e, f) are presented as circular plots of genotyped variants (a, c, e) and downsampling genotyping results (b, d, f). a, c, e: SNVs (black lines) and indels (red lines) are plotted in relation to the mitochondrial gene map. Heteroplasmic short-read variants (blue background) are shown as short lines, while all long-read variants are homoplasmic (orange background). The corresponding coverage for short reads (blue) is plotted on a circular scale of 6000 reads, while the long-read coverage (light red) is plotted on a circular scale of 350. b, d, f: The first track inside the gene map represents the full coverage of long reads. Each subsequent track represents the genotyped variants at a different level of downsampling (100, 80, 60, 40, and 20%). Black lines represent SNVs and red lines represent indels. This figure was plotted using the Circos package [36]

mtDNA haplogroups

Two samples were assigned a single haplogroup (B4c1a1 for sample-1 and D4a1a1 for sample-2), while one sample (sample-3) was assigned both a major and a minor haplogroup (D4b2b1 at 98.6% and D4b2 at 87.3%) (Table 1). The two haplogroups assigned to sample-3 are not phylogenetically distant; therefore, this is very unlikely to be due to contamination.

Homoplasmy versus heteroplasmy

The high read depth of the short-read sequence data, and its analysis using the Mutect2 tool of GATK 4.1 [21], which is specialized for detecting variants with high sensitivity across a range of VAFs, made it possible to reliably detect heteroplasmic variants. On the other hand, the substantially lower coverage of the PacBio long reads analyzed using the Arrow algorithm did not yield any heteroplasmic variants.

In order to accurately compare the performance of the two technologies in identifying variants, including heteroplasmic ones, it is necessary to mask variants in the tagged regions, since most of these variants are artifacts and are therefore likely to present in a heteroplasmic form (Supplementary Tables 3, 4 and 5).

The majority of heteroplasmic variants lie in the tagged regions; reliably verifying heteroplasmic variants in these regions using Sanger sequencing was not possible due to their low VAFs and the long homopolymers in these regions.

Cohen’s kappa coefficients

The weighted kappa coefficients for samples 1, 2, and 3 at full coverage are 0.908, 0.980, and 0.997, respectively (Table 2). Based on the standard interpretation of these values (Supplementary Table 2), these weighted kappa coefficients indicate that the levels of agreement between long- and short-read mtDNA genotyping are “almost perfect” at full coverage, with 82.4%, 96%, and 99.4% reliability for samples 1, 2, and 3, respectively.

Table 2 Calculated kappa-coefficient for the three samples at different coverage percentages

Sanger sequencing

All of the discrepant variants between the two technologies that we tried to confirm using Sanger sequencing were in tagged regions of low complexity. The other discrepant variants, which lie outside the tagged regions, have VAFs below 0.1 and cannot be reliably confirmed using Sanger sequencing. However, we still attempted to confirm one variant in sample-1 at position 240, with a VAF of 0.096, but the generated signal was too unreliable to validate or reject it.

A GC-rich region, the tagged low-complexity region 16181–16193, was re-sequenced for both sample-1 and sample-2 using a protocol specific for GC-rich regions; however, even with this special protocol, the obtained results still failed to provide clear validation.

Similar results were seen for the rest of the discrepant variants we tried to confirm using Sanger sequencing.

Long-read random downsampling and its effect on genotyped variants

To better understand the relationship between genotyping and the read depth of PacBio’s long-read sequencing, random downsampling of the reads was performed for each of the three samples: 20% of the reads were removed successively, and the genotyping results were compared at 20, 40, 60, 80, and 100% of the total coverage. Figure 1b, d and f show the genotyped variants at the different coverage levels for each sample.
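
One simple way to perform such random downsampling of an aligned subread BAM file is sketched below, assuming pysam is available; the file names, random seed, and per-read retention approach are assumptions for illustration rather than a description of the exact procedure used.

```python
import random
import pysam

random.seed(42)
KEEP_FRACTION = 0.6  # e.g. retain 60% of the original coverage

with pysam.AlignmentFile("sample1.mtDNA.aligned.bam", "rb") as src, \
     pysam.AlignmentFile("sample1.mtDNA.downsampled_60.bam", "wb",
                         template=src) as out:
    for read in src.fetch(until_eof=True):
        # keep each subread independently with probability KEEP_FRACTION
        if random.random() < KEEP_FRACTION:
            out.write(read)
```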

After random downsampling, the weighted kappa coefficient of agreement between the variant calls at each coverage level and the short-read results of each sample was calculated (Supplementary Tables 6, 7 and 8), following the same procedure of tagging variants in the regions described in the Methods section. Table 2 shows the calculated kappa coefficients for the three samples.

The quadratic weighted kappa coefficient corrects for chance agreement when comparing the two analyses. When the mean coverage at each downsampling level across the three samples is compared against the corresponding mean kappa value, we find that at a mean coverage of 51 (the 60% coverage level) the mean kappa value is 0.946, corresponding to an ‘almost perfect’ level of agreement (Supplementary Table 2). Furthermore, the mean kappa value at a mean coverage of 37 (the 40% coverage level) is 0.823, which is interpreted as a ‘strong’ level of agreement, indicating that 67.656% of the data are in reliable agreement and not agreeing by chance.

However, these values fluctuate widely among the three samples, since each sample has a different initial full-coverage value, and because of other sample-specific factors related to sample preparation or experimental conditions.

Short-read downsampling and its effect on genotyped variants

To compare the effect of downsampling on short-read data with that on long-read data, downsampling was done for each sample at seven different depths of coverage: 1000, 500, 100, 50, 30, 20, and 10×. Supplementary Tables 9, 10 and 11 show the allele frequency of each genotyped variant at the different depths of coverage.
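
For a fixed target depth, the fraction of reads to retain can be derived from the observed mean depth before applying the same random subsampling as above; the following brief sketch uses an assumed mean mtDNA coverage purely for illustration.

```python
# Hypothetical example: derive keep fractions for fixed target depths.
observed_mean_depth = 3200  # assumed mean mtDNA coverage of one sample
for target_depth in (1000, 500, 100, 50, 30, 20, 10):
    keep_fraction = min(1.0, target_depth / observed_mean_depth)
    print(f"{target_depth:>5}x -> keep {keep_fraction:.4f} of reads")
```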

The results of the short-read downsampling show three main findings: (a) despite the dramatic reduction in read coverage, the total number of variants remains highly consistent (Supplementary Fig. 1); (b) changes in the number of variants occur mainly within the masked regions, confirming their liability to be erroneous; and (c) the proportion of downsampling-associated erroneous variants is comparable to the proportion of discrepant variants between long- and short-read sequencing (see Supplementary Tables 3, 4 and 5).

Discussion

When compared to the short-read technology, SNV and indel genotyping of PacBio’s long-read data is notably consistent. However, due to resource limitations, only three cases with both long- and short-read whole-genome sequencing were available; therefore, the total number of heteroplasmic variants was limited. Nevertheless, the downsampling process could still provide a better understanding of the relationship between the accuracy of long-read genotyping and other parameters, including the depth of coverage and other possible sample-specific factors such as DNA quality and library preparation. The fluctuating kappa values at different coverages for different samples can be partially explained by the depth of coverage, as shown in Supplementary Tables 3, 4 and 5. The random downsampling confirms two findings: (1) reducing the coverage does not necessarily lead to a corresponding reduction in the number of variants; on the contrary, due to the noise in long reads caused by random indel errors, the model starts to erroneously call false variants as the coverage falls below around 37 reads; (2) long-read random errors are responsible for generating the false variants at low coverage, unlike short reads, where removing fractions of the reads does not lead to a significant increase in false variants. This indicates that, as expected, the hidden Markov model of the Arrow algorithm requires a certain depth of reads to accurately call variants, rather than relying predominantly on a majority vote of the reads at each position.

Therefore, coverage is critical to the reliability of SNV genotyping from single-pass long-read data.

The attempt to confirm discrepant variants in the tagged regions using Sanger sequencing was unsuccessful, as described in the “Results” section, due to a combination of the low complexity of the genomic sequence at the discrepant sites and the technical limitations of Sanger sequencing when confirming variants with low VAFs.

In a study of more than 1500 cases from the ClinSeq study (NHGRI, USA), Beck, Biesecker, and colleagues [35] concluded that using Sanger sequencing as the gold standard for confirming NGS variants is not always appropriate. In that study, over 5800 NGS variants were analyzed by Sanger sequencing, and in some cases Sanger sequencing rejected true positive variants rather than eliminating false positive ones.

The high agreement between the long- and short-read technologies indicates that using long-read sequencing for genotyping short variants, in addition to structural variants, might be a highly cost-effective choice. However, larger studies with more samples could provide stronger evidence for or against this conclusion.

The high consistency of genotyped variants under short-read downsampling demonstrates the expected robustness and accuracy of short-read data; however, this does not exclude the possibility of persistent erroneous genotypes due to systematic errors. The fact that the number of erroneous variants fluctuating with downsampling is comparable to the total number of discrepant variants between short- and long-read sequencing, across the three samples, is possibly due to the effects of DNA quality, regardless of the sequencing technology.

Finally, although PacBio’s recently developed HiFi consensus sequences can provide more accurate reads than standard subreads by performing multiple passes over the same DNA segment, HiFi remains costly, and the main scope of this study was to compare single-pass long-read and short-read sequencing data.