Illumina reads correction: evaluation and improvements

The paper focuses on the correction of Illumina WGS sequencing reads. We provide an extensive evaluation of the existing correctors. To this end, we measure an impact of the correction on variant calling (VC) as well as de novo assembly. It shows, that in selected cases read correction improves the VC results quality. We also examine the algorithms behaviour in a processing of Illumina NovaSeq reads, with different reads quality characteristics than in older sequencers. We show that most of the algorithms are ready to cope with such reads. Finally, we introduce a new version of RECKONER, our read corrector, by optimizing it and equipping with a new correction strategy. Currently, RECKONER allows to correct high-coverage human reads in less than 2.5 h, is able to cope with two types of reads errors: indels and substitutions, and utilizes a new, based on a two lengths of oligomers, correction verification technique.

1 Experimental results

Human variant calling
Figure 1 shows results for H. sapiens VC with Strelka, evaluated with hap.py (similar to the main maper).Missing bars correspond to experiments, when the correction failed.Due to low best (i.e.maximizing geometric mean of F-1 scores for SNPs and indels) k values for Lighter and Musket we performed the experiments also for the maximal odd values of that parameter (k = 31 and k = 27, respectively -for higher values the results do not change).The results are shown on Figure 2.
Figure 3 shows results for H. sapiens VC with DeepVariant, evaluated with hap.py.In the experiments, we used the same reads like for experiments with Strelka and hap.py.
Results are different from the ones obtained with Strelka.Especially, there are observable poor outcomes of RECKONER and BFC for 15× set.In general, results similar to the uncorrected reads are obtained just for practically non-correcting Lighter and Musket and CARE (which was not able to perfrom correction for 45× and 60×).For indels, CARE and -for at least 30× depth -BFC achieve similar results to the raw reads, but it may be due to none of them is designed to correct indels.
Figure 4 shows results for H. sapiens VC with Strelka, evaluated with Syndip.In the experiments, we used the reads from a ERR1341796 dataset.To generate different sequencing depths we shuffled the read pairs and extracted a number of them from the beginning of the shuffling result, similarly like for A. thaliana.The whole input dataset sequencing depth is 55×, hence we chose it instead of 60×.

Variant calling results -genotype concordance
Figure 5 shows genotype concordance sensitivity and precision results for H. sapiens VC with Strelka.Measures were computed with GATK.

Variant calling results -homo-vs heterozygous variants
Figure 6 shows F1-score results H. sapiens VC with Strelka, evaluated with hap.py (similar as shown in Figure 1) separately for homorozygous and heterozygous variants.We calculated the values by splitting both GT and obtained variants to two sets: containing 1/1 (or 1|1) and 0/1 (or 0|1) GT fields, respectively.Then we evaluated the sets with hap.py.
Unfortunately, the only algorithm with an option for homo-and heterozygous data differentiate -Karect -was not able to correct the reads.In a case of another algorithms clearly visible is a huge improvement of heterozygous SNPs for 15× set.It is a desirable behavior, as one can expect, that the non-heterozygous-aware algorithms may degrade these case.For higher depths the phenomenon gradually disappears, and finally results for the homo-and heterozygous variants become similar, expect for RECKONER 2, which deal better with heterozygous indels, and Blue, which generally seems to be strongly adapted to homozygous variants.

A. thaliana variant calling
Figure 7 shows additional results for A. thaliana VC with Strelka, evaluated with hap.py. Figure 8 shows results for A. thaliana VC with DeepVariant, evaluated with hap.py.In the experiments, we used the same reads as for experiments with Strelka and hap.py.

De novo assembly
Figure 9 shows additional de novo assembly measures for C. vulgaris.Figure 10 shows de novo assembly measures for C. vulgaris, assembled with Velvet.
Figure 11 shows de novo assembly measures of long MiSeq reads for P. syringae.Figure 12 shows de novo assembly measures for P. syringae, assembled with Velvet.

Reads mapping
Figure 13 shows reads mapping characteristics for C. vulgaris.Measures once and multiple denote a fraction of the reads mapped once and multiple times to the reference genome, respectively.Measures ins and del denote a fraction of reads affected by insertion or deletion, respectively.A measure frac(d) denotes a fraction of reads differing from the genome with an edit distance d.The simulation was performed with ART.As quality profiles we utilized the first reads of pairs from the sets DRR031158 (for length 100 bp) and SRR1802178 (for 150 bp).As k-mer lengths we used the best value for L100D20 cases.We examined a correction efficacy on NovaSeq reads.The reads may have the following different characteristics in contrast to a popular HiSeq instruments [1]:

Simulated reads
• quality binning -number of possible values of symbols a quality scores is reduced to 4 (in contrast to 8 or about 40 in older machines).Correctors may be affected when they rely detection or correction strategies on scores values, what is a case in some algorithms; in the other algorithms we have lack of information about quality scores utilization.Moreover, the binning may be especially important, when the algorithm rely just on quality scores to detect erroneous regions in the read, as it was done in Trowel [4] (discussed in one of our previous work [2]); • overestimated quality scores -probabilities of substitution errors coded in quality scores are lower than the real error rates.Correctors may be affected under circumstances as in the above point; • different, depending on position substitution rates of different errors -probability of substitution between a specified symbol and another one is not constant, but depends on values symbols and position in the read.Correctors may be affected, when an assumption about unique characteristics of different read regions is made.Generally, it is a case in all of the correctors.
Table 1 shows a potential vulnerability of different correctors on those traits.After correction a single read mapped to the locus, causing a new false variant to be called.However, the read seems to have a sequence identical with the genome.
3 Data sources As the ground truth sets for human variants evaluated with Syndip we used the set available with Syndip itself.
Table 3 shows accession numbers of read datasets, reference genomes identifiers and versions of ground truth VC sets used in the experiments .The genome lengths were expected ungapped length values from the NCBI database: https://api.ncbi.nlm.nih.gov/genome/v0/expected_genome_size?species_taxid=, where the taxid values are shown in Table 4.

Experiments details
If possible, all the algorithms were parametrized with gzipped paired reads.When the algorithm was not supporting paired reads, we concatenated the input files, passed them to the algorithm and the splitted the output file.If the algorithm was not supporting gzipped input, we passed decompressed ones.
The measurable values were chosen as follows: • <alpha> -as the authors specified, the value should be inversely proportional to sequencing depth and for depth 70 should be 0.1, so the value equals 7 <depth> , • <cutoff> -the value of cutoff threshold generated by RECKONER, • <kmers> -the value ,,No. of unique k-mers" returned by KMC [3].

Figure 15
Figure 15 shows results of correction reads generated in silico.The measures are defined as follows: gain = |T P |−|F P | |T P |+|F N | , sensitivity = |T P | |T P |+|F N | , precision =

Figure 16 :Figure 17 :Figure 18 :Figure 19 :Figure 20 :Figure 21 :
Figure 16: ERR174324, chromosome 1, locus 818,604, site of FP SNP corrected by RECKONER 2. A single G symbol in the uncorrected reads caused a false variant detection, its successfull correction caused the variant to disappear.
Correctors oligomer length (k) for de novo assembly, reads mapping and simulated reads Algorithm C. vulgaris P. syringae C. vulgaris (simulated)

Table 3 :
Read datasets and genomes used in the experiments not shown in the main article

Table 5
presents versions of correctors used in the experiments.The other used algorithms were shown in Table6.