Denoising of Aligned Genomic Data

Noise in genomic sequencing data is known to have effects on various stages of genomic data analysis pipelines. Variant identification is an important step of many of these pipelines, and is increasingly being used in clinical settings to aid medical practices. We propose a denoising method, dubbed SAMDUDE, which operates on aligned genomic data in order to improve variant calling performance. Denoising human data with SAMDUDE resulted in improved variant identification in both individual chromosome as well as whole genome sequencing (WGS) data sets. In the WGS data set, denoising led to identification of almost 2,000 additional true variants, and elimination of over 1,500 erroneously identified variants. In contrast, we found that denoising with other state-of-the-art denoisers significantly worsens variant calling performance. SAMDUDE is written in Python and is freely available at https://github.com/ihwang/SAMDUDE.

In this supplementary document, we provide details of our implementation. We also discuss the supplementary results that support the main text.

Data coverage
We tested SAMDUDE on three different paired-end WGS datasets of the H. Sapiens individual NA12878. The datasets (referred to by run accession number) are: ERR262997 with to 30×-coverage, CEUTrio.HiSeq.WGS with to 100×-coverage, and NA12878 V2.5 Robot 2 with to 40×-coverage, which we refer to these datasets as 1, 2 and 3, respectively in the main text. Denoising was also tested on a dataset with lower (15×) coverage, paired-end WGS dataset ERR174324. The results are shown in Supplementary Table 6, and demonstrate an overall lack of improvement and actual slight negative effect of SAMDUDE denoising. These results are not entirely surprising, given that SAMDUDE makes significant use of alignment information in order to estimate the noise channel, create counts vectors and to perform denoising, so we expect better denoising performance with higher coverage. Based on these results, we focused our efforts on denoising data with 30× or higher coverage, namely datasets 1, 2 and 3. This tends to be the coverage range of WGS data used in practice, especially for clinical purposes.

SAMDUDE operation
The following subsections discuss specific aspects of SAMDUDE operation used in our experiments.

Choice of k
For all denoising experiments, we used a single-sided context length of k = 7 (14 bases total in the double-sided context). This choice is rooted in the theory underlying DUDE, as well as in practical considerations. For discrete universal denoising using DUDE, the optimal single-sided context length k n depends on both sequence length and alphabet size: k n = c log M n with c < 1 2 , noise-free sequence alphabet size M and noise-free sequence length n [1]. Intuitively, the optimal context length maximizes the number of times each context is counted without skewing the context histograms towards a uniform distribution, which occurs when k is either too small or too large. In the genomic sequencing setting, M= 4, and n-interpreted as chromosome length-ranges from 51 × 10 6 up to 248 × 10 9 for somatic human chromosomes. For these values, k n = 5 or 6.
Supplementary Table 4 shows the results of denoising extracted SAM files for chromosomes 11 and 20 from dataset 1 using k = 5 and 6, and confidence probability threshold t p = 0.9. These values of k resulted in improvements in both S and P, but very little change in F-score. These results seemed to indicate that k was too short, since when k is insufficiently long, information from genomic locations other than the one under consideration may be incorporated, leading to artificially inflated counts in the m vectors. We hypothesized that denoising performance might be improved by limiting denoising decisions to reads that map closer to the one under consideration. To test this hypothesis, we tried larger values of k. Intuitively, larger k should ensure that most context counts are taken from the same pileup, while still allowing counts information to be obtained from misaligned or poorlymapped reads found elsewhere in the SAM file. Results for denoising dataset 1 with k = 10 for chromosome 20 are shown in the last line of Supplementary Table 4. Although the best denoising performance was obtained for k = 10, a computationally prohibitive amount of memory was required to store the context vectors. Thus, we decided to use k = 7.

Choices of t m and t p
For sequence and channel estimation we used a majority threshold of t m = 0.9 for a high-confidence genomic sequence estimate, and also to eliminate the confounding effects due to heterozygous genomic positions which might not have a clear majority base.
The choice of confidence probability threshold t p is a tradeoff. Too high a threshold might prevent the correction of true sequencing errors, while too low a threshold might result in SAMDUDE attempting to denoise where there are no errors. Supplementary Table 5 shows the results of denoising dataset 1 with t m = 0.9 and t p = 0.99, corresponding to a quality score of 20. Compared to the results in Supplementary Table 4, there is less improvement in sensitivity (in fact, worsening of sensitivity for chromosome 20 using k = 6), especially for the GATK filtered variant calls. Additionally, histograms of quality score changes shown in Supplementary Fig. 7 demonstrate that generally SAMDUDE quality score updating tends to shift the quality score distribution towards smaller values. In other words, SAMDUDE denoising decisions tend to decrease the "certainty" of the corrected base. It is unclear how this directly affects variant calling, but intuitively it makes sense to avoid lowering the quality score of bases that were called by the sequencer with high confidence. Finally, a lower threshold of t p = 0.9 has the additional benefit of reducing computational runtime and preventing over-processing of the original data. Based on these results, we chose to use a lower confidence probability threshold of t p = 0.9 for our denoising experiments.

Variations on SAMDUDE
Two "variations" on SAMDUDE were presented in the "Human chromosome denoising with SAMDUDE" subsection of the Results section in the main text: partial denoising and random noise. The results of these variations are summarized in Supplementary Table 1. Supplementary Table 2 lists the total number of bases in the original SAM files, as well as the percent of base changes under each denoiser in the chromosome files for each data set. In all curves, the performance of SAMDUDE is compared with lossy quality score compressors P-Block and R-Block, as well as with Musket and RACER. For ease of visualization, the results of denoising with BFCounter and Lighter are omitted.

Computational details
State-of-the-art denoising software In this section we provide the commands and parameters used for denoising with Musket, RACER, BFCounter and Lighter.

Musket
Musket was invoked using all default parameters, and the number of threads t specified with the -p parameter.

Variant calling pipeline
In this subsection we describe the variant calling pipeline.

Preprocessing
Prior to variant calling, all data was preprocessed using the steps recommended by the Broad Institute's Genome Analysis Toolkit (GATK) [2,3,4]. They are: conversion of the SAM file to the BAM format, file sorting, duplicate reads marking, group name adding, file indexing and quality score recalibration.
If the denoised file is in FASTQ format, alignment to a reference genome must be performed prior to applying all other preprocessing steps. This was done using the BWA mem alignment program and NCBI build 37 of the human reference [http://www.ncbi.nlm.nih.gov/assembly/GCF$\ _$000001405.13/] as the reference genome. We included the -M option for compatibility with Picard tools (http://broadinstitute.github.io/ picard/), and used the -t option to specify the desired number of threads for computation. The SAM file was converted to BAM format using SAMtools [?], specifying the number of threads with the -@ option and inclusion of the SAM header with the -h option.
$ samtools view -@ num_threads -b -h aln . sam > aln . bam The BAM file was sorted using SAMtools, with a temporary file prefix specified with the -T option, and the output file format specified with the -O option.
$ samtools sort -T ./ tmp -@ num_threads -O bam aln . bam > aln . sorted . bam Duplicates were marked using Picard tools, with M specifying the file to which metrics calculated during the duplication marking process were written. Note that marking the duplicates is sufficient to exclude them from downstream processes. Finally, the quality scores were recalibrated using the GATK Base Quality Score Recalibration workflow [2], and NCBI build 37 of the human reference as the reference genome.

Variant calling and filtering
We used the GATK Haplotype Caller for variant calling, specifying the target chromosome with the -L option.         Supplementary Figure 6: The same sensitivity vs. precision curves as in Supplementary Fig. 5, but with the rightmost point in each subplot removed for ease of visualization.  Figure 7: From top to bottom: quality score histograms for chromosome 20 reads taken from datasets 1, 2 and 3. Histograms summarize all original quality scores (light orange and outlined, right-hand axis) and scores only of bases changed under the SAMDUDE denoising rule using k = 7, t m = 0.9 and t p = 0.9 (dark orange, left-hand axis).