Detecting and phasing minor single-nucleotide variants from long-read sequencing data

Cellular genetic heterogeneity is common in many biological conditions including cancer, microbiome, and co-infection of multiple pathogens. Detecting and phasing minor variants play an instrumental role in deciphering cellular genetic heterogeneity, but they are still difficult tasks because of technological limitations. Recently, long-read sequencing technologies, including those by Pacific Biosciences and Oxford Nanopore, provide an opportunity to tackle these challenges. However, high error rates make it difficult to take full advantage of these technologies. To fill this gap, we introduce iGDA, an open-source tool that can accurately detect and phase minor single-nucleotide variants (SNVs), whose frequencies are as low as 0.2%, from raw long-read sequencing data. We also demonstrate that iGDA can accurately reconstruct haplotypes in closely related strains of the same species (divergence ≥0.011%) from long-read metagenomic data.


Introduction
called Random Subspace Maximization (RSM) to estimate the maximal conditional substitution rate   Figure 1C, on the Bordetella spp. data, 116 the real SNVs and the sequencing errors are highly distinguishable based on the maximal conditional 117 substitution rate calculated by the RSM algorithm. 118 It is very important to note that the number of dependent loci p should not be fixed. Figure   119 S1 shows an example that fixing p can induce false negatives. In this example, the substitution 120 at the locus 1 is independent with the substitutions at locus 2 and locus 3 respectively, but highly 121 dependent on the combination of the substitutions at locus 2 and locus 3. Thus, the SNV at locus 1 122 is difficult to be detected if p is fixed to 1, but is easy to be detected if there is no restriction on p.

123
The existing algorithms V-Phaser and V-phaser2 15,16 were designed to identify minor variants from 124 short-read sequencing data and only leveraged dependence between substitutions at two loci to avoid 125 combinatorial explosion. This is equivalent to fixing p to 1, and making these algorithms unable to 126 detect the SNVs in Figure S1. The proposed RSM algorithm has no restriction on p and can avoid 127 combinatorial explosion. 128 If a SNV is the only SNV in the genome, we call it an orphan SNV. The proposed framework 129 that uses conditional substitution rate to detect SNVs cannot detect orphan SNVs because its basic 130 assumption is that there are multiple real SNVs in the same genome. We propose a single-locus based 131 algorithm to overcome this limitation ( Figure 2D). We discovered that substitution error rate is very 132 different from locus to locus and it is highly predictable by sequence context (Figure 3). We trained 133 a gradient boosting model 31 on independent public data and predicted substitution error rate for 134 each locus. We then adopted a likelihood ratio test to compare the observed substitution rate to the SNVs are masked before applying ANN algorithm. A major advantage of ANN algorithm is that it 143 can estimate the number of clusters automatically while clustering the reads. To reduce false positive 144 rate of the draft contigs, we adopted a two-step filter to remove unreliable draft contigs ( Figure 2G).  " is the encoded substitution at locus i. Use the maximal conditional probability of { ()* = 3348} in all the subspaces to determine whether locus 837 has a SNV.

CTGTCAGACAACGAAATGGAC
Base X = A, C, G, or T The locus of X has a detected SNV

Read
Reference CGCGTCAGACXGACGAACTGAC

Aligned to
The substitution of the read is X if the the reference with X has the largest alignment score.   Intuitively, the SNVs in the same draft contig should be dependent with each other and the difference   146 between two similar draft contigs should be statistically significant.

147
The lengths of the draft contigs are usually smaller than genome size. To maximize the range where 148 the minor SNVs can be phased, we assemble the draft contigs using an algorithm inspired by overlap 149 graph 32 ( Figure 2H) (details are in Methods). The assembled draft contigs are called contigs.

150
Evaluating performance on pooled PacBio sequencing data 151 We constructed two datasets to test the accuracy of iGDA. The first dataset is a mixture of PacBio   the total number of aligned reads. The relative abundances of the samples distinct from the reference genome range from 0.25% to 3.05% for the Bordetella spp. data, and range from 0.30% to 1.92% for 177 the Escherichia coli data. The average relative abundances are 0.82% and 0.74% for the Bordetella 178 spp. data and the Escherichia coli data respectively.

179
For detecting minor SNVs, we tested three algorithms-a single-locus method (SL), which simply 180 uses substitution rate of each locus to detect SNVs; a context-aware single-locus method (SLC), which 181 uses substitution rate of each locus with correcting sequence-context effect (details are in Methods); 182 and the proposed RSM algorithm-on these two test datasets. The results indicate that RSM algorithm 183 greatly outperforms the two single-locus methods, and achieves a high accuracy ( Figure 4A and Figure   184 4B). With masking bases with QV lower than 8, iGDA detected 96.7% and 85.8% of the real SNVs 185 at false discovery rate (FDR) lower than 1% for the Bordetella spp. data and Escherichia coli data 186 respectively. Besides, correcting sequence-context effect substantially increases detection accuracy of 187 the single-locus methods. The threshold of base QV also has minor impact on the accuracy. A non-zero 188 threshold increases the accuracy on the Bordetella spp. data ( Figure 4A), but decreases the accuracy 189 on the Escherichia coli data ( Figure 4B). This might be because masking bases with low QV removes 190 some sequencing errors but reduces effective sequencing depth.  Figure 5B and Figure S2. The results show that 198 the iGDA-inferred contigs match the real contigs very well, even for the real contigs with frequencies 199 lower than 1%. In Figure 5B, there are five real contigs that are not detected by our algorithm. One  Evaluating performance on pooled ONT sequencing data 205 We tested iGDA on a dataset consisting of a mixture of ONT sequencing data of 65 Klebsiella pneu-206 moniae samples. The SRA IDs are listed in Table S2. We downloaded the raw data in fastq format 207 from the SRA database (https://www.ncbi.nlm.nih.gov/sra), filtered and trimmed the reads using  (Table S2) from NCBI (https://www.ncbi.nlm.nih.gov/assembly), and aligned the assembled 217 genomes to the reference genome using MUMmer 36 . The union of the SNVs reported by MUMmer 218 were used as benchmark to evaluate accuracy of detecting SNVs. The genome sequence of an indi-219 vidual sample is defined as a real contig, and was used to evaluate the accuracy of contigs reported 220 by iGDA. We used the same method in the previous section to merge identical samples and obtain 221 the relative abundance of each sample. The relative abundances range from 0.20% to 9.30%, and the 222 average relative abundance is 3.20%.

223
Due to the unique sequencing mechanism of ONT, DNA methylation can affect the raw sequencing 224 signal and substantially increase the base-calling error rate of methylated bases ( Figure S3). The 225 base caller used in the public ONT data in this study is Albacore (version 2.0) (https://github.com/ 226 Albacore/albacore). To avoid the impact of DNA methylation, we developed an algorithm to identify 227 DNA methylation motifs in bacteria without using raw-signal of ONT data (details are in Methods). 228 We masked loci within 5 bases to the DNA methylation motifs before applying iGDA to this dataset. 229 The result shows that the RSM algorithm substantially outperforms the single-locus methods to 230 detect minor SNVs, and achieves a high accuracy ( Figure 4C). With DNA methylation and bases with 231 QV lower than 10 masked, iGDA detected 92.8% of the real SNVs at FDR lower than 1%. With 232 masking no DNA methylation but masking bases with QV lower than 10, iGDA detected 41.3% of the 233 real SNVs at FDR lower than 1%. Thus, masking DNA methylation increases the accuracy of the RSM 234 algorithm ( Figure 4D), which demonstrates the importance of removing DNA methylation or applying  and Figure 4D). In contrast to PacBio data, correcting sequence context does not significantly increase 238 detection accuracy of the single-locus methods. We speculate that this is because the prediction power 239 of sequence context on the ONT data is weaker than that on the PacBio data ( Figure 3).

240
DNA methylation has a large impact on the accuracy of phasing minor SNVs. With masking loci 241 affected by methylation and bases with QV lower than 10, the average accuracy of assembled contigs 242 is 90.7% ( Figure 5A). However, without masking loci affected by methylation, the average accuracy of 243 assembled contigs is only 54.5% ( Figure 5A). An IGV snapshot of methylation-masked contigs is shown 244 in Figure S4.   Figure 6B).

272
It is worth to note that some genome regions in Figure 6A are not covered by any contig. We call 273 these regions missed regions, and call the SNVs not covered by any contig missed SNVs. We found 274 that there are usually multiple strains that are highly similar to each other in the missed region. In  Table S3, and reran iGDA on the new data.

286
The result shows that the accuracy of each contig is not significantly changed by excluding highly 287 similar strains ( Figure S7B). However, the length of contigs and proportion of SNVs covered by contigs 288 are substantially increased ( Figure S7C and Figure S7D). The species other than Borrelia burgdorferi 289 have limited impact on the results, because most of the reads from these species (Table S3)  For the ith aligned read, we encode its substitution at locus k of the reference genome by the following 318 formula: , where r ik is the base (short for nitrogenous base) of the ith aligned read at locus k, t k is the base at 320 locus k of the reference genome and is an empty element, which is formally defined by { } = ∅. The 321 first locus of the reference genome is 0 throughout this paper unless otherwise stated. The ith read is 322 represented as a set of substitutions and its covering range (Figure 2A), and is denoted by . b i and e i are the start and end loci of the region covered by the read respectively, and S i is . The most intuitive way to detect SNVs is to use the substitution rate of each locus. Formally, we 325 denote the encoded substitution at locus k as a random variable X k , and denote probability of the 326 event {X k = x k } as P r(X k = x k ), where x k ∈ {4k, 4k + 1, 4k + 2, 4k + 3}. Substitution rate is defined 327 as the estimated P r(X k = x k ), which is , where {·} is a set and |·| is the number of elements in a set. Intuitively, in equation (4), the numerator 329 is the number of reads with substitution x k at locus k, and the denominator is the number of reads 330 covering locus k. Due to the high error rate of long-read sequencing data, it is inaccurate to detect minor 331 variants using substitution rate alone ( Figure 1B). Herein, we leverage the information of multiple loci 332 to increase the detection accuracy. Assuming sequencing errors are independent with each other, real 333 SNVs are likely to be present if there are multiple reads containing the same set of substitutions ( Figure   334 1A). The conditional probability of {X k = x k } given other real SNVs of the same genome is therefore 335 much larger than the marginal probability of {X k = x k } if x k is a real SNV, because these real SNVs 336 are positively dependent ( Figure 1A and Figure 1C). Formally, the conditional probability of event . Intuitively, in equation (5) {P r(X k = x k |X g 1 = x g 1 , X g 2 = x g 2 , ..., X gp = x gp )} . The substitution x k is detected as a real SNV if H(x k ) is larger than a threshold (0.65 in this study).

345
H(x k ) is also called maximal conditional substitution rate. To avoid high variance of the estimated 346 P r(X k = x k |X g 1 = x g 1 , X g 2 = x g 2 , ..., X gp = x gp ) (equation (5)), we require that |{i | {x g 1 , x g 2 , ..., x gp } ⊆ The greedy algorithm and its theoretical accuracy 362 We introduce a fast but inaccurate greedy algorithm to estimate H(x k ) (equation (6)), and then is the covering range of read R i (equation (2)). We estimate P r(X k = x k |X g = x g ) by equation complement of a set), and sort the loci according 368 to P r(X k = x k |X g = x g ) in descending order. The sorted loci are denoted as {s 1 , s 2 , ..., s t l +tr }, and

373
The naive greedy algorithm described above avoids combinatorial explosion but might have low 374 accuracy. We assume x k , x g 1 , x g 2 , ..., x g p are p + 1 real SNVs of the same genome, and x g 1 , x g 2 , ..., x g p 375 are the only p substitutions that can maximizeP r(X k = x k |X g 1 = x g 1 , X g 2 = x g 2 , ..., X gp = x gp ).

381
[k − t l , k + t r ] ∩ {k, g 1 , g 2 , .., g p } c . For any locus g s ∈ {g 1 , g 2 , .., g p }, the probability that it is selected . The probability that the greedy algorithm correctly estimates H(x k ) is . According to inequation (7), assuming t l ≥ 2000, t r ≥ 2000, and p = 1, which is a typical setting 387 for long-read sequencing data, the probability that the greedy algorithm correctly estimates H(x k ) is 388 less than 3.5 × 10 −18 even if ρ 0 = 0.99. The key factor leading to the failure of the greedy algorithm 389 is selecting from too many loci (t l + t r loci). We propose a novel algorithm called Random Subspace 390 Maximization (RSM) to reduce the number of loci to be considered in the next section.

391
Improving accuracy of the greedy algorithm by Random Subspace Maximization 392 Firstly, we measure the similarity between two reads, R i and R j , by a modified Jaccard index 34 , which 393 is defined by where l min is the minimal length of the overlap region between the two compared reads. We used 396 l min = 0.5(e i − b i ) in this work. Intuitively, the Jaccard index between two reads is the ratio between 397 number of common substitutions shared by the two reads and the total number of substitutions of the 398 two reads in their overlapped region. Then, for a read R i , we select w most similar reads according to by R i and R j . Formally, Figure 2B), and we can generate w × m subspaces if there are m reads. 402 We used w = 100 in this work. For a substitution X k ∈ C ij , we estimate its maximal conditional 403 probability of {X k = x k } in subspace C ij , which is defined by , using the greedy algorithm described in the previous section by only considering the substitutions 405 in C ij . Thus, compared to the original greedy algorithm, the number of loci to be considered is 406 substantially reduced. We then use to estimate the maximal conditional probability of {X k = x k } defined by equation (6). Without loss of generality, we denote {x g 1 , x g 2 , ..., x gp } as the only set of substitutions that maxi-413 mizesP r(X k = x k |X g 1 = x g 1 , X g 2 = x g 2 , ..., X gp = x gp ), and Ω as the set of subspaces contain-414 ing {x k , x g 1 , x g 2 , ..., x gp }. For a subspace C t ∈ Ω, the probability that the greedy algorithm finds where H k is defined by equation (6). The 416 probability that RSM algorithm finds {x g 1 , x g 2 , ..., x gp } is . Assuming P r(Ĥ Ct (x k ) = H(x k )) > 0, and according to chain rule of joint probability, As sequencing depth increases, |Ω| increases, and P r( ∩ Ct∈Ω {Ĥ Ct (x k ) = H(x k )} ) converges to 0. Thus, 420 P r(Ĥ(x k ) = H(x k )) (equation (11) ) converges to 1 as sequencing depth increases. Intuitively, with 421 infinite sequencing depth, RSM algorithm is guaranteed to detect real SNVs correctly if these SNVs 422 have larger maximal conditional probabilities than sequencing errors.

423
Detecting orphan SNVs by correcting sequence context effect 424 As RSM algorithm requires multiple real SNVs, it can not detect orphan SNVs. An orphan SNV is 425 the only SNV of the genome. We have to rely on the single-locus algorithm described in equation (4) 426 to detect orphan SNVs. However, the substitution rate of a locus is not only affected by real SNVs 427 but also affected by the sequence context of the locus. We built a gradient boosting 31 model to learn 428 the sequence context effect and corrected it by the following likelihood ratio method ( Figure 2D). For 429 a substitution x k at locus k, its likelihood ratio is , where Binomial(x; n, p) is the probability mass function of binomial distribution with parameters n 431 and p, and we also required a detected SNV has a substitution rate higher than 0.1 for PacBio data and 0.2 for 435 ONT data respectively.

436
Modeling sequence context effect on sequencing error rate

437
Error rate of long-read sequencing is strongly affected by sequence context (Figure 3). For locus i, 438 we define its one upstream homopolymer and one downstream homopolymer as its sequence context 439 ( Figure S8). We adopted the gradient boosting model implemented by xgboost ( and maximal depth of trees (max_depth in xgboost) and used the parameters with the highest five-fold 448 cross-validation accuracy (Table S5). We used R 2 as the measurement of accuracy, which is defined by where y i is the substitution rate of a sequence context,ŷ i is the predicted substitution rate,ȳ is the   (2)), we use , where S i is defined in equation (3).

470
The intuitive idea of ANN algorithm is that all loci should be homogeneous by piling up the reads 471 in each cluster ( Figure S9). A locus is homogeneous if it satisfies the following condition. For locus k, 472 its substitution rate satisfies , where x k ∈ {4k, 4k + 1, 4k + 2, 4k + 3}. In this work, We set p lim = 0.2 for the PacBio data and

482
A problem of the algorithm described above is that the alignment is affected by reference bias 483 and homogeneous loci could be mistaken for heterogeneous loci. Reference bias is the phenomenon 484 that the substitution rate of a real SNV at a homogeneous locus is significantly lower than 1 − 485 substitution error rate ( Figure S10A). are 2, -4, -4, and -2 respectively, and the score of a base aligned to base N or a masked low-QV base is 0.
detected SNV. For each read, the modified base in the reference sequence with the highest alignment 493 score is recorded as a substitution of the read ( Figure 2E and Figure S11). We tested the realignment 494 method on a single Escherichia coli dataset (SRA ID is ERS718594), which is presumably homoge-495 neous. The result shows that local realignment can substantially reduce reference bias ( Figure S10B).

496
The average substitution rate of loci with real SNVs is 84.8% before realignment, and the average 497 substitution rate of loci with real SNV is 95.9% after realignment. We performed local realignment 498 before ANN algorithm in our analysis.

499
Filtering draft contigs 500 To reduce false positive rate of the inferred draft contigs by ANN algorithm, we adopted a two-step 501 algorithm to filter the draft contigs ( Figure 2G). In the first step, we tested whether the frequency of 502 each individual SNV in each contig is significantly higher than the sequencing error rate and whether 503 SNVs in each contig are independent using Bayes factor. The contig is filtered if the frequency of 504 any of its SNVs is not significant and its SNVs are independent ( Figure S12A). In the second step, 505 we compared the contigs pairwise, and the contig with lower abundance in each pair is filtered if the 506 contigs are not significantly different according to Bayes factor ( Figure S12B).

507
Assembling draft contigs 508 The length of the draft contigs obtained by ANN algorithm is usually smaller than genome size, except 509 in a few cases like a virus genome. Therefore, we have to assemble the draft contigs to obtain the 510 whole picture of the underlined genomes in the sequenced sample. We borrowed the idea of overlap 511 graph 32 from de novo genome assembly to assemble the draft contigs. We denoted each draft contig has more than one parental vertices; 3) in-degrees and out-degrees of the vertices other than the start 522 vertex and the end vertex are 1 ( Figure 2H and Figure S13C). We then filtered the contigs using the 523 two-step filter introduced in the previous section. We calculated the Jaccard index of each read to all 524 the contigs, and assigned the read to the contig with the largest Jaccard index. A read is assigned to 525 the reference genome if its largest Jaccard index is smaller than 0.5.

526
Detecting bacterial methylation motifs from ONT data without raw signal 527 As the raw-signal files of ONT data are usually huge and not publicly available, we developed an  Evaluating the minimal divergence that two conspecific strains can be ditinguished 543 We only retained the iGDA-reported contigs that is 100% identical to a true genome sequence and 544 only has an unique closest true genome sequence. These retained contigs can be used to distinguish 545 conspecific strains. We calculated the divergence between two contigs by 546 Divergence(contig 1, contig 2) = number of different SNVs length of overlapped region We used "nucmer -c 150 -g 500 -l 12 -maxmatch" for alignment, and "show-snps -l -T -H" to obtain 551 SNVs. To avoid the impact of repeats we used "mummerplot −−filter" before "show-snps -l -T -H" 552 for the metagenomic data.