Building redundancy into fluorogenic sequencing makes for error-free DNA sequence reads.
Sequencing methods are dominated by commercialized technologies—so much so that most academic researchers consider the field too risky to wade into. That has not deterred Yanyi Huang, a researcher at Peking University, from exploring better ways to generate very accurate DNA reads at high throughput. His forays began after a conversation with his departmental colleague X. Sunney Xie, who published a method in 2011 that couples pyrosequencing with fluorogenic output.
In fluorogenic pyrosequencing, a polymerase synthesizes DNA opposite a template strand. A, C, G and T nucleotides are added one at a time with washes in between, and each carries a chemical modification that releases a fluorescent signal only when the nucleotide is incorporated by the polymerase. The process is rapid, but stretches of the same base are hard to decipher because the signal they generate is not proportional to the length of the stretch.
Huang noticed a key inefficiency in pyrosequencing: “At every given cycle, you have a big probability that the nucleotide you add wouldn't be added, so you have a dark cycle—zero signal.” He had been reviewing the seminal 1977 paper by Allan Maxam and Walter Gilbert on DNA sequencing by chemical degradation for a class that he taught, and he realized that the authors used signals from overlapping sets of cleaved bases to determine the original sequence. This made him wonder whether nucleotide combinations could be worked into pyrosequencing. “I talked to my student and said, why don't we try to do two bases?” recalls Huang. The idea evolved into sequencing the same template three times with different alternating pairs of nucleotides (e.g., round 1 with A + C and G + T, round 2 with A + G and C + T, and round 3 with A + T and C + G) and removing synthesized strands between rounds. The process generates three orthogonal degenerate sequences, from which the underlying sequence can be decoded unambiguously.
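The three-round scheme can be illustrated with a minimal sketch. It is an idealized, per-base simplification rather than the authors' implementation: the names `ROUNDS`, `encode`, `run_lengths` and `decode` are hypothetical, and a real instrument reports one fluorescence intensity per reagent cycle instead of clean 0/1 labels. Each round labels every base by which of the two nucleotide pools it belongs to, and the run lengths of those labels stand in for the per-cycle signal.

```python
from itertools import groupby

# Hypothetical labels for the three rounds described in the text.
# Each round splits the four bases into two pools that are added alternately.
ROUNDS = {
    "round1": ("AC", "GT"),
    "round2": ("AG", "CT"),
    "round3": ("AT", "CG"),
}


def encode(seq, pools):
    """Label each base 0 or 1 according to the pool that contains it."""
    first, _second = pools
    return [0 if base in first else 1 for base in seq]


def run_lengths(labels):
    """Idealized signal: number of bases extended in each successive cycle."""
    return [len(list(group)) for _, group in groupby(labels)]


def decode(labels1, labels2):
    """Recover the bases from the round 1 and round 2 labels alone.

    Any pool from round 1 and any pool from round 2 share exactly one base,
    so two error-free rounds pin down every position.
    """
    decoded = []
    for l1, l2 in zip(labels1, labels2):
        shared = set(ROUNDS["round1"][l1]) & set(ROUNDS["round2"][l2])
        (base,) = shared  # exactly one base is shared by the two pools
        decoded.append(base)
    return "".join(decoded)


template = "AACGTT"
r1 = encode(template, ROUNDS["round1"])
r2 = encode(template, ROUNDS["round2"])
print(run_lengths(r1), run_lengths(r2))
print(decode(r1, r2))
```

In this toy model each round alone is ambiguous, because a run of length three in round 1 could be any mix of A and C, yet the intersection of two rounds leaves a single candidate base at every position.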
Sequencing with only two orthogonal sets of nucleotides would be enough to decode the sequence, but adding a third round enables error correction. Despite the redundancy, the sequencing reaction does not take longer than traditional pyrosequencing. “You actually use the same amount of reaction compared with single nucleotide addition...but you provide 50% more information,” says Huang.
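Why a third round exposes errors can be seen in a small sketch, again under a per-base simplification that is an assumption of this example and not the published algorithm. If each base is labeled 0 or 1 in each round according to its pool, every valid base yields three labels with even parity, so a position whose labels have odd parity must contain a misread. The names `POOLS`, `encode_all` and `inconsistent_positions` are hypothetical.

```python
# Hypothetical pool layout: (pool labeled 0, pool labeled 1) for each round.
POOLS = [("AC", "GT"), ("AG", "CT"), ("AT", "CG")]


def encode_all(seq):
    """Return three label lists, one per round, for an error-free read."""
    return [[0 if base in first else 1 for base in seq] for first, _ in POOLS]


def inconsistent_positions(r1, r2, r3):
    """Positions whose three labels cannot come from any single base.

    A -> 0,0,0   C -> 0,1,1   G -> 1,0,1   T -> 1,1,0
    All four valid label triples have even parity, so odd parity flags an error.
    """
    return [i for i, (a, b, c) in enumerate(zip(r1, r2, r3)) if a ^ b ^ c]


r1, r2, r3 = encode_all("GATTACA")
print(inconsistent_positions(r1, r2, r3))  # no errors introduced yet

r2[3] ^= 1  # simulate a single misread in round 2 at position 3
print(inconsistent_positions(r1, r2, r3))
```

In this simplified code a parity check on its own only locates a single misread; it cannot say which round is wrong. Correcting the error requires additional context from neighboring signals, which is beyond the scope of this sketch.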
To implement their method effectively, Huang and his team tested dozens of fluorophores until they found Tokyo Green, a fluorophore with a very high signal-to-noise ratio. They also replaced the arrayed microchambers used in Xie's original work with sequence amplification directly on a glass flowcell for greater efficiency. The changes reduced the raw sequencing error rate to ∼0.5% in the first 200 nucleotides.
The researchers then developed algorithms using information theory to both detect and correct sequencing errors, something that is not possible with other platforms. They used dynamic programming to efficiently determine a globally optimal decoded sequence from among various possibilities in the presence of error. Their strategy, error correction code (ECC) sequencing, is essentially error free for the first 200 nucleotides. They also found a way to remove error in single-base stretches shorter than eight nucleotides to provide the high raw accuracy needed for ECC sequencing.
The published work uses a highly pure, identical viral template. But the researchers have since modified their protocol to enable cluster amplification, so that different sequences can be read out in the same flowcell. They are improving other aspects of the technology such as throughput and have licensed it to a company called Cygnus Biosciences. ECC sequencing could ultimately find broad use in applications that involve rare alleles, in which distinguishing true variants from sequencing artifacts is critical.
Chen, Z. et al. Highly accurate fluorogenic DNA sequencing with information theory–based error correction. Nat. Biotechnol. 35, 1170–1178 (2017).