Date: 2023-05-25

Before sequencing, DNA fragments of interest were circularized and captured on the surface of a flowcell. Clonal copies of DNA fragments were then created through rolling circle amplification, generating approximately 1 billion concatemers on the flowcell surface22,23,24,25. The resulting concatemers, referred to as polonies using the original term coined by Church and collaborators26, were used as the DNA substrate for sequencing. In contrast to the DNA nanoballs developed by Complete Genomics, polonies are amplified on-instrument following library hybridization to the flowcell27. This approach simplifies the user workflow and eliminates the possibility that DNA fragments may interact in solution during the amplification process.

We then constructed the avidite: a dye-labeled polymer with multiple, identical nucleotides attached. In the presence of a polymerase, the avidite was able to bind multiple complementary nucleotides specifically in concatemer copies of a DNA fragment within a polony. A polymerase and a mixture of four avidites, each corresponding to a particular label and nucleotide, were applied to the flowcell and used for base discrimination. The avidite was not incorporated, but provided a stable complex while enabling removal under specifically formulated wash conditions. Removal of the avidite left no modifications in the synthesized strand. The avidites decreased the required concentration of reporting nucleotides by 100-fold relative to single-nucleotide binding, yielded negligible dissociation rates and obviated the need to have nucleotides present in the bulk solution. A low avidite concentration leads to reduced use of fluorophores relative to the strategy of using high concentrations of dye-labeled nucleotides.
The advent of the avidite enabled us to separate the process of stepping along the DNA template from the process of identifying each nucleotide, and to optimize each for quality and reagent consumption. Figure 1a shows a complete cycle of avidity sequencing, Fig. 1b depicts a single avidite interacting with multiple DNA copies within a polony and Fig. 1c shows many avidites specifically bound to several polonies on the surface. Additional detail on the structure of one version of an avidite is provided in Extended Data Fig. 1.

Avidity sequencing overcomes the kinetic challenges of generating a signal by incorporation of a dye-labeled monovalent nucleotide. In bulk solution, incorporation of a dye-labeled nucleotide is limited by a specificity constant (k_cat/K_m) that governs the observed rate of productive nucleotide binding and incorporation28. For monovalent dye-labeled nucleotides and an engineered polymerase, we observed a specificity constant of 0.54 ± 0.22 µM−1 s−1, resulting from a maximum rate of incorporation (k_pol) of 0.86 ± 0.14 s−1 and an apparent dissociation constant (K_d,app) of 1.6 ± 0.6 µM (Fig. 2a). This apparent K_d reflects the K_m of a kinetic system not in equilibrium rather than the true K_d of the nucleotide substrate29. To achieve complete product turnover, this high apparent K_d can be overcome either by using increased concentrations of fluorescent nucleotide substrate or by allowing a longer incorporation time for the reaction to complete. Each path has an undesirable consequence: high cost in the first case and long cycle time in the second. Together, the use of avidity substrates and DNA polonies containing many copies of substrate DNA in close proximity overcomes the limitations of incorporating a monovalent dye-labeled nucleotide.
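
The relationship between the reported constants can be checked with a short calculation. The sketch below assumes that the specificity constant is the ratio k_pol/K_d,app and that the two uncertainties are uncorrelated and combine in quadrature; the text does not state how the reported error was derived, so this is an illustration rather than the authors' procedure. Function and variable names are illustrative.

```python
import math

# Values reported in the text (Fig. 2a); the names are illustrative.
K_POL = 0.86       # s^-1, maximum rate of incorporation
K_POL_ERR = 0.14   # s^-1
KD_APP = 1.6       # uM, apparent dissociation constant
KD_APP_ERR = 0.6   # uM


def observed_rate(conc_um, k_pol=K_POL, kd_app=KD_APP):
    """Hyperbolic dependence of the observed rate on nucleotide concentration."""
    return k_pol * conc_um / (kd_app + conc_um)


def specificity_constant(k_pol, k_pol_err, kd_app, kd_app_err):
    """Return k_pol / K_d,app and its uncertainty.

    Assumption: the two errors are uncorrelated, so relative errors add in quadrature.
    """
    value = k_pol / kd_app
    rel_err = math.sqrt((k_pol_err / k_pol) ** 2 + (kd_app_err / kd_app) ** 2)
    return value, value * rel_err


value, err = specificity_constant(K_POL, K_POL_ERR, KD_APP, KD_APP_ERR)
print(f"specificity constant = {value:.2f} +/- {err:.2f} uM^-1 s^-1")
print(f"observed rate at [S] = K_d,app: {observed_rate(KD_APP):.2f} s^-1")
```

Under these assumptions the ratio and its propagated error round to the reported 0.54 ± 0.22 µM−1 s−1, and the observed rate at a substrate concentration equal to K_d,app is half of k_pol, as expected for a hyperbolic dependence.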

Fig. 2: Nucleotide and avidite binding kinetics. a, Monovalent fluorophore-labeled nucleotide concentration dependence of the observed rate of incorporation. Time series were performed at each concentration and fit to a single exponential equation to derive a rate. Observed rates were plotted as a function of concentration and fit to a hyperbolic equation, deriving a value of k_pol = 0.86 ± 0.14 s−1 and K_d,app = 1.6 ± 0.6 µM. b,c, Real-time association kinetics of signal generation resulting from reacting multivalent avidite substrates (b) and monovalent nucleotides (c) with DNA polonies. d,e, Real-time measurement of signal decay following flow cell washing for imaging of multivalent avidite substrates (d) and monovalent nucleotides (e).

Using binding of the four labeled avidites for base identification established a binding equilibrium that reached saturation based on substrate concentration within 30 s to generate signal, rather than relying on catalysis. The binding kinetics of this interaction were monitored using real-time data collection to observe avidites binding to polonies with an association rate (k_on,avidite) of 271 ± 82 nM−1 s−1 (Fig. 2b). This association rate was within the limit of error of that observed for a single fluorescently labeled monovalent nucleotide (Fig. 2c). Major differences were observed in the dissociation kinetics of avidite substrates versus monovalent nucleotides. Avidite substrates bound to the DNA polonies tightly, with no measurable dissociation over the >1-min timescale needed for imaging and base calling (Fig. 2d). This is in sharp contrast to fluorescently labeled monovalent nucleotides, which dissociated rapidly during the wash step following binding and then continued to dissociate during imaging (Fig. 2e). The negligible dissociation rate resulted in a decrease in K_d of more than two orders of magnitude for avidites compared with monovalent nucleotides. With near-zero avidite dissociation rates, a persistent signal was achieved without the presence of free avidites in bulk solution, eliminating background. Without avidity, monovalent nucleotides showed a fourfold signal decrease at the beginning of imaging due to rapid dissociation, a result of disruption of the binding equilibrium during reagent exchange (Fig. 2e).
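
The link between a lower dissociation rate and a lower dissociation constant can be made explicit. The sketch below assumes a simple one-step binding model in which the dissociation constant equals the dissociation rate divided by the association rate; multivalent avidite binding is more complex than this, so the result should be read as an apparent value only. The rate values are hypothetical placeholders chosen for illustration and are not measurements from this work.

```python
def dissociation_constant(k_off, k_on):
    """Dissociation constant for a one-step binding model: K_d = k_off / k_on."""
    return k_off / k_on


def fold_decrease_in_kd(k_off_mono, k_on_mono, k_off_avidite, k_on_avidite):
    """How many times smaller the avidite K_d is than the monovalent K_d."""
    kd_mono = dissociation_constant(k_off_mono, k_on_mono)
    kd_avidite = dissociation_constant(k_off_avidite, k_on_avidite)
    return kd_mono / kd_avidite


# Hypothetical rates in arbitrary but matching units: the association rates
# are set equal, and the avidite dissociates 200 times more slowly.
fold = fold_decrease_in_kd(
    k_off_mono=1.0, k_on_mono=1.0,
    k_off_avidite=0.005, k_on_avidite=1.0,
)
print(f"K_d decreases about {fold:.0f}-fold")
```

When the association rates are comparable, as reported for avidites and monovalent nucleotides, any fold reduction in the dissociation rate carries over directly to the dissociation constant in this model.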

Sequencing instrumentation

Avidity sequencing was performed on the AVITI commercial sequencing system. Briefly, the instrument is a four-color optical system with two excitation lines of approximately 532 and 635 nm. The four-color system is created using an objective lens, multiple tube lenses and multiple cameras for simultaneous imaging of four spectrally separated colors. The detection channels for emission are centered at approximately 553, 596, 668 and 716 nm. Reagents are delivered using a selector valve and syringe pump to perform reagent cycling. The instrument contains two fluidics modules and a shared imaging module, enabling parallel utilization of two flowcells. Subsequent to image collection, data were streamed through an onboard processing unit that performs image registration, intensity extraction and correction, base calling and quality score assignment (Methods).

Accuracy of avidity sequencing

To evaluate the accuracy of avidity sequencing, 20 sequencing runs were performed using a well-characterized human genome. Sequencing data were used to train quality tables according to the methods of Ewing et al.30, but with modified predictors. Quality tables were then applied to independent sequencing runs. Figure 3 shows the data quality obtained in a representative run not used for training. Quality scores were well calibrated across the entire range, meaning that predicted quality matched observed quality as determined by alignment to a known reference. Combined over reads 1 and 2, 96.2% of base calls were >Q30 (an average of one error per 1,000 bp) and 85.4% >Q40, with a maximum of Q44, or approximately one error in 25,000 bases. For comparison, a publicly available PCR-free NextSeq 2000 dataset was downloaded from the Illumina public demo set repository (https://basespace.illumina.com/datacentral) and a publicly available NovaSeq 6000 dataset was downloaded from https://console.cloud.google.com/storage/browser/brain-genomics-public/research/sequencing/fastq. The NextSeq 2000 and NovaSeq 6000 datasets had 90.1% and 92.7% of data >Q30, respectively, and none of the base calls exceeded Q40.
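
The quality thresholds quoted above follow from the standard Phred definition, Q = −10 log10(p), where p is the probability that a base call is wrong. The following minimal sketch converts between the two scales; the function names are illustrative and are not part of any pipeline described here.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a Phred quality score Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score implied by an error probability p."""
    return -10 * math.log10(p)


# Report the expected spacing between errors for the scores named in the text.
for q in (30, 40, 44):
    p = phred_to_error_prob(q)
    print(f"Q{q}: about one error in {1 / p:,.0f} bases")
```

Because the scale is logarithmic, each increase of 10 in Q corresponds to a tenfold reduction in the expected error rate.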

Fig. 3: Predicted and observed quality scores for a 2 × 150-bp sequencing run of human genome HG002. a, Read 1 (R1). b, Read 2 (R2). Points on the diagonal indicate that predicted scores match observed scores. The histograms show that the majority of the data points are >Q40.

To obtain an additional measure of accuracy, we used the same datasets to compute the percentage of k-mers (k = 1, 2, 3) containing at least one mismatch after alignment to a well-characterized reference. Known SNP sites were masked before the comparison. When compared with NextSeq 2000 and NovaSeq 6000, we found that AVITI had the highest accuracy across four out of four 1-mers, 16 out of 16 2-mers and 58 out of 64 3-mers (Extended Data Fig. 2).
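
One way to compute such a k-mer mismatch metric is sketched below. It assumes gapless, equal-length read and reference strings, keys each k-mer by its reference sequence and represents masked SNP sites as "N" in the reference; these are illustrative choices, as the exact procedure is not specified in this section. The function name and input format are hypothetical.

```python
from collections import defaultdict


def kmer_mismatch_percent(pairs, k):
    """Percentage of k-mer occurrences containing at least one mismatch.

    pairs: iterable of (read, ref) strings of equal length (gapless alignment).
    Returns a dict keyed by the reference k-mer.
    Assumption: masked sites appear as "N" in the reference and are skipped.
    """
    totals = defaultdict(int)
    mismatched = defaultdict(int)
    for read, ref in pairs:
        if len(read) != len(ref):
            raise ValueError("read and reference must have equal length")
        for i in range(len(ref) - k + 1):
            ref_kmer = ref[i:i + k]
            if "N" in ref_kmer:
                continue  # window overlaps a masked site
            totals[ref_kmer] += 1
            if read[i:i + k] != ref_kmer:
                mismatched[ref_kmer] += 1
    return {kmer: 100.0 * mismatched[kmer] / totals[kmer] for kmer in totals}


# Toy input: the second read carries a single G>T mismatch at position 2.
toy_pairs = [("ACGT", "ACGT"), ("ACTT", "ACGT")]
toy_result = kmer_mismatch_percent(toy_pairs, k=2)
print(toy_result)
```

In the toy input, the single mismatch affects the two overlapping 2-mers that span it (CG and GT) and leaves AC untouched, which illustrates why one base error contributes to several k-mer counts when k > 1.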

Homopolymer sequencing

Sequencing through long homopolymers has posed challenges for multiple sequencing technologies31,32. Although SBS improves homopolymer sequencing relative to flow-based technologies, the error rates of reads that pass through long homopolymer regions increase substantially33. Correction algorithms have been proposed to circumvent the inherent challenges with base-calling post-homopolymer repeats34, but the exact cause has not been fully established in the literature. In contrast to SBS, avidity sequencing leverages rolling circle amplification, polymerases evolved to accommodate the avidite complex formation and a separate polymerase evolved for efficient incorporation of unlabeled and 3′ blocked nucleotides. We evaluated the impact of these differences on sequencing through long homopolymers. Specifically, homopolymers of length 12 or more nucleotides were used to assess the accuracy of reads before and after homopolymer regions. Figure 4 shows the results comparing avidity sequencing with SBS, averaged across the ~700,000 homopolymer loci of length 12 or more. Average error rate of avidity sequencing remained stable following a long homopolymer (controlling for the fact that post-homopolymer stretch occurs in later cycles of a read). By contrast, the error rate of SBS reads increased by more than a factor of five following homopolymer stretches. Extended Data Fig. 3 shows the histogram of pairwise error rate differences between avidity sequencing and SBS for all long homopolymer loci. The avidity sequencing error rate outperformed SBS in >97% of cases and the magnitude of difference is correlated with homopolymer length (Fig. 5). Extended Data Fig. 4 shows representative loci from the 95th, 50th and fifth percentiles of the histogram.
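
The before-and-after comparison can be illustrated with a small sketch that locates homopolymer runs in a reference sequence and computes mismatch rates in the flanking bases of an aligned read. It assumes gapless alignment and a fixed flank width of 10 bases; both are illustrative choices, and the function names are hypothetical rather than taken from the analysis used here.

```python
from itertools import groupby


def find_homopolymers(seq, min_len=12):
    """Return (start, end, base) for each run of one base with length >= min_len."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len and base != "N":
            runs.append((pos, pos + length, base))
        pos += length
    return runs


def mismatch_rate(read, ref):
    """Fraction of mismatched positions, or None for an empty window."""
    if not ref:
        return None
    return sum(a != b for a, b in zip(read, ref)) / len(ref)


def flank_mismatch_rates(read, ref, start, end, flank=10):
    """Mismatch rates in the windows just before and just after a homopolymer."""
    before = slice(max(0, start - flank), start)
    after = slice(end, end + flank)
    return (mismatch_rate(read[before], ref[before]),
            mismatch_rate(read[after], ref[after]))


# Toy reference with a 12-base A run, and a read with one error after the run.
toy_ref = "ACGT" + "A" * 12 + "CGTACGTACG"
toy_read = toy_ref[:17] + "T" + toy_ref[18:]
toy_runs = find_homopolymers(toy_ref)
start, end, _ = toy_runs[0]
toy_rates = flank_mismatch_rates(toy_read, toy_ref, start, end)
print(toy_runs, toy_rates)
```

Averaging the two flank rates over many loci would give a per-platform comparison of the kind summarized in the text, although the real analysis also has to account for cycle position within the read.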

Fig. 4: Post-homopolymer performance across platforms. Mismatch percentages of AVITI, NovaSeq 6000 and NextSeq 2000 reads before and after homopolymers of length 12 or greater.

Fig. 5: Comparison of mismatch rate following homopolymers of length between four and 29. Mismatch percentage difference between avidity sequencing and SBS increases with homopolymer length. The box plot shows median, quartiles and whiskers, which are 1.5× interquartile range.

Single-cell RNA-seq

To demonstrate sequencing performance across common applications, single-cell RNA expression libraries were prepared and sequenced. Two libraries from a reference standard consisting of human peripheral blood mononuclear cells were generated using the 10X Chromium instrument. The two libraries contain RNA from roughly 10,000 and 1,000 cells, respectively. Following circularization, the libraries were sequenced to generate paired-end reads with read lengths of 28 and 90 for reads 1 and 2, respectively, as recommended by the vendor. The analysis was done using CellRanger (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/installation). Because this reference standard is used by 10X Genomics to evaluate sequencing performance, a set of metrics and guidelines to assess sequencing results is provided along with the biological material. Extended Data Table 1 shows each metric, the guideline values from 10X Genomics and the performance of each sequenced library. All metrics were within the guide ranges, and metrics pertaining to sequencing quality exceeded the thresholds provided.

Whole-human-genome sequencing

Another common application is whole-human-genome sequencing. This application challenges sequencer accuracy to a greater extent than measurement of gene expression because the latter requires only accurate alignment while the former depends on nucleotide accuracy to resolve variant calls. To demonstrate performance for this application, the well-characterized human sample HG002 was prepared for sequencing using a Covaris shearing and PCR-free library preparation method and sequenced with 2 × 150-bp reads. The run generated 1.02 billion passing filter paired-end reads with a duplicate rate of 0.58% (0.11% classified as optical duplicates by Picard (https://broadinstitute.github.io/picard/)). To underscore the impact of low duplicates, we compared the number of input reads with genomic coverage (Extended Data Fig. 5).

A FASTQ file with the base calls and quality scores was downsampled to 35-fold coverage and used as an input into the DNAScope analysis pipeline from Sentieon. SNP and indel calls achieved F1 scores of 0.995 and 0.996, respectively. Extended Data Table 2 shows variant-calling performance for SNPs and small indels on the GIAB-HC regions. Sensitivity, precision and F1 scores are shown. The performance on SNPs and indels is comparable. Extended Data Fig. 6 shows the F1 score for SNPs and indels across all GIAB stratifications with at least 100 variants in the truth set.
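
For reference, the F1 score is the harmonic mean of precision and sensitivity (recall). The sketch below computes all three from true-positive, false-positive and false-negative counts; the counts are hypothetical and chosen only to illustrate the arithmetic, not taken from the variant calls reported here.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, sensitivity (recall) and F1 from variant-call counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for illustration only.
precision, recall, f1 = precision_recall_f1(tp=990, fp=5, fn=10)
print(f"precision={precision:.4f} sensitivity={recall:.4f} F1={f1:.4f}")
```

Because it is a harmonic mean, F1 is pulled toward the lower of the two inputs, so a high F1 requires both few false positives and few false negatives.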

Extensibility of avidity sequencing

To assess the extensibility of avidity chemistry we continued a sequencing run beyond 150 bp to generate a 1 × 300 dataset from an Escherichia coli library. To achieve this we used both an optimized polymerase and an optimized reagent formulation. Figure 6a shows quality scores as a function of sequencing cycle. Because quality scores were not trained to these lengths, the scores are approximate. Figure 6b shows the E. coli error rate as a function of cycle number based on alignment to the known reference strain. The error rate of the final cycle was 1.9% and that at cycle 150 was 0.1%. Error calculations were based on the vast majority of the data with a pass filter rate for the run of >99.6% and Burrows–Wheeler aligner (BWA) settings aimed at strongly discouraging soft clipping (no cycles with soft clipping >0.04%). The enzymes and formulations developed for this run will be leveraged as we continue to identify extensions and improvements.