# Quantifying molecular bias in DNA data storage

## Abstract

DNA has recently emerged as an attractive medium for archival data storage. Recent work has demonstrated proof-of-principle prototype systems; however, very uneven (biased) sequencing coverage has been reported, which indicates inefficiencies in the storage process. Deviations from the average coverage in the sequence copy distribution can either cause wasteful provisioning in sequencing or an excessive number of missing sequences. Here, we use millions of unique sequences from a DNA-based digital data archival system to study the oligonucleotide copy unevenness problem and show that the two paramount sources of bias are the synthesis and amplification (PCR) processes. Based on these findings, we develop a statistical model for each molecular process as well as the overall process. We further use our model to explore the trade-offs between synthesis bias, storage physical density, logical redundancy, and sequencing redundancy, providing insights for engineering efficient, robust DNA data storage systems.

## Introduction

Storing data in DNA is attractive due to its information density of petabytes of data per gram, and excellent durability1. Relative to other forms of molecular-level or atomic-level data storage, DNA is unique because of the ease of copying DNA (i.e., using PCR) and its eternal relevance (people will always be interested in sequencing DNA)2. High-throughput (HT) sequencing and synthesis technologies3,4 have evolved and made storing information in synthetic DNA an increasingly realistic alternative to traditional long-term storage methods5,6,7,8. However, the sequencing coverage (number of read counts of a unique sequence) of an oligonucleotide (henceforth referred to simply as “oligo”) was found to be very uneven, requiring modern error correction codes capable of handling sequence dropout7,8,9,10,11,12. Current methods typically require either trial-and-error of experimental protocols or brute-force use of hundreds to thousands of sequencing reads per sequence to capture underrepresented sequences. This inefficiency stems from lack of understanding about bias in oligo copy distribution, as well as how it changes as the oligos are manipulated in DNA data storage systems.

In more recent work, errors and bias were studied using sequencing data from DNA storage systems13. However, direct PCR and sequencing in a DNA storage system cannot distinguish bias created by DNA synthesis from bias caused by PCR and sequencing. As our first foray in separating bias effects stemming from DNA synthesis versus PCR, we tag an arbitrarily chosen DNA archival file with over 400,000 sequences using unique molecular identifiers (UMI), random barcodes to label each molecule14. UMI labeling allows us to decouple synthesis bias from PCR bias, and we find significant bias from DNA synthesis. To corroborate this finding, we order from Twist Bioscience a carefully designed ready-to-sequence pool with 1,536,168 sequences, each of which is unique and already contains the segments of DNA necessary for sequencing. This ready-to-sequence pool can be sequenced using an Illumina sequencer directly, with no need for intermediate PCR or DNA ligation steps required for sequencing library preparation. Thus we can quantify synthesis oligo distribution without any interference from molecular processes. To the best of our knowledge, this is the first time an oligo pool from array-based synthesis has been characterized in this way. We find that synthesis bias is highly related to spatial location of oligos on a synthesis chip.

After quantifying synthesis bias, we study PCR bias from two sources: guanine/cytosine (GC) content and PCR stochasticity. The GC content of individual sequences was previously found to affect PCR amplification efficiency in biological DNA15,16,17. In DNA storage, the GC content of each strand is determined by a data-to-DNA sequence encoder. We test GC bias using two different oligo pools: one pool is encoded to avoid all homopolymers (non-homopolymer pool); in contrast, the other is encoded without homopolymer avoidance steps (homopolymer pool). Even though these two encoding strategies lead to different GC distributions, we find no practically important association between GC content and PCR bias. Instead, we find that PCR stochasticity widens oligo copy distributions of our test DNA archival file and, based on our observations, seems to be a dominant factor in PCR bias. PCR is an exponential process, so small random variations early on in amplification can have a large impact on distribution18,19,20,21,22,23,24.

Based on these observations, we construct a computational model for predicting molecular bias in a DNA data storage system (Fig. 1). We observe strong association between the bias predicted from this model and from our experimental data. Furthermore, we use our model to investigate the tradeoffs between synthesis bias, physical redundancy for storing DNA (i.e., oligo copy number), logical redundancy (additional information to aid error correction and mitigate missing sequences), and sequencing redundancy (i.e., sequencing coverage). A system model can be very useful to determine the best parameters for a given DNA storage system.

## Results

### DNA synthesis is a prominent source of sequence bias

Determining the source of bias in DNA data storage, and more generally in arbitrary DNA pools, is complicated because synthesis bias and PCR bias are typically coupled. To decouple them, we applied UMIs, barcodes that individually identify each molecule of an initial pool, to an arbitrarily chosen DNA file with over 400,000 sequences (Fig. 2a and Supplementary Fig. 1). Synthetic DNA pools include multiple copies of each sequence, and UMI labeling ensures with high probability that each molecule will include a tag different from any other. The UMI-labeled oligos were sequenced, and the resulting reads were aligned to the file sequences in two ways. First, these reads were aligned to individual sequences in the file using Burrows-Wheeler Aligner (BWA)25, independently of UMI, and their respective counts (coverage) are reported in Fig. 2c. Second, the same reads were aligned to sequences in the file, then further filtered by UMI label (Fig. 2b), and finally reported in Fig. 2d. The UMI-filtered results are a proxy for the oligo distribution after DNA synthesis, and the copy number is clearly variable, indicating that the original synthesis process produced a skewed copy distribution (there have since been process improvements, discussed in the next section). This distribution is also very similar in shape to the distribution after PCR, indicating that PCR does not significantly increase bias overall. Nevertheless, PCR still has an impact on individual sequence counts, so we decided to study the amplification ratio of each sequence as a function of the number of initial molecules representing it. We define the amplification ratio to be the ratio of total reads after PCR to UMI count (i.e., oligo count before PCR) for each sequence. Figure 2e shows that regardless of the initial oligo copy number, the average amplification ratio remains constant.
On the other hand, the amplification ratio was observed to have high variation when oligos had very low copy numbers, indicating that the amplification ratio was affected by stochastic effects at these low copy numbers. Indeed, since a PCR process is composed of successive rounds of binomially distributed copying (each molecule has some probability of being copied), we would expect the standard deviation (s.d.) of the amplification ratio to be inversely proportional to the square root of the initial number of strands. Additionally, since observation takes a sequencing reaction (another binomial process) we would expect a constant amount of added deviation. These observations lead us to the model:

$$\sigma _\alpha = \frac{a}{{\sqrt {{\mathrm{UMI}}\;{\mathrm{count}}} }} + b$$
(1)

where $$\sigma _\alpha$$ is the s.d. of the amplification ratio, and a and b are constants. Our experimental data were fitted using Eq. (1), and the fit is shown in Fig. 2f.
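Because Eq. (1) is linear in a and b once $$1/\sqrt{\mathrm{UMI\;count}}$$ is used as the regressor, it can be fitted by ordinary least squares. The following is a minimal sketch under that assumption; the fitting procedure actually used for Fig. 2f is not specified here, the function name `fit_sd_model` is hypothetical, and the input values are synthetic rather than experimental.

```python
import numpy as np

def fit_sd_model(umi_counts, sd_alpha):
    """Least-squares fit of sd_alpha = a / sqrt(umi_count) + b (Eq. 1)."""
    umi_counts = np.asarray(umi_counts, dtype=float)
    sd_alpha = np.asarray(sd_alpha, dtype=float)
    # The model is linear in (a, b) with regressors 1/sqrt(UMI count) and 1.
    design = np.column_stack([1.0 / np.sqrt(umi_counts),
                              np.ones_like(umi_counts)])
    (a, b), *_ = np.linalg.lstsq(design, sd_alpha, rcond=None)
    return a, b

# Synthetic, noise-free illustration (made-up constants, not data from the study).
umi = np.arange(1, 51)
sd = 2.0 / np.sqrt(umi) + 0.3
a_hat, b_hat = fit_sd_model(umi, sd)
```

With real data, the s.d. of the amplification ratio would first be estimated within groups of sequences sharing the same UMI count, and those grouped estimates would be passed as `sd_alpha`.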

### Bias is related to the spatial location on the synthesis chip

To further understand the synthesis bias, we ordered a carefully designed ready-to-sequence pool with 1,536,168 unique DNA sequences from Twist Bioscience. Oligos in this pool already contain universal Illumina adapters and Illumina sequencing primers on both the 5′ and 3′ ends, allowing us to sequence it without any sequencing library preparation such as PCR or ligation. By mapping the sequencing reads of each sequence back to its corresponding location on the synthesis chip, a distinct pattern can be observed (Fig. 3a), indicating that synthesis bias was related to the spatial location on the synthesis chip. After further discussion with Twist Bioscience, their synthesis process was improved, and the oligo counts on the synthesis chip became much more even (Fig. 3c). Interestingly, the oligo distribution before the synthesis process improvement did not follow a normal distribution, but the oligo distribution using the improved synthesis process is now well fitted to a normal distribution (Fig. 3b, d).

Oligo synthesis quality lives within a select set of parameters broadly defined as dosage, where we define dosage as: {time, temperature, and concentration}. The boundaries of this quality window define a dosage tolerance relevant to the particular application, in this case oligonucleotides used for data storage. Initial data used in this paper were produced with a process that allowed process excursions outside of the dosage tolerance window. To address these spatial gradients in quality, Twist Bioscience increased the dosage tolerance window with proprietary chemical modifications to the phosphoramidite chemistry, making the additive synthesis process less susceptible to process dosage excursions. Twist Bioscience also introduced engineering changes to the hardware and chemical process parameters to ensure more uniform evacuation of chemical reagents in the flowcell process with enhanced temporal control. The combination of these changes has resulted in dramatically decreased error rates and more robust molecules in subsequent processes.

### Population fraction change for quantifying PCR bias

We now turn to studying the PCR bias by creating metrics to quantify it at the sequence level. We begin by defining the population fraction of a sequence i after $$k \in {\Bbb Z}_{ \ge 0}$$ cycles of PCR as

$$x_i^{(k)}: = \frac{{N_i^{\left( k \right)}}}{{{\sum }_j N_j^{(k)}}}$$
(2)

where $$N_i^{\left( k \right)}$$ is the number of reads of sequence i after k PCR cycles. Here j is across all sequences. We then define the population fraction change for sequence i to be

$$Q_i = Q_i^{(k)}: = \frac{{x_i^{\left( k \right)}}}{{x_i^{\left( 0 \right)}}}.$$
(3)

We consider a PCR process to be unbiased when $${\Bbb E}[ {Q_i{\mathrm{|}}x_i^{( 0 )} \,> \,0} ] = 1$$ for all sequences, that is, no sequence becomes over or underrepresented after a PCR, whereas we consider a PCR process to be biased when $${\Bbb E}[ {Q_i{\mathrm{|}}x_i^{( 0 )} \,> \, 0} ]\, \ne \,1$$ for any sequence i. We then can say that experiments with higher standard deviation over the population fraction change, $${\mathrm{{s.d.}}} [ {\mathbf{{Q}}} :=\{ {Q_i} | {x_{i} ^{(0)}}\, > \, 0\}]$$, show more bias when all other conditions are equivalent. It is worth noting that even an unbiased process will have $${\mathrm{s}}.{\mathrm{d}}.\left[ {\mathbf{Q}} \right]\, > \, 0$$ for finite sample sizes. Furthermore, $${\mathrm{s}}.{\mathrm{d}}.[{\mathbf{Q}}]$$ should asymptotically decrease with the total number of reads.
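Equations (2) and (3) translate directly into a few array operations. The sketch below is illustrative only: the function names are hypothetical and the read counts are made-up numbers, chosen so that one sequence is absent before PCR and is therefore excluded from Q, as the conditioning on $$x_i^{(0)} > 0$$ requires.

```python
import numpy as np

def population_fraction(counts):
    """Eq. (2): read count of each sequence divided by the total read count."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def population_fraction_change(counts_before, counts_after):
    """Eq. (3): Q_i = x_i^(k) / x_i^(0), restricted to sequences with x_i^(0) > 0."""
    x0 = population_fraction(counts_before)
    xk = population_fraction(counts_after)
    observed = x0 > 0  # Q is undefined for sequences unseen before PCR
    return xk[observed] / x0[observed]

# Made-up counts for four sequences; the third is absent before PCR.
before = [10, 20, 0, 70]
after = [100, 400, 5, 495]
q = population_fraction_change(before, after)
sd_q = q.std()  # the bias metric s.d.[Q]
```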

### PCR bias is not correlated with GC content

Although previous studies observed PCR bias in genomic biological sample amplification15,16,17, it remained unclear whether such bias is significant in DNA data storage. To assess this, we used the 1.5 million-sequence ready-to-sequence pool and compared its distribution before PCR and after PCR. The ready-to-sequence pool was sequenced in two ways: (1) directly from the synthesized pool and (2) after one 6-cycle plus five 5-cycle PCR processes, for a total of 31 cycles. Each PCR process was limited to no more than 6 cycles to prevent resource exhaustion (i.e., there was always an excess of primer and other reagents). Sequencing data (Fig. 4a) shows qualitatively little change in the coefficient of variation (c.v.) of oligo copy distribution before and after PCR (0.41 and 0.45, respectively, when both are subsampled to 20× coverage). The two datasets were then compared at a sequence level by observing population fraction changes with respect to the overall available pre-PCR pool coverage, 60× (Fig. 4b). The distribution before PCR shows the effect of subsampling on population fraction, and the distribution after PCR shows the effect of PCR itself. The latter showed much higher standard deviation. The standard deviations of population fraction changes were 0.24 and 0.37 before PCR and after PCR, respectively, and these two numbers were statistically different (p < 0.005, computed by bootstrapping n = 1000). This indicates that PCR increased bias relative to a random sampling process.
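The text states only that the p-value was obtained by bootstrapping with n = 1000; the resampling scheme itself is not described. One plausible scheme, shown here purely as an assumption, resamples each set of population fraction changes with replacement and counts how often the post-PCR standard deviation fails to exceed the pre-PCR one. The function name and the synthetic inputs (normal samples with s.d. 0.24 and 0.37) are hypothetical.

```python
import numpy as np

def bootstrap_sd_difference(q_before, q_after, n_boot=1000, seed=0):
    """Bootstrap the difference s.d.[q_after] - s.d.[q_before].

    Returns the mean bootstrap difference and a one-sided p-value estimate
    for the hypothesis that q_after is not more spread out than q_before.
    """
    rng = np.random.default_rng(seed)
    q_before = np.asarray(q_before, dtype=float)
    q_after = np.asarray(q_after, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each data set with replacement, keeping its original size.
        res_before = rng.choice(q_before, size=q_before.size, replace=True)
        res_after = rng.choice(q_after, size=q_after.size, replace=True)
        diffs[i] = res_after.std() - res_before.std()
    # Add-one correction so the estimate is never exactly zero.
    p_value = (np.sum(diffs <= 0) + 1) / (n_boot + 1)
    return diffs.mean(), p_value

# Synthetic stand-ins for Q before and after PCR (not experimental data).
rng = np.random.default_rng(42)
q_pre = rng.normal(1.0, 0.24, size=5000)
q_post = rng.normal(1.0, 0.37, size=5000)
mean_diff, p = bootstrap_sd_difference(q_pre, q_post)
```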

We then asked whether population fraction changes were caused by GC content. We first examined the ready-to-sequence pool, which was encoded to avoid homopolymers6 (Fig. 4c). Although the association between population fraction changes and GC content of this pool (between 40 and 60%) was found to be statistically significant (P value < 0.05), the association between the two was very small and practically unimportant (the slope of the linear fit was <0.01). Additionally, we tested another 9 different DNA archival files with a total of 1,358,998 unique sequences that allow random homopolymers (Fig. 4d; Supplementary Fig. 2 shows experimental workflow details). These homopolymer files had a wider range of GC content from 25 to 75%, but the association between GC content and the population fraction changes was still very small and not practically important (the slope of the linear fit was <0.01). The negligible bias impact from GC content in our experimental data was likely because these oligos were relatively short (150-nt), and the use of KAPA HIFI polymerase also reduced the impact of GC bias26. Having established that GC content was not the main effect being observed, we turned to hypothesizing that PCR stochasticity was the culprit.
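The GC analysis amounts to computing the GC content of each sequence and regressing the population fraction change on it. The sketch below assumes GC content is expressed in percent and that an ordinary least-squares line is used; the units of the slope reported above are not stated, so both choices are assumptions, and the sequences and Q values are toy inputs.

```python
import numpy as np

def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_slope(sequences, q_values):
    """Slope and intercept of a least-squares line of Q against GC percent."""
    gc_percent = np.array([100.0 * gc_fraction(s) for s in sequences])
    slope, intercept = np.polyfit(gc_percent,
                                  np.asarray(q_values, dtype=float), 1)
    return slope, intercept

# Toy example: five 4-nt sequences with GC content 0, 25, 50, 75, and 100%.
seqs = ["AATT", "AGTT", "AGCT", "GGCT", "GGCC"]
q_vals = [0.90, 0.95, 1.00, 1.05, 1.10]
slope, intercept = gc_slope(seqs, q_vals)
```

A slope close to zero over the observed GC range, as reported for both pools, means that GC content explains very little of the variation in Q even when the association is statistically detectable.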

### PCR stochasticity can lead to significant bias

Because PCR is not perfect (i.e., replication of an individual molecule has a probability of less than one), even small random divergence in early phases of amplification can create significant bias, which is known as PCR stochastic bias. We have shown that PCR bias is related to oligo copy number in the UMI quantification experiment, especially for sequences with low copy numbers in the initial pool (from a previous PCR process or from a biased synthesis pool). Now we want to understand better how PCR stochastic bias affects our DNA storage system.

To quantify PCR stochastic bias, we used an arbitrarily chosen DNA pool with 7,373 sequences to perform a serial dilution-PCR experiment (Fig. 5a). The master pool was diluted to different average copy numbers ranging from 8 to 113 (the copy numbers were quantified using qPCR). Then each sample was amplified with 18 cycles of PCR using primers with Illumina sequencing primer overhangs. Subsequently, a second step of PCR was carried out to include the Illumina adapters where we adjusted the number of cycles to equalize the final library concentration (Supplementary Fig. 3 shows workflow details). The second PCR was carried out at high copy number of the templates (over a million oligo copies per sequence) to avoid introducing additional bias. Our experimental results show that as average copy number decreased, oligo distribution skewed further away from its mean (Fig. 5c). We plot average copy number in a pre-PCR mix against the coefficient of variation (c.v.) of sequencing coverage (Fig. 5d) and standard deviation of population fraction change Q (Supplementary Fig. 4). Both plots show that the lower oligo copy numbers were, the greater the PCR stochastic bias was.

### A computational model can predict molecular bias

After characterizing the bias caused by synthesis and PCR sequencing retrieval, we construct a DNA storage model that encompasses the entire workflow of DNA storage, starting from synthesis → aliquot into pre-PCR reaction → PCR amplification with k cycles → sequencing with mean $$\bar n_r$$ reads (Fig. 5b). We model the oligo copy distribution of synthesis as a normal distribution with total number of sequences Nseq, mean copy number per sequence $$\bar n_{syn}$$, and standard deviation of oligo copy number σ. The PCR process is modeled as a stochastic branching process using the following recursive equation:

$$n_{j + 1} = n_j + B( {n_j,\;P} )$$
(4)

where nj is the number of molecules in the j-th cycle; B(nj, P) is a binomially distributed random variable with nj molecules, and P is the probability of a successful amplification. Illumina sequencing was previously observed to have bias on GC-extreme sequences15,27,28, but GC content in our files did not show practically significant bias in the PCR GC bias test. Therefore, high-throughput sequencing and sample dilution are modeled using random sampling. Note that for performance reasons our model does not perform stochastic simulation for high copy number PCR. PCR carried out at high copy number of templates should obey the law of mass action and therefore be effectively deterministic.
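Equation (4) can be simulated directly with a binomial draw per cycle. The following sketch is not the released analysis code; the function name, the efficiency value of 0.8, and the starting copy numbers are illustrative assumptions. It shows how the same process yields a much wider relative spread when each sequence starts from only a few molecules.

```python
import numpy as np

def simulate_pcr(initial_copies, cycles, efficiency, rng):
    """Stochastic branching PCR: n_{j+1} = n_j + B(n_j, P), per sequence."""
    n = np.asarray(initial_copies, dtype=np.int64).copy()
    for _ in range(cycles):
        # Each existing molecule is copied with probability `efficiency`.
        n = n + rng.binomial(n, efficiency)
    return n

rng = np.random.default_rng(0)
# 2000 sequences starting from 2 copies each, versus 200 copies each.
low = simulate_pcr(np.full(2000, 2), cycles=10, efficiency=0.8, rng=rng)
high = simulate_pcr(np.full(2000, 200), cycles=10, efficiency=0.8, rng=rng)
cv_low = low.std() / low.mean()
cv_high = high.std() / high.mean()
```

With an efficiency of exactly 1 the draw is deterministic and every molecule doubles each cycle, which is a convenient sanity check for the recursion.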

We then interrogated our computational model to determine whether it can estimate the bias observed in the serial dilution-PCR experiment. Despite not being able to observe the oligo population directly after synthesis, our UMI experiment (Fig. 2) has provided evidence that its population distribution is quite similar to the distribution resulting from a PCR process that starts from a large average copy count sample coming from that synthesized pool. As such, the copy distribution of a synthesis pool is modeled as a normal distribution with the same c.v. as the experimental data from the (optimized) ready-to-sequence pool. Then we used our system model to simulate the dilution-PCR experiment. Figure 5d shows that our model prediction is in good agreement with the c.v. of the experimental data ($$R^2$$ = 0.71). The model also predicted the trend of standard deviation of population fraction change Q: the lower starting copy number in the PCR showed higher standard deviation ($$R^2$$ = 0.84; Supplementary Fig. 4).
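The workflow described above (synthesis, dilution, PCR, sequencing) can be sketched end to end in a few lines. This is a simplified illustration under stated assumptions, not the released model: dilution is modeled as independent binomial thinning, sequencing as a multinomial draw, PCR is always simulated stochastically (the model described above skips stochastic simulation at high copy number), and all parameter values (synthesis c.v. of 0.3, efficiency of 0.8, 1000 synthesized copies per sequence) are placeholders.

```python
import numpy as np

def simulate_pipeline(n_seq, mean_syn, cv_syn, mean_stored,
                      cycles, efficiency, mean_reads, seed=0):
    """Per-sequence read counts after synthesis, dilution, PCR, and sequencing."""
    rng = np.random.default_rng(seed)
    # Synthesis: normally distributed copy numbers, rounded and clipped at zero.
    syn = rng.normal(mean_syn, cv_syn * mean_syn, size=n_seq)
    syn = np.clip(np.rint(syn), 0, None).astype(np.int64)
    # Dilution: every molecule is kept independently with the same probability.
    keep = min(1.0, mean_stored / mean_syn)
    stored = rng.binomial(syn, keep)
    # PCR: stochastic branching process, n_{j+1} = n_j + B(n_j, P).
    amplified = stored.copy()
    for _ in range(cycles):
        amplified = amplified + rng.binomial(amplified, efficiency)
    if amplified.sum() == 0:
        raise ValueError("no molecules survived dilution")
    # Sequencing: random sampling of a fixed total number of reads.
    total_reads = int(round(n_seq * mean_reads))
    return rng.multinomial(total_reads, amplified / amplified.sum())

def coefficient_of_variation(counts):
    counts = np.asarray(counts, dtype=float)
    return counts.std() / counts.mean()

def dropout_rate(counts):
    """Fraction of sequences that received zero reads."""
    return float(np.mean(np.asarray(counts) == 0))

# Compare a deeply diluted sample (5 copies/sequence) with a mild one (100).
reads_low = simulate_pipeline(5000, 1000, 0.3, 5, 18, 0.8, 20, seed=1)
reads_high = simulate_pipeline(5000, 1000, 0.3, 100, 18, 0.8, 20, seed=1)
cv_low = coefficient_of_variation(reads_low)
cv_high = coefficient_of_variation(reads_high)
```

Sweeping `mean_stored` over the diluted copy numbers and plotting the resulting c.v. gives a curve of the same kind as the model prediction compared against experiment in Fig. 5d.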

### A computational model can help determine system parameters

Taking it one step further, we used our computational model to study a range of parameters associated with DNA storage: synthesis bias, physical redundancy for storing DNA, logical redundancy, and sequencing redundancy (Fig. 6a). In particular, we investigated the impact of these parameters on sequence dropout rate, which is critical for error-free decoding. Figure 6b plots sequence dropout rates as a function of the c.v. of a synthesis pool and sequencing reads. It shows that a biased synthesis pool (i.e., high c.v.) is the dominant factor in sequence dropout and cannot be proportionally compensated by additional sequencing reads. Sequence dropout is caused by physical storage with a limited number of oligo copies coupled with PCR stochastic bias. Figure 6c plots sequence dropout rates as a function of the copy number of stored DNA and sequencing reads. It shows that physical storage density is a more important factor than sequencing reads in modulating sequence dropout. Interestingly, our model estimates that it is possible to store as few as 10 copies per oligo sequence (physical density of 9.3 EB per g; EB: exabytes, $$10^{18}$$ bytes), while achieving less than 2% sequence dropout. This estimated physical density is over 10-fold higher than prior work by Erlich and Zielinski10 and is aligned with what we have recently observed in practice29. The next important question is how much logical redundancy is needed to handle missing sequences. Taking the Reed-Solomon code in our previous work8 as an example, the maximum fraction of missing strands that can be tolerated is $$\frac{R}{{100\, +\, R}}$$, where R is the percentage of logical redundancy. Figure 6d shows a simulation of how oligo copy number affects the recovered oligo percentage (100% minus oligo dropout) and the required logical redundancy to recover the data.
Interestingly, at the low end, a modest increase in logical redundancy allows for a significant decrease in the required oligo copy number and enables an almost proportional increase in physical density. For example, at 30 copies, the required logical redundancy for data recovery is 3% whereas at 10 copies the logical redundancy grows to only 8%, nearly tripling physical density. It is worth pointing out the example here ignores all other errors such as insertions, deletions, and substitutions. These errors depend on synthesis and sequencing technologies, and they should be taken into account when determining the proper logical redundancy. Finally, we give an example in Supplementary Figure 5 to show how our system model can be used to optimize two important parameters in a given DNA storage system: physical redundancy (determining physical density) and sequencing redundancy (determining sequencing cost).
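The redundancy arithmetic in this paragraph can be checked with a short calculation. The sketch treats R/(100 + R) as a fraction of strands and assumes, for illustration only, that physical density scales inversely with the number of stored copies times the total strand count including redundancy; the function names are hypothetical.

```python
def max_tolerable_dropout(redundancy_percent):
    """Largest fraction of missing strands tolerated with R% logical redundancy."""
    return redundancy_percent / (100.0 + redundancy_percent)

def required_redundancy(dropout_fraction):
    """Inverse relation: redundancy R (in %) needed to tolerate a given dropout."""
    return 100.0 * dropout_fraction / (1.0 - dropout_fraction)

def density_gain(copies_a, redundancy_a, copies_b, redundancy_b):
    """Density of configuration b relative to a, assuming density is
    proportional to 1 / (copies * (100 + R))."""
    return (copies_a * (100.0 + redundancy_a)) / (copies_b * (100.0 + redundancy_b))

# Example from the text: 30 copies with 3% redundancy vs 10 copies with 8%.
gain = density_gain(30, 3, 10, 8)
tolerated_at_8 = max_tolerable_dropout(8)
```

Under these assumptions the gain evaluates to 3090/1080, slightly below 3, which is consistent with the "nearly tripling" statement above.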

## Discussion

In this work, we quantified molecular bias in a DNA storage system, and we identified two significant bias sources: synthesis bias and PCR stochastic bias. Synthesis bias was found to be related to the spatial location on the synthesis chip, and this observation was later used to inform and improve the synthesis process. PCR stochastic bias was identified as the second main driver of oligo copy variation. Indeed, prior work also found that PCR copy data from a deeply diluted oligo pool resulted in dramatic bias, which is less suitable for data recovery10.

Another important contribution of this manuscript is the construction of the first process-wide model that provides a quantitative understanding of how oligo copy distribution is skewed as it goes through a DNA storage system. Importantly, such a system model helps researchers rationally optimize the use of DNA physical density, logical redundancy, and sequencing redundancy for reliable data decoding without conducting hundreds of experimental trials. We believe this is an important step towards engineering robust, efficient DNA storage systems.

In this study, we found that oligos from unbiased synthesis and sequencing processes can be well modeled as a normal distribution and random sampling, respectively. While the experiments were performed using Twist Bioscience synthesis and Illumina sequencing, the proposed system model can in principle be applied to other synthesis and sequencing technologies. It is worth noting that when applying our model to other technologies, additional quantification and modeling are likely needed. For example, Ion Torrent and Oxford Nanopore sequencing show limited ability to accurately sequence long homopolymers, which is less significant in Illumina sequencing. Different array-based synthesis technologies could also have their own unique bias caused by processes specific to them, such as uneven fluidic operation, surface treatment, and other factors. Our system model was experimentally tested by PCR-amplifying a single file without any other non-targeted files in a pool. This experiment was designed to avoid complexity from other files for proper quantification of the impact of PCR stochastic bias. Next, we plan to investigate whether PCR random access of a file from a complex pool with additional files will lead to more bias. We suspect that amplifying a very small file from a complex pool with a relatively large number of sequences will exhibit more copy number variation due to non-specific binding of primers. New methods will probably be needed for such a system.

## Methods

### Reagents

All DNA pools were synthesized by Twist Bioscience (San Francisco, CA). All DNA pools were resuspended to 10 ng per µL in 1× TE buffer (pH 7.5). All primers were purchased as desalted, unpurified DNA from Integrated DNA Technologies (IDT; Coralville, IA). All primers were resuspended to 100 μM in 1× TE buffer (pH 7.5). KAPA HIFI polymerase was purchased from Kapa Biosystems. T4 ligase and T4 Polynucleotide Kinase (T4 PNK) were purchased from New England Biolabs.

### PCR protocol

In a 20 µL PCR reaction, 1 µL of 1 ng per µL of ssDNA pool was mixed with 1 µL of 10 μM of the forward primer, 1 µL of 10 μM of the reverse primer, 10 µL of 2× KAPA HIFI enzyme mix, and 7 µL of molecular biology grade water. The reaction followed a thermal protocol: (1) 95 °C for 3 min, (2) 98 °C for 20 s, (3) 62 °C for 20 s, (4) 72 °C for 15 s. After PCR, the length of the PCR products was confirmed using a Qiaxcel fragment analyzer, and the sample concentration was measured using a Qubit 3.0 fluorometer. For primer sequences, see Supplementary Table 2.

### Sample preparation for sequencing

Before sequencing, the concentrations of all samples were quantified using qPCR. The final sample was then prepared for sequencing by following the NextSeq System Denature and Dilute Libraries Guide. The final concentration of the loaded sample for our Illumina NextSeq is 1.3 pM, and a 10–20% PhiX was spiked in as a control (PhiX is a genomic DNA sample provided by Illumina).

### Protocols of UMI labeling

The general workflow for UMI labeling of a single-stranded DNA pool is divided into 5 steps (Supplementary Fig. 1; for sequences, see Supplementary Table 1): (1) phosphorylation of a ssDNA pool and Illumina P7 adapters, (2) assembly of a ssDNA pool with Illumina adapters with DNA staples by heat annealing, (3) ligation of Illumina adapters to the ssDNA pool, (4) extraction of the ligated sample using denaturing polyacrylamide gel electrophoresis (D-PAGE), and finally (5) PCR enrichment of the full length product.

The phosphorylation of ssDNA was performed using the following recipe: 5 pmole of the single-stranded DNA pool, 20 units of T4 Polynucleotide Kinase (T4 PNK), 1 µL of 10× T4 ligase buffer and 1 µL of 10× T4 PNK buffer were mixed in a 10 µL total volume reaction. 500 pmole of single-stranded Illumina P7 adapter, 200 units of T4 PNK, 5 µL of 10× T4 ligase buffer and 5 µL of 10× T4 PNK buffer were mixed in a 50 µL total volume reaction. The mixtures were incubated for 30 min at 37 °C.

The assembly of the single-stranded DNA pool with adapters was performed with the following recipe: In a 25 µL reaction, 15 pmole of single-stranded DNA pool, 30 pmole of DNA staples and 45 pmole of Illumina P5 and P7 adapters were mixed. The mixture was heated up to 95 °C for 2 min, and then cooled down to 25 °C at a rate of 1 degree per minute.

Ligation of DNA was performed with a 15 µL reaction in which 10 µL of the assembled DNA mixture, 2 µL of the T4 ligase (10 units per µL), 1.5 µL of T4 ligase buffer and 1.5 µL of molecular water were mixed. The ligation mixture was incubated at room temperature for 30 min, followed by heat inactivation at 65 °C for 10 min.

A 10% D-PAGE gel was made by mixing 2.5 mL of 19:1 40% acrylamide/bis, 1.2 mL of 10× TBE, 5.04 g of urea and deionized water to 12 mL. Then 72 µL APS and 4.8 µL of TEMED were added to help polymerization. DNA sample was mixed with 2× TBE/Urea denaturing loading buffer (Bio-Rad). Gels were run at 200 V for 55 min at 55 °C. The extracted band was incubated with 1× TE buffer overnight at room temperature for elution.

The eluted single-stranded DNA was PCR-amplified using the end primers of Illumina adapters. The PCR reaction used 1 µL of the eluted single-stranded DNA, 10 pmole of the forward and reverse primers, 10 µL of 2× KAPA HIFI polymerase and 8 µL of molecular water. The thermal protocol is as follows: (1) 95 °C for 3 min, (2) 98 °C for 20 s, (3) 60 °C for 20 s, (4) 72 °C for 15 s.

### Sequence alignment using Burrows-Wheeler Aligner (BWA)

We used BWA to align reads from a sequencer against our expected, short reference sequences. We then used the alignment counts for each reference sequence produced by BWA to generate distribution plots.

### Density histogram plots

The y-axis of a density histogram shows probability density, and the area (or integral) under the histogram is 1. The probability density di is calculated by dividing the count by the sample size times its bin width (see the following equation).

$$d_i = \frac{{N_i}}{{\left( {{\sum }_j N_j} \right) \ast W_i}}$$
(5)

where Ni is the count of the i-th bar, and Wi is the bin width of the i-th bar. Displaying the y-axis as probability density makes it possible to compare distributions. In Fig. 5c, a Gaussian estimated curve is added to help visualize each histogram.
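Equation (5) is the same normalization that NumPy applies when a histogram is requested as a density. The sketch below computes it explicitly so that the role of the bin width is visible; the function name and the example values are illustrative, and bins of unequal width are used on purpose.

```python
import numpy as np

def density_histogram(values, bin_edges):
    """Eq. (5): d_i = N_i / (sum_j N_j * W_i) for each bin i."""
    counts, edges = np.histogram(values, bins=bin_edges)
    widths = np.diff(edges)  # W_i, the width of each bin
    density = counts / (counts.sum() * widths)
    return density, edges

# Made-up coverage values and deliberately unequal bin widths.
coverage = np.array([1, 2, 2, 3, 5, 8, 8, 9, 12, 15])
edges = np.array([0, 2, 4, 8, 16])
density, _ = density_histogram(coverage, edges)
area = np.sum(density * np.diff(edges))  # integral under the histogram
```

Because each bar's height is divided by its own width, the total area equals 1 regardless of how the bins are chosen, which is what allows distributions with different read totals to be overlaid.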

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

## Data availability

The sequencing coverage data underlying Figs. 2c–f, 3b, d, 4a, b, and 5 are available in the Source Data file and are on GitHub at this URL: https://github.com/uwmisl/storage-biasing-ncomms20. Any additional data will be made available upon reasonable request. Source data are provided with this paper.

## Code availability

The analysis code that supports the findings of this study is available on GitHub at this URL: https://github.com/uwmisl/storage-biasing-ncomms20. The code simulates the whole molecular process and was used to demonstrate how synthesis bias, physical redundancy, and sequencing redundancy affect the sequence dropout rate.

## References

1. Zhirnov, V., Zadegan, R. M., Sandhu, G. S., Church, G. M. & Hughes, W. L. Nucleic acid memory. Nat. Mater. 15, 366–370 (2016).
2. Ceze, L., Nivala, J. & Strauss, K. Molecular digital data storage using DNA. Nat. Rev. Genet. 20, 456–466 (2019).
3. Cox, J. P. Long-term data storage in DNA. Trends Biotechnol. 19, 247–250 (2001).
4. Kosuri, S. & Church, G. M. Large-scale de novo DNA synthesis: technologies and applications. Nat. Methods 11, 499–507 (2014).
5. Church, G. M., Gao, Y. & Kosuri, S. Next-Generation Digital Information Storage in DNA. Science 337, 1628–1628 (2012).
6. Goldman, N. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013).
7. Grass, R. N., Heckel, R., Puddu, M., Paunescu, D. & Stark, W. J. Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angew. Chem. Int. Ed. 54, 2552–2555 (2015).
8. Organick, L. et al. Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242–248 (2018).
9. Yazdi, S. M. H. T., Yuan, Y., Ma, J., Zhao, H. & Milenkovic, O. A rewritable, random-access DNA-based storage system. Sci. Rep. 5, 14138 (2015).
10. Erlich, Y. & Zielinski, D. DNA Fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017).
11. Bornholt, J. et al. A DNA-based archival storage system. ACM SIGOPS Oper. Syst. Rev. 50, 637–649 (2016).
12. Yazdi, S. M. H. T., Gabrys, R. & Milenkovic, O. Portable and error-free DNA-based data storage. Sci. Rep. 7, 5011 (2017).
13. Heckel, R., Mikutis, G. & Grass, R. N. A characterization of the DNA data storage channel. Sci. Rep. 9, 9663 (2019).
14. Kivioja, T. et al. Counting absolute numbers of molecules using unique molecular identifiers. Nat. Methods 9, 72–74 (2012).
15. Ross, M. G. et al. Characterizing and measuring bias in sequence data. Genome Biol. 14, R51 (2013).
16. Dabney, J. & Meyer, M. Length and GC-biases during sequencing library amplification: a comparison of various polymerase-buffer systems with ancient and modern DNA sequencing libraries. Biotechniques 52, 87–94 (2012).
17. Aird, D. et al. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 12, R18 (2011).
18. Jagers, P. & Klebaner, F. Random variation and concentration effects in PCR. J. Theor. Biol. 224, 299–304 (2003).
19. Stolovitzky, G. & Cecchi, G. Efficiency of DNA replication in the polymerase chain reaction. Proc. Natl Acad. Sci. USA 93, 12947–12952 (1996).
20. Hassibi, A., Kakavand, H. & Lee, T. A stochastic model and simulation algorithm for polymerase chain reaction (PCR) systems. In Proc. of IEEE Workshop on Genomics Signal Processing and Statistics (IEEE, 2004).
21. Piau, D. Confidence intervals for nonhomogeneous branching processes and polymerase chain reactions. Ann. Probab. 33, 674–702 (2005).
22. Lalam, N., Jacob, C. & Jagers, P. Modelling the PCR amplification process by a size-dependent branching process and estimation of the efficiency. Adv. Appl. Probab. 36, 602–615 (2004).
23. Peccoud, J. & Jacob, C. Theoretical uncertainty of measurements using quantitative polymerase chain reaction. Biophys. J. 71, 101–108 (1996).
24. Kebschull, J. M. & Zador, A. M. Sources of PCR-induced distortions in high-throughput sequencing data sets. Nucleic Acids Res. 43, e143 (2015).
25. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010).
26. Quail, M. A. et al. Optimal enzymes for amplifying sequencing libraries. Nat. Methods 9, 10–11 (2012).
27. Chen, Y., Liu, T., Yu, C., Chiang, T. & Hwang, C. Effects of GC bias in next-generation-sequencing data on de novo genome assembly. PLoS ONE 8, e62856 (2013).
28. Benjamini, Y. & Speed, T. P. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 40, e72 (2012).
29. Organick, L. et al. Probing the physical limits of reliable DNA data retrieval. Nat. Commun. 11, 1–7 (2020).

## Acknowledgements

We would like to thank Patrick Finn, Siyuan Chen, Andrew Stewart, Bernadette Arias, and Emily Leproust from Twist Bioscience for supplying us with the DNA and for suggestions on the analysis of DNA synthesis data. We also thank Leila Zelnick and Jeff Nivala for discussion and comments on the paper.

## Author information

### Contributions

Y-J.C. designed, performed, and analyzed experiments. C.T. and C.B. designed experiments and analyzed data. L.O. performed experiments. S.D.A. ran the DNA sequence encoder and decoder, and analyzed data. P.W. and B.P. contributed improvements to the DNA synthesis process. G.S., K.S., and L.C. directed and supervised the work.

### Corresponding authors

Correspondence to Yuan-Jyue Chen or Luis Ceze or Karin Strauss.

## Ethics declarations

### Competing interests

Y-J.C., S.D.A., and K.S. are or were Microsoft employees while this project was conducted. P.W. and B.P. are Twist Bioscience employees. All other authors declare no competing interests.

Peer review information: Nature Communications thanks Stephen Yip and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

Chen, YJ., Takahashi, C.N., Organick, L. et al. Quantifying molecular bias in DNA data storage. Nat Commun 11, 3264 (2020). https://doi.org/10.1038/s41467-020-16958-3
