Random access in large-scale DNA data storage

Organick, Lee; Ang, Siena Dumas; Chen, Yuan-Jyue; Lopez, Randolph; Yekhanin, Sergey; Makarychev, Konstantin; Racz, Miklos Z; Kamath, Govinda; Gopalan, Parikshit; Nguyen, Bichlien; Takahashi, Christopher N; Newman, Sharon; Parker, Hsing-Yeh; Rashtchian, Cyrus; Stewart, Kendall; Gupta, Gagan; Carlson, Robert; Mulligan, John; Carmean, Douglas; Seelig, Georg; Ceze, Luis; Strauss, Karin

doi:10.1038/nbt.4079

Article
Published: 19 February 2018

Nature Biotechnology volume 36, pages 242–248 (2018)

An Erratum to this article was published on 06 July 2018

This article has been updated

Abstract

Synthetic DNA is durable and can encode digital data with high density, making it an attractive medium for data storage. However, recovering stored data on a large scale currently requires all the DNA in a pool to be sequenced, even if only a subset of the information needs to be extracted. Here, we encode and store 35 distinct files (over 200 MB of data) in more than 13 million DNA oligonucleotides, and show that we can recover each file individually and with no errors, using a random access approach. We design and validate a large library of primers that enable individual recovery of all files stored within the DNA. We also develop an algorithm that greatly reduces the sequencing read coverage required for error-free decoding by maximizing information from all sequence reads. These advances demonstrate a viable, large-scale system for DNA data storage and retrieval.

**Figure 1: Overview of the DNA data storage workflow and stored data.**

**Figure 2: Design of random access primers and coding algorithm.**

**Figure 3: Experimental error analysis and decoding, sequencing using Illumina's NextSeq.**

**Figure 4: Sequencing using Oxford Nanopore Technologies' MinION.**

Change history

06 March 2018
In the version of this article initially published, the references in the reference list were in the wrong order; the references have been renumbered as follows: 3 as 2; 5 as 3; 6 as 8; 7 as 9; 8 as 11; 9 as 6; 10 as 12; 11 as 5; 12 as 13; 13 as 7; 16 as 10; and no. 2, “Hoch, J.A. & Losick, R. Panspermia, spores and the Bacillus subtilis genome. Nature 390, 237–238 (1997),” has been deleted. In addition, on p.242, end of paragraph 2, the citation in “experiments⁷” has been deleted. The errors have been corrected in the HTML and PDF versions of the article.

References

Neiman, M.S. On the molecular memory systems and the directed mutations. Radiotekhnika 6, 1–8 (1965).
Cox, J.P.L. Long-term data storage in DNA. Trends Biotechnol. 19, 247–250 (2001).
Church, G.M., Gao, Y. & Kosuri, S. Next-generation digital information storage in DNA. Science 337, 1628 (2012).
Goldman, N. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013).
Grass, R.N., Heckel, R., Puddu, M., Paunescu, D. & Stark, W.J. Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angew. Chem. Int. Ed. 54, 2552–2555 (2015).
Blawat, M. et al. Forward error correction for DNA data storage. Procedia Comput. Sci. 80, 1011–1022 (2016).
Erlich, Y. & Zielinski, D. DNA Fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017).
Yazdi, S.M.H.T., Yuan, Y., Ma, J., Zhao, H. & Milenkovic, O. A rewritable, random-access DNA-based storage system. Sci. Rep. 5, 14138 (2015).
Bornholt, J. et al. in Proc. Int. Conf. ASPLOS. 637–649 (ACM, 2016).
Yazdi, S.M.H.T., Gabrys, R. & Milenkovic, O. Portable and error-free DNA-based data storage. Sci. Rep. 7, 5011 (2017).
Kosuri, S. & Church, G.M. Large-scale de novo DNA synthesis: technologies and applications. Nat. Methods 11, 499–507 (2014).
Xu, Q., Schlabach, M.R., Hannon, G.J. & Elledge, S.J. Design of 240,000 orthogonal 25mer DNA barcode probes. Proc. Natl. Acad. Sci. USA 106, 2289–2294 (2009).
Batu, T., Kannan, S., Khanna, S. & McGregor, A. Reconstructing strings from random traces. Proc. Fifteenth Annu. ACM-SIAM SODA'04. 2004, 910–918 (2004).
Pellicer, J., Fay, M.F. & Leitch, I.J. The largest eukaryotic genome of them all? Bot. J. Linn. Soc. 164, 10–15 (2010).
Zadeh, J.N. et al. NUPACK: analysis and design of nucleic acid systems. J. Comput. Chem. 32, 170–173 (2011).

Acknowledgements

We would like to thank B. Peck, P. Finn, S. Chen, A. Stewart, B. Arias, and E. Leproust from Twist Bioscience for supplying the DNA, suggesting protocol refinements, and offering input to our data analysis. We also thank J. Bornholt, K. D'Silva, and A. Levskaya for their help in the early stages of this project, and Y. Chou for her help in preparing samples for distribution. This work was supported in part by a sponsored research agreement by Microsoft, by NSF award CCF-1409831 to L.C. and G.S., and by NSF award CCF-1317653 to G.S.

Author information

Konstantin Makarychev, Miklos Z Racz, Govinda Kamath, Parikshit Gopalan & Sharon Newman
Present addresses: VMware, Palo Alto, California, USA (P.G.); Stanford University, Stanford, California, USA (G.K. and S.N.); Northwestern University, Evanston, Illinois, USA (K.M.); Princeton University, Princeton, New Jersey, USA (M.Z.R.).

Authors and Affiliations

Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, USA
Lee Organick, Christopher N Takahashi, Sharon Newman, Kendall Stewart, Georg Seelig & Luis Ceze
Microsoft Research, Redmond, Washington, USA
Siena Dumas Ang, Yuan-Jyue Chen, Sergey Yekhanin, Konstantin Makarychev, Miklos Z Racz, Govinda Kamath, Parikshit Gopalan, Bichlien Nguyen, Hsing-Yeh Parker, Cyrus Rashtchian, Gagan Gupta, Robert Carlson, John Mulligan, Douglas Carmean & Karin Strauss
Department of Bioengineering, University of Washington, Seattle, Washington, USA
Randolph Lopez
Department of Electrical Engineering, University of Washington, Seattle, Washington, USA
Georg Seelig

Contributions

L.O., Y.J.C., and R.L. designed protocols and performed experiments. S.Y., S.D.A., K.M., M.Z.R., C.R., and P.G. designed and implemented the encoding and decoding pipeline. S.D.A., M.Z.R., G.K., Ke.S., and C.N.T. collected and analyzed data. B.N., C.N.T., S.N., G.G., H.Y.P., R.C., and J.M. assisted in designing and evaluating experiments. D.C., G.S., L.C., and Ka.S. designed experiments, analyzed data, and supervised the work.

Corresponding authors

Correspondence to Luis Ceze or Karin Strauss.

Ethics declarations

Competing interests

S.D.A., Y.-J.C., S.Y., K.M., M.Z.R., G.K., P.G., B.N., H.-Y.P., C.R., G.G., R.C., J.M., D.C., and K.S. are or were employees at Microsoft Research.

Integrated supplementary information

Supplementary Figure 1 Primer sequence design.

(a) Method for primer design. A random 20-mer continues to mutate until it satisfies the design criteria explained above. After satisfying these criteria, the primer is filtered by secondary structure and melting temperature. After generating a library of primers, the library is screened using BLAST to further improve sequence orthogonality. (b) Example of scoring for a primer. If the primer violates a design criterion, all bases related to the violation receive a +1 score.
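
The mutate-until-valid loop and the per-base scoring in this caption can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the two criteria (a GC-content window and a maximum homopolymer run), the thresholds, and the names `score_primer` and `design_primer` are all assumptions made here, since the actual design criteria are not listed in this text. The secondary-structure, melting-temperature, and BLAST filtering steps are omitted.

```python
import random

BASES = "ACGT"

def score_primer(seq, gc_range=(0.45, 0.55), max_run=3):
    """Return one score per base: +1 for each violated criterion the base is part of.

    Both criteria and their thresholds are hypothetical stand-ins.
    """
    scores = [0] * len(seq)
    # Assumed criterion 1: GC fraction inside gc_range; a violation implicates every base.
    gc = sum(b in "GC" for b in seq) / len(seq)
    if not gc_range[0] <= gc <= gc_range[1]:
        scores = [s + 1 for s in scores]
    # Assumed criterion 2: no homopolymer run longer than max_run bases.
    start = 0
    for i in range(1, len(seq) + 1):
        if i == len(seq) or seq[i] != seq[start]:
            if i - start > max_run:
                for j in range(start, i):
                    scores[j] += 1
            start = i
    return scores

def design_primer(rng, length=20, max_steps=10_000):
    """Mutate a random sequence until every base has a score of zero."""
    seq = [rng.choice(BASES) for _ in range(length)]
    for _ in range(max_steps):
        scores = score_primer(seq)
        if not any(scores):
            return "".join(seq)
        # Pick a position to mutate, weighted by how many violations it is part of.
        pos = rng.choices(range(length), weights=scores)[0]
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return None  # no valid primer found within the step budget

if __name__ == "__main__":
    print(design_primer(random.Random(0)))
```

Weighting the mutation site by score is one plausible use of the per-base scores in panel (b); a uniform choice among violating bases would also fit the description.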

Supplementary Figure 2 Primer library scalability estimates.

(a) The total number of primer pairs that pass the selection criteria described in Supplementary Fig. 1 (y-axis) increases with the log of the number of starting random 20-mers (x-axis, logarithmic scale). The six blue dots are primer libraries generated from different numbers of starting random 20-mers. (b) The ratio of primers passing the primer-payload collision detection algorithm described in Supplementary Fig. 4 (y-axis) decreases as the log of the amount of information to be stored increases (x-axis, logarithmic scale). Blue points represent the mean passing ratio of the six primer libraries generated in a. Error bars indicate the standard deviation across these six primer libraries.

Supplementary Figure 3 Randomization algorithm.

Digital data are iteratively randomized to reduce collisions between primers and payloads.
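
One way to picture iterative randomization is the sketch below. It rests on several assumptions that are not stated in this text: the data are XORed with a seeded pseudo-random byte stream, bytes map to bases at two bits per base, and a "collision" means the payload contains an exact k-base substring of a primer. The function names, the seed search, and `k=8` are hypothetical, and reverse-complement matches are ignored.

```python
import random

BASES = "ACGT"

def xor_with_seed(data: bytes, seed: int) -> bytes:
    """XOR data with a pseudo-random byte stream; applying it twice restores the data."""
    rng = random.Random(seed)
    return bytes(b ^ rng.randrange(256) for b in data)

def to_dna(data: bytes) -> str:
    """Map each byte to four bases, two bits per base, most significant bits first."""
    return "".join(BASES[(byte >> shift) & 3] for byte in data for shift in (6, 4, 2, 0))

def collides(strand: str, primers, k: int = 8) -> bool:
    """True if the strand contains any k-base substring of any primer (assumed test)."""
    return any(p[i:i + k] in strand for p in primers for i in range(len(p) - k + 1))

def randomize(data: bytes, primers, max_seeds: int = 1000):
    """Try successive seeds until the randomized payload no longer collides."""
    for seed in range(max_seeds):
        strand = to_dna(xor_with_seed(data, seed))
        if not collides(strand, primers):
            return seed, strand  # the seed must be kept to undo the XOR when decoding
    raise RuntimeError("no collision-free seed found")

if __name__ == "__main__":
    payload = bytes(16)    # all-zero data encodes to a run of 64 A's
    primers = ["A" * 20]   # toy primer that collides with the raw payload
    seed, strand = randomize(payload, primers)
    print(seed, strand)
```

Repetitive data such as long runs of zeros would otherwise encode to repetitive DNA, which is the kind of payload most likely to resemble a primer; randomizing first makes such matches a matter of chance that a retry with a new seed can resolve.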

Supplementary Figure 4 Comparing amplification of files with and without collision detection and our primer design method.

All 15 bp traces are size markers used by the Qiagen QIAxcel system. The same instrument and profile were used for each trace. Each trace is representative of three independent trials with virtually identical results. (a) “Simple conditions” indicates single-file pools. i. Trace of an amplified file designed with collision detection. ii. Trace of an amplified file designed without collision detection. (b) “Complex conditions” indicates multi-file pools. i. Trace of an amplified file designed with collision detection (9-file pool; amplified file is 17.4% of the pool). ii. Trace of an amplified file designed without using collision detection (6-file pool; amplified file is 18.0% of the pool).

Supplementary Figure 5 Random access and library preparation layout for sequencing.

First, random access regions are used to select files for sequencing. Through ePCR, a 25N region is added to the oligos to improve nucleotide diversity. Then, samples are ligated to Illumina sequencing adaptors using a modified Illumina TruSeq Nano kit protocol. Finally, prepared samples are sequenced on an Illumina NextSeq instrument with a 10–20% PhiX spike-in.

About this article

Cite this article

Organick, L., Ang, S., Chen, YJ. et al. Random access in large-scale DNA data storage. Nat Biotechnol 36, 242–248 (2018). https://doi.org/10.1038/nbt.4079

Received: 13 July 2017
Accepted: 11 January 2018
Published: 19 February 2018
Issue Date: March 2018
DOI: https://doi.org/10.1038/nbt.4079
