Abstract

Synthetic DNA is durable and can encode digital data at high density, making it an attractive medium for data storage. However, recovering stored data at large scale currently requires sequencing all the DNA in a pool, even if only a subset of the information needs to be extracted. Here, we encode and store 35 distinct files (over 200 MB of data) in more than 13 million DNA oligonucleotides and show that we can recover each file individually and with no errors using a random access approach. We design and validate a large library of primers that enable individual recovery of all files stored within the DNA. We also develop an algorithm that greatly reduces the sequencing read coverage required for error-free decoding by maximizing information from all sequence reads. These advances demonstrate a viable, large-scale system for DNA data storage and retrieval.

Change history

  • Corrected online 06 March 2018

    In the version of this article initially published, the references in the reference list were in the wrong order; the references have been renumbered as follows: 3 as 2; 5 as 3; 6 as 8; 7 as 9; 8 as 11; 9 as 6; 10 as 12; 11 as 5; 12 as 13; 13 as 7; 16 as 10; and no. 2, “Hoch, J.A. & Losick, R. Panspermia, spores and the Bacillus subtilis genome. Nature 390, 237–238 (1997),” has been deleted. In addition, on p.242, end of paragraph 2, the citation in “experiments7” has been deleted. The errors have been corrected in the HTML and PDF versions of the article.

Acknowledgements

We would like to thank B. Peck, P. Finn, S. Chen, A. Stewart, B. Arias, and E. Leproust from Twist Bioscience for supplying the DNA, suggesting protocol refinements, and offering input to our data analysis. We also thank J. Bornholt, K. D'Silva, and A. Levskaya for their help in the early stages of this project, and Y. Chou for her help in preparing samples for distribution. This work was supported in part by a sponsored research agreement from Microsoft, by NSF award CCF-1409831 to L.C. and G.S., and by NSF award CCF-1317653 to G.S.

Author information

Author notes

    Konstantin Makarychev, Miklos Z Racz, Govinda Kamath, Parikshit Gopalan & Sharon Newman

    Present addresses: VMware, Palo Alto, California, USA (P.G.); Stanford University, Stanford, California, USA (G.K. and S.N.); Northwestern University, Evanston, Illinois, USA (K.M.); Princeton University, Princeton, New Jersey, USA (M.Z.R.).

Affiliations

  1. Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, USA.

    Lee Organick, Christopher N Takahashi, Sharon Newman, Kendall Stewart, Georg Seelig & Luis Ceze

  2. Microsoft Research, Redmond, Washington, USA.

    Siena Dumas Ang, Yuan-Jyue Chen, Sergey Yekhanin, Konstantin Makarychev, Miklos Z Racz, Govinda Kamath, Parikshit Gopalan, Bichlien Nguyen, Hsing-Yeh Parker, Cyrus Rashtchian, Gagan Gupta, Robert Carlson, John Mulligan, Douglas Carmean & Karin Strauss

  3. Department of Bioengineering, University of Washington, Seattle, Washington, USA.

    Randolph Lopez

  4. Department of Electrical Engineering, University of Washington, Seattle, Washington, USA.

    Georg Seelig

Authors

  1. Lee Organick
  2. Siena Dumas Ang
  3. Yuan-Jyue Chen
  4. Randolph Lopez
  5. Sergey Yekhanin
  6. Konstantin Makarychev
  7. Miklos Z Racz
  8. Govinda Kamath
  9. Parikshit Gopalan
  10. Bichlien Nguyen
  11. Christopher N Takahashi
  12. Sharon Newman
  13. Hsing-Yeh Parker
  14. Cyrus Rashtchian
  15. Kendall Stewart
  16. Gagan Gupta
  17. Robert Carlson
  18. John Mulligan
  19. Douglas Carmean
  20. Georg Seelig
  21. Luis Ceze
  22. Karin Strauss

Contributions

L.O., Y.-J.C., and R.L. designed protocols and performed experiments. S.Y., S.D.A., K.M., M.Z.R., C.R., and P.G. designed and implemented the encoding and decoding pipeline. S.D.A., M.Z.R., G.K., Ke.S., and C.N.T. collected and analyzed data. B.N., C.N.T., S.N., G.G., H.-Y.P., R.C., and J.M. assisted in designing and evaluating experiments. D.C., G.S., L.C., and Ka.S. designed experiments, analyzed data, and supervised the work.

Competing interests

S.D.A., Y.-J.C., S.Y., K.M., M.Z.R., G.K., P.G., B.N., H.-Y.P., C.R., G.G., R.C., J.M., D.C., and Ka.S. are or were employees at Microsoft Research.

Corresponding authors

Correspondence to Luis Ceze or Karin Strauss.

About this article

DOI

https://doi.org/10.1038/nbt.4079
