Because of its longevity and enormous information density, DNA is considered a promising data storage medium. In this work, we provide instructions for archiving digital information in the form of DNA and for subsequently retrieving it from the DNA. In principle, information can be represented in DNA by simply mapping the digital information to DNA sequences and synthesizing them. However, imperfections in synthesis, sequencing, storage and handling of the DNA induce errors within the molecules, making error-free information storage challenging. The procedure discussed here enables error-free storage by protecting the information with error-correcting codes. Specifically, in this protocol, we provide the technical details and precise instructions for translating digital information to DNA sequences, physically handling the biomolecules, storing them and subsequently recovering the information by sequencing the DNA. Along with the protocol, we provide computer code that automatically encodes digital information to DNA sequences and decodes the information back from DNA to a digital file. The required software is provided in a GitHub repository. The protocol relies on commercial DNA synthesis and DNA sequencing via Illumina dye sequencing, and requires 1–2 h of preparation time, 1/2 d for sequencing preparation and 2–4 h for data analysis. This protocol focuses on storage scales of ~100 kB to 15 MB, offering an ideal starting point for small experiments. It can be augmented to enable higher data volumes and random access to the data, and it can also accommodate future sequencing and synthesis technologies by changing the parameters of the encoder/decoder to account for the corresponding error rates.
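To make the "simple mapping" mentioned above concrete, the following is a minimal Python sketch of a naive mapping at two bits per nucleotide. It is not the encoder provided with the protocol: the bit-to-base assignment (`BASES`) and the function names `bytes_to_dna` and `dna_to_bytes` are hypothetical and chosen only for illustration.

```python
# Naive 2-bits-per-nucleotide mapping.
# Illustrative assumption only; this is NOT the protocol's error-correcting encoder.
BASES = "ACGT"  # hypothetical assignment: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def bytes_to_dna(data: bytes) -> str:
    """Map each byte to four nucleotides, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def dna_to_bytes(seq: str) -> bytes:
    """Invert bytes_to_dna; the sequence length must be a multiple of 4."""
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            # BASES.index raises ValueError for characters other than A, C, G, T
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


encoded = bytes_to_dna(b"A")        # 0x41 = 01 00 00 01
decoded = dna_to_bytes(encoded)
corrupted = dna_to_bytes("CAAG")    # one substituted base changes the decoded byte
```

With this mapping, a single substituted nucleotide silently changes the decoded byte, which illustrates why the protocol protects the data with error-correcting codes rather than relying on a direct mapping.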
We thank ICB/ETH Zurich for funding and the Beat Christen Group at ETH for giving access to the iSeq 100 sequencer.
The authors declare no competing interests.
Key references using this protocol
Grass, R. et al. Angew. Chem. Int. Ed. 54, 2552–2555 (2015): https://doi.org/10.1002/anie.201411378
Chen, W. et al. Adv. Funct. Mater. 29, 1–8 (2019): https://doi.org/10.1002/adfm.201901672
Heckel, R. et al. Sci. Rep. 9, 9663 (2019): https://doi.org/10.1038/s41598-019-45832-6
README file. Description of coding scheme with additional explanations of coding parameters and examples for how to utilize the code. Additionally, code installation instructions are given for Windows, Linux, and macOS.
Error-correcting code (C++). Error-correcting scheme for storing information in DNA using Reed–Solomon codes.
Coding parameters. File to aid parameter selection by choosing redundancy, file size, number of sequences to be synthesized, and the sequence.
Files to be encoded. Sample file to be encoded as an illustrative example of the protocol’s procedure. Here the first five protocols published in Nature Protocols were chosen.
The output of the decoder using Supplementary Data 2 as input, executed on a macOS operating system with default parameters as given in the Anticipated results (K = 32, N = 34, l = 4, nuss = 12, n = 12,472, k = 9,000, resulting in sequences of length 102 nt each).
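As a rough sanity check on the default parameters quoted above, the arithmetic below can be run in Python. The interpretation is an assumption, not something stated in this text: `K` and `N` are taken to be the information and total symbol counts of a per-sequence (inner) code, and `k` and `n` the information and total sequence counts of an across-sequence (outer) code. The roles of `l` and `nuss` are not described here, so they are left out.

```python
# Parameter values as quoted in the supplementary description.
# Their interpretation below is an ASSUMPTION made for illustration.
K, N = 32, 34          # assumed: inner code, information / total symbols per sequence
k, n = 9_000, 12_472   # assumed: outer code, information / total sequences
seq_len_nt = 102       # sequence length stated in the text

inner_rate = K / N                  # fraction of each sequence carrying information
outer_rate = k / n                  # fraction of sequences carrying information
combined_rate = inner_rate * outer_rate  # ignores any overhead related to l and nuss

nt_per_symbol = seq_len_nt / N      # nucleotides per symbol under the assumption above
redundant_sequences = n - k         # sequences added by the assumed outer code
total_nt = n * seq_len_nt           # total nucleotides across all sequences

print(f"inner rate    = {inner_rate:.4f}")
print(f"outer rate    = {outer_rate:.4f}")
print(f"combined rate = {combined_rate:.4f}")
print(f"nt per symbol = {nt_per_symbol}")
print(f"total nt      = {total_nt}")
```

Under these assumptions, the stated length of 102 nt corresponds to exactly 3 nt per symbol, and roughly 28% of the sequences are redundancy.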
About this article
Cite this article
Meiser, L.C., Antkowiak, P.L., Koch, J. et al. Reading and writing digital data in DNA. Nat Protoc (2019) doi:10.1038/s41596-019-0244-5