Towards practical, high-capacity, low-maintenance information storage in synthesized DNA

Journal name: Nature
Volume: 494
Pages: 77–80
DOI: doi:10.1038/nature11875

Digital production, transmission and storage have revolutionized how we access and use information but have also made archiving an increasingly complex task that requires active, continuing maintenance of digital media. This challenge has focused some interest on DNA as an attractive target for information storage [1] because of its capacity for high-density information encoding, longevity under easily achieved conditions [2–4] and proven track record as an information bearer. Previous DNA-based information storage approaches have encoded only trivial amounts of information [5–7] or were not amenable to scaling up [8]; they also used no robust error correction and did not examine their cost-efficiency for large-scale information archival [9]. Here we describe a scalable method that can reliably store more information than has been handled before. We encoded computer files totalling 739 kilobytes of hard-disk storage and with an estimated Shannon information [10] of 5.2×10^6 bits into a DNA code, synthesized this DNA, sequenced it and reconstructed the original files with 100% accuracy. Theoretical analysis indicates that our DNA-based storage scheme could be scaled far beyond current global information volumes and offers a realistic technology for large-scale, long-term and infrequently accessed digital archiving. In fact, current trends in technological advances are reducing DNA synthesis costs at a pace that should make our scheme cost-effective for sub-50-year archiving within a decade.
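
The sub-50-year claim can be sanity-checked with a back-of-envelope calculation. The sketch below is a simplification assumed here, not the authors' cost model (which is in the Supplementary Information): tape is taken to cost one fixed "transfer" each time the archive is rewritten, and DNA a single up-front synthesis cost, so DNA breaks even once the number of tape transfers equals the DNA-to-tape cost ratio. The ratio ranges are those quoted in the Figure 2c caption; the function name `break_even_years` is hypothetical.

```python
def break_even_years(relative_cost, transfer_period_years):
    """Years after which a one-off DNA synthesis cost is matched by
    accumulated tape transfers (assumed linear model, see lead-in).

    relative_cost: DNA synthesis cost divided by the cost of one tape transfer.
    transfer_period_years: how often the tape archive is read and rewritten.
    """
    return relative_cost * transfer_period_years


# Relative-cost ranges quoted in the Figure 2c caption (for 1 MB of data).
RATIO_RANGES = {
    "current costs": (125.0, 500.0),
    "synthesis 10x cheaper": (12.5, 50.0),
    "synthesis 100x cheaper": (1.25, 5.0),
}

for label, (low, high) in RATIO_RANGES.items():
    for period in (5, 10):
        shortest = break_even_years(low, period)
        longest = break_even_years(high, period)
        print(f"{label}, transfers every {period} yr: "
              f"break-even after {shortest:g} to {longest:g} yr")
```

Under this assumed linear model, a two-order-of-magnitude drop in synthesis cost (ratios 1.25–5) gives break-even times of roughly 6 to 50 years for 5- to 10-year transfer cycles.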

Figures

  1. Figure 1: Digital information encoding in DNA.

    Digital information (a, in blue), here binary digits holding the ASCII codes for part of Shakespeare’s sonnet 18, was converted to base-3 (b, red) using a Huffman code that replaces each byte with five or six base-3 digits (trits). This in turn was converted in silico to our DNA code (c, green) by replacement of each trit with one of the three nucleotides different from the previous one used, ensuring no homopolymers were generated. This formed the basis for a large number of overlapping segments of length 100 bases with overlap of 75 bases, creating fourfold redundancy (d, green and, with alternate segments reverse complemented for added data security, violet). Indexing DNA codes were added (yellow), also encoded as non-repeating DNA nucleotides. See Supplementary Information for further details.

  2. Figure 2: Scaling properties and robustness of DNA-based storage.

    a, Encoding efficiency and costs change as the amount of stored information increases. The x axis (logarithmic scale) represents the total amount of information to be encoded. Common data scales are indicated, including the three zettabyte (3 ZB, 3×10^21 bytes) global data estimate, shown in red. The black line (y-axis scale to left) indicates encoding efficiency, measured as the proportion of synthesized bases available for data encoding. The blue curves (y-axis scale to right) indicate the corresponding effect on encoding costs, both at current synthesis cost levels (solid line) and in the case of a two-order-of-magnitude reduction (dashed line). b, Per-recovered-base error rate (y axis) as a function of sequencing coverage, represented by the percentage of the original 79.6×10^6 read-pairs sampled (x axis; logarithmic scale). The blue curve represents the four files recovered without human intervention: the error is zero when ≥2% of the original reads are used. The grey curve is obtained by Monte Carlo simulation from our theoretical error rate model. The orange curve represents the file (watsoncrick.pdf) that required manual correction: the minimum possible error rate is 0.0036%. The boxed area is shown magnified in the inset. c, Timescales for which DNA-based storage is cost-effective. The blue curve indicates the relationship between break-even time beyond which DNA storage is less expensive than magnetic tape (x axis) and relative cost of DNA-storage synthesis and tape transfer fixed costs (y axis), assuming the tape archive has to be read and rewritten every 5 yr. The orange curve corresponds to tape transfers every 10 yr; broken curves correspond to other transfer periods as indicated. In the green-shaded region, DNA storage is cost-effective when transfers occur more frequently than every 10 yr; in the yellow-shaded region, DNA storage is cost-effective when transfers occur every 5–10 yr; in the red-shaded region tape is less expensive when transfers occur less frequently than every 5 yr. Grey-shaded ranges of relative costs of DNA synthesis to tape transfer are 125–500 (current costs for 1 MB of data), 12.5–50 (achieved if DNA synthesis costs are reduced by one order of magnitude) and 1.25–5 (costs reduced by two orders of magnitude). Note the logarithmic scales on both axes. See Supplementary Information for further details.
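
The encoding steps in Figure 1 (c, d) can be sketched in Python. This is a minimal illustration, not the authors' software. The Huffman step (b) is skipped, so the input is assumed to be a list of trits already; the choice of which permitted nucleotide stands for which trit, the starting "previous" base and all function names are assumptions made here. Indexing codes are not modelled. The authoritative scheme is in the Supplementary Information.

```python
BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def trits_to_dna(trits, prev="A"):
    """Replace each trit (0, 1 or 2) with one of the three bases that differ
    from the previous base, so no base is ever repeated (no homopolymers).
    The ordering of candidates and the initial 'prev' are assumptions."""
    out = []
    for t in trits:
        candidates = [b for b in BASES if b != prev]  # always three bases
        prev = candidates[t]
        out.append(prev)
    return "".join(out)


def dna_to_trits(dna, prev="A"):
    """Inverse of trits_to_dna under the same assumed mapping."""
    trits = []
    for base in dna:
        candidates = [b for b in BASES if b != prev]
        trits.append(candidates.index(base))
        prev = base
    return trits


def reverse_complement(seq):
    """Complement every base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def overlapping_segments(dna, length=100, step=25):
    """Cut the DNA into 100-base segments that start every 25 bases
    (75-base overlap); every second segment is reverse complemented."""
    segments = []
    for i, start in enumerate(range(0, len(dna) - length + 1, step)):
        seg = dna[start:start + length]
        if i % 2 == 1:
            seg = reverse_complement(seg)
        segments.append(seg)
    return segments


# Demo on an arbitrary trit stream (not real Huffman output).
demo_trits = [i % 3 for i in range(200)]
demo_dna = trits_to_dna(demo_trits)
demo_segments = overlapping_segments(demo_dna)
print(len(demo_dna), "bases ->", len(demo_segments), "segments")
```

With 100-base segments starting every 25 bases, each interior base falls in four segments, which is the fourfold redundancy of the caption; in this sketch the bases near the two ends are covered fewer times.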

Accession codes

Primary accessions

Sequence Read Archive

References

  1. Baum, E. B. Building an associative memory vastly larger than the brain. Science 268, 583–585 (1995)
  2. Cox, J. P. L. Long-term data storage in DNA. Trends Biotechnol. 19, 247–250 (2001)
  3. Anchordoquy, T. J. & Molina, M. C. Preservation of DNA. Cell Preserv. Technol. 5, 180–188 (2007)
  4. Bonnet, J. et al. Chain and conformation stability of solid-state DNA: implications for room temperature storage. Nucleic Acids Res. 38, 1531–1546 (2010)
  5. Clelland, C. T., Risca, V. & Bancroft, C. Hiding messages in DNA microdots. Nature 399, 533–534 (1999)
  6. Kac, E. Genesis (1999); available at http://www.ekac.org/geninfo.html (accessed, 10 May 2012)
  7. Ailenberg, M. & Rotstein, O. D. An improved Huffman coding method for archiving text, images, and music characters in DNA. Biotechniques 47, 747–754 (2009)
  8. Gibson, D. G. et al. Creation of a bacterial cell controlled by a chemically synthesized genome. Science 329, 52–56 (2010)
  9. Church, G. M., Gao, Y. & Kosuri, S. Next-generation digital information storage in DNA. Science 337, 1628 (2012)
  10. MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms (Cambridge Univ. Press, 2003)
  11. Erlich, H. A., Gelfand, D. & Sninsky, J. J. Recent advances in the polymerase chain reaction. Science 252, 1643–1651 (1991)
  12. Monaco, A. P. & Larin, Z. YACs, BACs, PACs and MACs: artificial chromosomes as research tools. Trends Biotechnol. 12, 280–286 (1994)
  13. Carr, P. A. & Church, G. M. Genome engineering. Nature Biotechnol. 27, 1151–1162 (2009)
  14. Willerslev, E. et al. Ancient biomolecules from deep ice cores reveal a forested southern Greenland. Science 317, 111–114 (2007)
  15. Green, R. E. et al. A draft sequence of the Neandertal genome. Science 328, 710–722 (2010)
  16. Kari, L. & Mahalingam, K. in Algorithms and Theory of Computation Handbook Vol. 2, 2nd edn (eds Atallah, M. J. & Blanton, M.) 31-1–31-24 (Chapman & Hall, 2009)
  17. Păun, G., Rozenberg, G. & Salomaa, A. DNA Computing: New Computing Paradigms (Springer, 1998)
  18. Watson, J. D. & Crick, F. H. C. Molecular structure of nucleic acids. Nature 171, 737–738 (1953)
  19. Niedringhaus, T. P., Milanova, D., Kerby, M. B., Snyder, M. P. & Barron, A. E. Landscape of next-generation sequencing technologies. Anal. Chem. 83, 4327–4341 (2011)
  20. LeProust, E. M. et al. Synthesis of high-quality libraries of long (150mer) oligonucleotides by a novel depurination controlled process. Nucleic Acids Res. 38, 2522–2540 (2010)
  21. Massingham, T. & Goldman, N. All Your Base: a fast and accurate probabilistic approach to base calling. Genome Biol. 13, R13 (2012)
  22. Gantz, J. & Reinsel, D. Extracting Value from Chaos (IDC, 2011)
  23. Brand, S. The Clock of the Long Now (Basic Books, 1999)
  24. Digital archiving. History flushed. Economist 403, 56–57 (28 April 2012); available at http://www.economist.com/node/21553410 (2012)
  25. Bessone, N., Cancio, G., Murray, S. & Taurelli, G. Increasing the efficiency of tape-based storage backends. J. Phys. Conf. Ser. 219, 062038 (2010)
  26. Baker, M. et al. in Proc. 1st ACM SIGOPS/EuroSys European Conf. on Computer Systems (eds Berbers, Y. & Zwaenepoel, W.) 221–234 (ACM, 2006)
  27. Yuille, M. et al. The UK DNA banking network: a “fair access” biobank. Cell Tissue Bank. 11, 241–251 (2010)
  28. Global Crop Diversity Trust Svalbard Global Seed Vault. (2012); available at http://www.croptrust.org/main/content/svalbard-global-seed-vault (accessed, 10 May 2012)

Author information

Affiliations

  1. European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SD, UK

    • Nick Goldman,
    • Paul Bertone,
    • Christophe Dessimoz,
    • Botond Sipos &
    • Ewan Birney
  2. Agilent Technologies, Genomics–LSSU, 5301 Stevens Creek Boulevard, Santa Clara, California 95051, USA

    • Siyuan Chen &
    • Emily M. LeProust

Contributions

N.G. and E.B. conceived and planned the project and devised the information-encoding methods. P.B. advised on oligo design and Next-Generation Sequencing protocols, prepared the DNA library and managed the sequencing process. S.C. and E.M.L. provided custom oligonucleotides. N.G. wrote the software for encoding and decoding information into/from DNA and analysed the data. N.G., E.B., C.D. and B.S. modelled the scaling properties of DNA storage. N.G. wrote the paper with discussions and contributions from all other authors. N.G. and C.D. produced the figures.

Competing financial interests

S.C. and E.M.L. are employees of Agilent Technologies, a commercial provider of OLS pools. N.G. and E.B. are named inventors on a patent application on technologies described in this work.

Corresponding author

Correspondence to:

Data are available at http://www.ebi.ac.uk/goldman-srv/DNA-storage and in the Sequence Read Archive (SRA) with accession number ERP002040.

Supplementary information

PDF files

  1. Supplementary Information 1 (1.9M)

    This file contains Supplementary Tables 1-4, Supplementary Figures 1-9, Supplementary Methods and Data, a Supplementary Discussion and Supplementary references. This file was replaced on 14 February 2013 to correct the DNA sequence in Supplementary Figure 8, which was misaligned.

  2. Supplementary Information 2 (244K)

    This file contains the full formal specification of the digital information encoding scheme.

  3. Supplementary Information 3 (411K)

    This file contains the FastQC quality-control report for the Illumina HiSeq 2000 sequencing run.

Zip files

  1. Supplementary Data 1 (647K)

    This zipped file contains the five original files encoded and decoded in this study, namely wssnt10.txt (ASCII text file containing text of all 154 Shakespeare sonnets), watsoncrick.pdf (PDF of Watson & Crick’s (1953) paper describing the structure of DNA), MLK_excerpt_VBR_45-85.mp3 (MP3 file containing a 26 s excerpt from Martin Luther King's 1963 "I Have A Dream" speech), EBI.jp2 (JPEG 2000 format medium resolution colour photograph of the European Bioinformatics Institute) and View_huff3.cd.new (ASCII text file defining the Huffman code used to convert bytes of encoded files to base 3).

Text files

  1. Supplementary Data 2 (6K)

    This file contains the GATK ErrorRatePerCycle report for the Illumina HiSeq 2000 sequencing run.

Comments

  1. Comment #57415

    Thomas Dandekar said:

    Light-gated DNA storage is essential

    COMMENT ON N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. LeProust, B. Sipos & E. Birney Nature 494, 77-80 (2013)

    Goldman et al. (1) show that DNA can in fact be used as solid and permanent data storage. In a certain sense this is obvious (we all use it as genetic storage, not only throughout our lives but throughout our evolutionary history). However, by demonstrating the “obvious”, Goldman et al. (1) transformed the general statement into a strong proof-of-concept example of the potential of DNA as a storage technology. Arguably, the key steps presented had already been shown recently (2) or, at least conceptually, long before (3,4). Nevertheless, the strong proof of concept shown by Goldman et al. (1) in their inspiring paper is a milestone towards using DNA information storage technology: they describe a scalable method to reliably store large volumes of information in DNA with 100% accuracy for large-scale, long-term and infrequently accessed digital archiving.

    However, we argue here that one more critical technology advance is needed before a junction between electronic data processing and molecular data storage and processing can really take off. Otherwise there is a serious risk that DNA storage will never gain momentum. This risk is supported, for instance, by the failure so far (in spite of inspiring inventions) to translate DNA computing (5) and RNA logical gates (6) into a technology that delivers not only “interesting” results but technological power and spread. To achieve a robust, addressable and user-friendly molecular information processing technology, we argue that a combination of the advantages of nanotechnology, molecular biology and external, user-specific control is necessary, much as PCR took off by combining primer-directed specific recall of information with robust, heat-stable Taq polymerase. We claim that, for DNA storage to take off, technical input must feed directly into molecular circuits and the molecular result must feed directly into technical processing machinery (e.g. without requiring a technical apparatus and cumbersome sequencing steps to decipher the stored information). A direct connection from molecular processing in cells and DNA to technical computers is necessary to achieve speed and calculation potential.

    Electronic properties of DNA (7) are difficult to handle. We suggest light-gated proteins (8) as an efficient way to link DNA information processing to in silico processing step by step. Light-gated proteins allow (i) control of their own and other enzyme activities, (ii) control of gene expression and protein-protein interactions, and (iii) patterning and directing of cell-to-cell communication and integration of circuits. Containment features keep the high biological repair and replication potential of such biobricks (9) under control; together these achieve an extremely robust active DNA storage technology without negative side effects or uncontrolled risks. The critical steps still to be achieved, and the blueprint of the active DNA storage design we are currently exploring, include light-gated protein constructs for rapid light-directed DNA synthesis as well as direct DNA-sequence readout via optical signals.

    In conclusion, our response to the recent work by Goldman et al. (1) is that active DNA storage technology is critical if DNA storage is to really take off and be broadly used. This includes user-directed molecular DNA synthesis and sequencing, in particular by light-gated proteins. Without active DNA storage, the technology will remain a technological tour de force: in ten years maybe cheap, but slow in effective information recall, let alone calculations.

    Thomas Dandekar [1,2], Daniel Lopez [3], Dominik Schaack [1]
    1 - Dept. of Bioinformatics, Biocenter, University of Würzburg, Am Hubland, 97074 Würzburg, Germany. e-mail: dandekar@biozentrum.uni-wuerzburg.de; phone ++49-931-318-4551; Fax -4552
    2 - EMBL, Meyerhofstrasse 1, 69117 Heidelberg, Germany
    3 - Research center for infectious disease, Josef Schneider Str. 2/ D15, 97080 Würzburg, Germany
