The density and long-term stability of DNA make it an appealing storage medium, particularly for long-term data archiving. Existing DNA storage technologies involve the synthesis and sequencing of multiple nominally identical molecules in parallel, resulting in information redundancy. We report the development of encoding and decoding methods that exploit this redundancy using composite DNA letters. A composite DNA letter is a representation of a position in a sequence that consists of a mixture of all four DNA nucleotides in a predetermined ratio. Our methods encode data using fewer synthesis cycles. We encode 6.4 MB into composite DNA, with distinguishable composition medians, using 20% fewer synthesis cycles per unit of data, as compared to previous reports. We also simulate encoding with larger composite alphabets, with distinguishable composition deciles, to show that 75% fewer synthesis cycles are potentially sufficient. We describe applicable error-correcting codes and inference methods, and investigate error patterns in the context of composite DNA letters.
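To make the idea of a composite letter concrete, the following is a minimal sketch and not the authors' published code. It assumes a hypothetical "resolution" parameter `k`, so that each letter is a ratio of A, C, G and T expressed as four non-negative integers summing to `k`. It enumerates such an alphabet, computes the resulting upper bound on bits per synthesis cycle, and infers a letter from the read counts observed at one position using a simple multinomial likelihood. The function names, the smoothing weight `eps` and the example counts are all illustrative assumptions.

```python
import math
from itertools import product

BASES = "ACGT"  # order of the entries in every ratio and count tuple below


def composite_alphabet(k):
    """All ratios (a, c, g, t) of non-negative integers summing to k."""
    return [r for r in product(range(k + 1), repeat=4) if sum(r) == k]


def bits_per_letter(k):
    """Upper bound on bits per synthesis cycle if every letter were usable.

    The number of ratios with resolution k is C(k + 3, 3).
    """
    return math.log2(math.comb(k + 3, 3))


def infer_letter(counts, alphabet, k, eps=0.01):
    """Return the letter with the highest multinomial log-likelihood.

    counts: observed read counts for (A, C, G, T) at one position.
    eps: assumed smoothing weight, so that a base with nominal ratio 0
         still has a small non-zero probability (e.g. sequencing errors).
    """
    def log_lik(letter):
        probs = [(1 - eps) * x / k + eps / 4 for x in letter]
        return sum(n * math.log(p) for n, p in zip(counts, probs))

    return max(alphabet, key=log_lik)


alphabet = composite_alphabet(6)
print(len(alphabet), round(bits_per_letter(6), 2))
print(infer_letter((48, 2, 51, 0), alphabet, 6))
```

With `k = 1` the alphabet reduces to the four standard nucleotides and the bound is 2 bits per cycle. Larger `k` raises the bound, but only if enough reads are sequenced to tell neighbouring ratios apart, which is why the redundancy of parallel synthesis and sequencing matters here.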
All raw sequencing data are available from the European Nucleotide Archive (ENA) under accession PRJEB32427. This includes sequencing of the large-scale experiment described in Figs. 2–4, sequencing of the experiment with large alphabets described in Fig. 5 and sequencing of the error analysis experiment described in Fig. 5. All other data are available within the article or its supplementary information.
All original software code included in this study is available online. Alteration of the previously published DNA fountain code to support composite DNA is available from https://github.com/leon-anavy/dna-fountain. Code used for Reed–Solomon error correction (altered from previously published code) is available from https://github.com/leon-anavy/Reed-Solomon. Custom code used for the analyses presented in this study is available from https://github.com/leon-anavy/composite-DNA.
We thank T. Katz-Ezov and T. Hashimshony from the Technion Genome Center for advice and assistance with oligonucleotide design and sequencing experiments. We also thank P. Weiss from Twist Bioscience for technical support and assistance with DNA synthesis. Finally, we thank the Yakhini and Amit research groups for valuable comments and discussions. L. Anavy is supported by the Adams Fellowships Program of the Israel Academy of Sciences and Humanities. This project received funding from the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement 664918 (MRG-Grammar).
L.A., Z.Y. and R.A. are the inventors of a patent application for the method described in this article. The initial filing was assigned United States Provisional Patent Application No. 62/674,114. The remaining authors declare no competing financial interests.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Figs. 1–17 and Supplementary Note
Physical density calculations of composite DNA storage. This includes the large-scale experiment and the dilution experiment.
Logical density calculations of composite DNA storage. This includes all the experiments, theoretical encodings and simulation experiments.
Oligonucleotide design for large-alphabet experiments and error analysis.
Oligonucleotide design for the large-scale composite DNA storage.
Oligonucleotide design for the simulations of large composite alphabet DNA storage.
Simulation results of large composite alphabet DNA storage.