Robust direct digital-to-biological data storage in living cells

Yim, Sung Sun; McBee, Ross M.; Song, Alan M.; Huang, Yiming; Sheth, Ravi U.; Wang, Harris H.

doi:10.1038/s41589-020-00711-4

Published: 11 January 2021

Nature Chemical Biology volume 17, pages 246–253 (2021)

Abstract

DNA has been the predominant information storage medium for biology and holds great promise as a next-generation high-density data medium in the digital era. Currently, the vast majority of DNA-based data storage approaches rely on in vitro DNA synthesis. As such, there are limited methods to encode digital data into the chromosomes of living cells in a single step. Here, we describe a new electrogenetic framework for direct storage of digital data in living cells. Using an engineered redox-responsive CRISPR adaptation system, we encoded binary data in 3-bit units into CRISPR arrays of bacterial cells by electrical stimulation. We demonstrate multiplex data encoding into barcoded cell populations to yield meaningful information storage and capacity up to 72 bits, which can be maintained over many generations in natural open environments. This work establishes a direct digital-to-biological data storage framework and advances our capacity for information exchange between silicon- and carbon-based entities.

**Fig. 1: Direct digital-to-biological data storage into CRISPR arrays.**

**Fig. 2: Encoding 3-bit binary data into *Escherichia coli* populations.**

**Fig. 3: Writing the text message ‘hello world!’ containing 72 bits into barcoded *E. coli* cells.**

**Fig. 4: Cell envelope as a physical barrier to protect data.**

Data availability

All data supporting the findings of this study are available within the Article and its Supplementary Information or are available from the authors upon request. Sequencing data associated with this study are available at NCBI SRA under PRJNA625964.

Code availability

All of the CRISPR spacer extraction and mapping software is available at https://github.com/ravisheth/trace or from the authors upon request.

References

1. Church, G. M., Gao, Y. & Kosuri, S. Next-generation digital information storage in DNA. Science 337, 1628 (2012).
2. Erlich, Y. & Zielinski, D. DNA fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017).
3. Allentoft, M. E. et al. The half-life of DNA in bone: measuring decay kinetics in 158 dated fossils. Proc. Biol. Sci. 279, 4724–4733 (2012).
4. Ceze, L., Nivala, J. & Strauss, K. Molecular digital data storage using DNA. Nat. Rev. Genet. 20, 456–466 (2019).
5. Newman, S. et al. High density DNA data storage library via dehydration with digital microfluidic retrieval. Nat. Commun. 10, 1706 (2019).
6. Organick, L. et al. Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242–248 (2018).
7. Anavy, L., Vaknin, I., Atar, O., Amit, R. & Yakhini, Z. Data storage in DNA with fewer synthesis cycles using composite DNA letters. Nat. Biotechnol. 37, 1229–1236 (2019).
8. Lee, H. H., Kalhor, R., Goela, N., Bolot, J. & Church, G. M. Terminator-free template-independent enzymatic DNA synthesis for digital information storage. Nat. Commun. 10, 2383 (2019).
9. Farzadfard, F. & Lu, T. K. Emerging applications for DNA writers and molecular recorders. Science 361, 870–875 (2018).
10. Sheth, R. U. & Wang, H. H. DNA-based memory devices for recording cellular events. Nat. Rev. Genet. 19, 718–732 (2018).
11. Chan, M. M. et al. Molecular recording of mammalian embryogenesis. Nature 570, 77–82 (2019).
12. Kalhor, R. et al. Developmental barcoding of whole mouse via homing CRISPR. Science 361, eaat9804 (2018).
13. McKenna, A. et al. Whole-organism lineage tracing by combinatorial and cumulative genome editing. Science 353, aaf7907 (2016).
14. Munck, C., Sheth, R. U., Freedberg, D. E. & Wang, H. H. Recording mobile DNA in the gut microbiota using an Escherichia coli CRISPR-Cas spacer acquisition platform. Nat. Commun. 11, 95 (2020).
15. Farzadfard, F. & Lu, T. K. Genomically encoded analog memory with precise in vivo DNA writing in living cell populations. Science 346, 1256272 (2014).
16. Loveless, T. B. et al. DNA writing at a single genomic site enables lineage tracing and analog recording in mammalian cells. Preprint at bioRxiv https://doi.org/10.1101/639120 (2019).
17. Schmidt, F., Cherepkova, M. Y. & Platt, R. J. Transcriptional recording by CRISPR spacer acquisition from RNA. Nature 562, 380–385 (2018).
18. Tang, W. & Liu, D. R. Rewritable multi-event analog recording in bacterial and mammalian cells. Science 360, eaap8992 (2018).
19. Yang, L. et al. Permanent genetic memory with >1-byte capacity. Nat. Methods 11, 1261–1266 (2014).
20. Mimee, M., Tucker, A. C., Voigt, C. A. & Lu, T. K. Programming a human commensal bacterium, Bacteroides thetaiotaomicron, to sense and respond to stimuli in the murine gut microbiota. Cell Syst. 1, 62–71 (2015).
21. Riglar, D. T. et al. Engineered bacteria can function in the mammalian gut long-term as live diagnostics of inflammation. Nat. Biotechnol. 35, 653–658 (2017).
22. Sheth, R. U., Yim, S. S., Wu, F. L. & Wang, H. H. Multiplex recording of cellular events over time on CRISPR biological tape. Science 358, 1457–1461 (2017).
23. Roquet, N., Soleimany, A. P., Ferris, A. C., Aaronson, S. & Lu, T. K. Synthetic recombinase-based state machines in living cells. Science 353, aad8559 (2016).
24. Farzadfard, F. et al. Single-nucleotide-resolution computing and memory in living cells. Mol. Cell 75, 769–780 (2019).
25. Akhmetov, A., Ellington, A. D. & Marcotte, E. M. A highly parallel strategy for storage of digital information in living cells. BMC Biotechnol. 18, 64 (2018).
26. Shipman, S. L., Nivala, J., Macklis, J. D. & Church, G. M. CRISPR-Cas encoding of a digital movie into the genomes of a population of living bacteria. Nature 547, 345–349 (2017).
27. Liu, Y. et al. Connecting biology to electronics: molecular communication via redox modality. Adv. Healthc. Mater. 6 (2017); https://doi.org/10.1002/adhm.201700789
28. Mimee, M. et al. An ingestible bacterial-electronic system to monitor gastrointestinal health. Science 360, 915–918 (2018).
29. Weber, W. et al. A synthetic mammalian electro-genetic transcription circuit. Nucleic Acids Res. 37, e33 (2009).
30. Bellin, D. L. et al. Electrochemical camera chip for simultaneous imaging of multiple metabolites in biofilms. Nat. Commun. 7, 10535 (2016).
31. VanArsdale, E. et al. A co-culture based tyrosine-tyrosinase electrochemical gene circuit for connecting cellular communication with electronic networks. ACS Synth. Biol. 9, 1117–1128 (2020).
32. Gordonov, T. et al. Electronic modulation of biochemical signal generation. Nat. Nanotechnol. 9, 605–610 (2014).
33. Tschirhart, T. et al. Electronic control of gene expression and cell behaviour in Escherichia coli through redox signalling. Nat. Commun. 8, 14030 (2017).
34. Bhokisham, N. et al. A redox-based electrogenetic CRISPR system to connect with and control biological information networks. Nat. Commun. 11, 2427 (2020).
35. Nunez, J. K., Lee, A. S., Engelman, A. & Doudna, J. A. Integrase-mediated spacer acquisition during CRISPR-Cas adaptive immunity. Nature 519, 193–198 (2015).
36. Michel, J. B. et al. Quantitative analysis of culture using millions of digitized books. Science 331, 176–182 (2011).
37. Nivala, J., Shipman, S. L. & Church, G. M. Spontaneous CRISPR loci generation in vivo by non-canonical spacer integration. Nat. Microbiol. 3, 310–318 (2018).
38. Din, M. O., Martin, A., Razinkov, I., Csicsery, N. & Hasty, J. Interfacing gene circuits with microelectronics through engineered population dynamics. Sci. Adv. 6, eaaz8344 (2020).
39. Fernandez-Rodriguez, J., Moser, F., Song, M. & Voigt, C. A. Engineering RGB color vision into Escherichia coli. Nat. Chem. Biol. 13, 706–708 (2017).
40. Piraner, D. I., Abedi, M. H., Moser, B. A., Lee-Gosselin, A. & Shapiro, M. G. Tunable thermal bioswitches for in vivo control of microbial therapeutics. Nat. Chem. Biol. 13, 75–80 (2017).
41. Makarova, K. S. et al. Evolutionary classification of CRISPR-Cas systems: a burst of class 2 and derived variants. Nat. Rev. Microbiol. 18, 67–83 (2020).
42. Heler, R. et al. Mutations in Cas9 enhance the rate of acquisition of viral spacer sequences during the CRISPR-Cas immune response. Mol. Cell 65, 168–175 (2017).
43. Shipman, S. L., Nivala, J., Macklis, J. D. & Church, G. M. Molecular recordings by directed CRISPR spacer acquisition. Science 353, aaf1175 (2016).
44. Wright, A. V. et al. A functional mini-integrase in a two-protein type V-C CRISPR system. Mol. Cell 73, 727–737 (2019).
45. Blazejewski, T., Ho, H.-I. & Wang, H. H. Synthetic sequence entanglement augments stability and containment of genetic information in cells. Science 365, 595–598 (2019).
46. Deatherage, D. E., Leon, D., Rodriguez, A. E., Omar, S. K. & Barrick, J. E. Directed evolution of Escherichia coli with lower-than-natural plasmid mutation rates. Nucleic Acids Res. 46, 9236–9250 (2018).
47. Cox, J. P. Long-term data storage in DNA. Trends Biotechnol. 19, 247–250 (2001).
48. Li, F. et al. Modular engineering to increase intracellular NAD(H/⁺) promotes rate of extracellular electron transfer of Shewanella oneidensis. Nat. Commun. 9, 3637 (2018).
49. Lee, H. H. et al. Functional genomics of the rapidly replicating bacterium Vibrio natriegens by CRISPRi. Nat. Microbiol. 4, 1105–1113 (2019).
50. Davis, J. et al. In vivo multi-dimensional information-keeping in Halobacterium salinarum. Preprint at bioRxiv https://doi.org/10.1101/2020.02.14.949925 (2020).
51. Sharan, S. K., Thomason, L. C., Kuznetsov, S. G. & Court, D. L. Recombineering: a homologous recombination-based method of genetic engineering. Nat. Protoc. 4, 206–223 (2009).
52. St-Pierre, F. et al. One-step cloning and chromosomal integration of DNA. ACS Synth. Biol. 2, 537–541 (2013).
53. Ji, B. W. et al. Quantifying spatiotemporal variability and noise in absolute microbiota abundances using replicate sampling. Nat. Methods 16, 731–736 (2019).
54. Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
55. Wang, Q., Garrity, G. M., Tiedje, J. M. & Cole, J. R. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ. Microbiol. 73, 5261–5267 (2007).
56. Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res. 47, W256–W259 (2019).

Acknowledgements

We thank A. Kaufman and members of the Wang laboratory for advice and comments on the manuscript. H.H.W. acknowledges funding support from ONR (N00014-17-1-2353), NSF (MCB‐1453219), NIH (1R01AI132403-01) and the Burroughs Wellcome Fund (PATH1016691). S.S.Y. is grateful for support from the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Education (NRF-2017R1A6A3A03003401). R.U.S. was supported by a Fannie and John Hertz Foundation Fellowship and an NSF Graduate Research Fellowship (DGE-1644869).

Author information

Authors and Affiliations

Department of Systems Biology, Columbia University, New York, NY, USA
Sung Sun Yim, Ross M. McBee, Alan M. Song, Yiming Huang, Ravi U. Sheth & Harris H. Wang
Department of Biological Sciences, Columbia University, New York, NY, USA
Ross M. McBee
Integrated Program in Cellular, Molecular, and Biomedical Studies, Columbia University, New York, NY, USA
Yiming Huang & Ravi U. Sheth
Department of Pathology and Cell Biology, Columbia University, New York, NY, USA
Harris H. Wang

Contributions

S.S.Y., R.U.S. and H.H.W. developed the initial concept. S.S.Y. performed experiments and analyzed the results under the supervision of H.H.W. S.S.Y., R.M.M. and A.M.S. designed and constructed the electrochemical redox controller set-up. S.S.Y. and Y.H. designed the error correction pipeline. S.S.Y. and H.H.W. wrote the manuscript, with input from all authors.

Corresponding author

Correspondence to Harris H. Wang.

Ethics declarations

Competing interests

H.H.W. is a scientific advisor to SNIPR Biome. The other authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Development of a redox-sensing DNA-based cellular recorder for direct digital-to-biological data storage.

This system is composed of two distinct modules: (i) a ‘sensing module’ that converts a desired biological signal into a change in copy number of a trigger plasmid (pTrig), and (ii) a ‘writing module’ that overexpresses Cas1-Cas2 from a recording plasmid (pRec) to unidirectionally expand genomic CRISPR arrays with novel ~33 bp spacers acquired from genomic or plasmid DNA sources in the cell. In the presence of the desired signal, cells experience a shift in their intracellular DNA pool, driven by an increase in pTrig copy number, which results in an acquisition bias for pTrig-derived spacers amongst expanding CRISPR arrays. a, The lacI gene in the previous pRec²² was replaced with the soxR gene from E. coli, and the lac promoter in the previous pTrig²² was replaced with the soxS promoter from E. coli. The P1 replication system is inactive in the absence of oxidative stress, and a mini-F origin keeps the pTrig plasmid copy number low. Upon induction with oxidative stress, SoxR detaches from the soxS promoter, activating the P1 replication system and increasing the copy number of the plasmid. b, pTrig copy number in the presence of various concentrations of phenazine methosulfate (PMS) under aerobic conditions. pRec (with an additional copy of the soxR gene) yields a higher fold change in pTrig copy number through more efficient repression in the absence of the inducer. c, pTrig copy numbers in the presence of pRec and various concentrations of PMS, and FCN(R) or FCN(O), under anaerobic conditions. Fold changes of the pTrig copy numbers at the given concentrations of FCN(R) or FCN(O) are plotted. d,e, Various aTc concentrations (d) and induction times (e) for the expression of the cas1 and cas2 genes were tested for CRISPR array expansion. f,g, Various FCN(R) and FCN(O) concentrations were tested for pTrig copy number induction (f) and pTrig-derived spacer incorporation (g). The proportions of pTrig-derived spacers among all newly incorporated spacers are displayed. All measurements are based on three biological replicates. Error bars represent s.d. of three biological replicates.

Extended Data Fig. 2 Construction of a multi-channel electrochemical redox controller.

a, In an anaerobic chamber, a Raspberry Pi controls three 8-channel relay modules (24 relays in total), which switch the electrical signal from a power supply to each chamber pair on or off, based on a Python script running on a wirelessly connected PC. b, A pair of working and counter chambers is connected by an agar salt bridge. In the working chamber, cells are incubated in M9 minimal medium supplemented with antibiotics, aTc, FCN(R) and PMS. The counter chamber is filled with M9 minimal medium supplemented with FCN(O) and PMS. c, A photograph of the multi-channel electrochemical redox controller in an anaerobic chamber. d, Changes in electrochemical redox states of FCN(R) in a working chamber (left) and FCN(O) in a counter chamber (right) measured by absorbance at 420 nm with (0.5 V) and without (0.0 V) electronic signals. All measurements are based on three replicates. Error bars represent s.d. of three replicates.
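The relay control described in panel a can be pictured as a per-round on/off schedule. The following is a minimal sketch, not the authors' script: it assumes that a bit value of 1 corresponds to the 0.5 V signal and 0 to no signal, that the first bit of a profile is written in the first round, and that one relay serves one chamber pair. The function name `stimulation_schedule` and the example profiles are hypothetical, and no GPIO calls are made.

```python
from typing import List

N_RELAYS = 24  # three 8-channel relay modules, as stated in the caption


def stimulation_schedule(profiles: List[str]) -> List[List[bool]]:
    """Return schedule[round][channel]; True means the relay is switched on.

    Assumption (not from the source): bit '1' = signal on, bit '0' = signal
    off, and bit i of each profile is applied in encoding round i.
    """
    if not profiles:
        return []
    if len(profiles) > N_RELAYS:
        raise ValueError("more profiles than available relays")
    n_rounds = len(profiles[0])
    for p in profiles:
        if len(p) != n_rounds or set(p) - {"0", "1"}:
            raise ValueError(f"invalid profile: {p!r}")
    return [[p[r] == "1" for p in profiles] for r in range(n_rounds)]


# Hypothetical 3-bit profiles for three channels.
profiles = ["101", "010", "111"]
schedule = stimulation_schedule(profiles)
for r, states in enumerate(schedule, start=1):
    on = [ch for ch, is_on in enumerate(states) if is_on]
    print(f"round {r}: relays on -> {on}")
```

On real hardware each `True` entry would translate into closing the corresponding relay for the duration of that encoding round; that step is omitted here because the wiring and timing are not given in the caption.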

Extended Data Fig. 3 Encoding of 3-bit binary data profiles.

a, Schematic diagram of the experimental steps for multi-round encoding. After each round of electrical stimulation, the cell population was recovered aerobically in rich medium (LB) so that the induced or uninduced plasmid copy number from the previous encoding round could be diluted out and reset to a low level. b, Anaerobic and aerobic conditions were compared to determine the recovery condition. c, Overlaid distributions of the plasmid copy numbers with and without signals at each round over the course of the multi-round encoding (Fig. 2b). d, CRISPR array expansion over the course of the experiment. e, The 3-bit binary data profiles are grouped by the number of electronic signals, and the proportions of pTrig-derived spacers among all newly incorporated spacers are displayed. f, Magnetic bead-based size enrichment was performed to enrich the sequencing reads for expanded arrays with more new spacers (longer arrays). Frequencies of arrays of different lengths (unexpanded and L1–L4) with and without size enrichment are plotted. g, Principal component analysis of the array-type frequency profiles for the 3-bit digital data profiles. All 9 independent biological replicates are shown for each 3-bit digital data profile. The first three independent datasets used for training of the Random Forest classifier are highlighted. All measurements are based on two or more biological replicates. Error bars represent s.d. of three or more biological replicates.

Extended Data Fig. 4 Performance of a Random Forest classifier for data reconstruction.

a, Confusion matrix from 10 rounds of cross-validation of the Random Forest classifier, each trained on 2 randomly selected datasets for each 3-bit digital data profile from the 3 independent experiments and tested on the left-out dataset. b, Importance of features (array-types) for the Random Forest classifier in Fig. 2f. c, Classification performance as a function of the number of CRISPR arrays. CRISPR arrays with new uniquely mapping spacers were randomly subsampled to various numbers for the 3-bit digital data profiles and classifications were performed. Recall accuracies for distinguishing 8 different types of 3-bit digital data profiles are displayed as a function of the number of expanded arrays with uniquely mapping spacers (grey: all arrays, red: L2/L3 arrays). The number of sequencing reads corresponding to the number of expanded arrays with uniquely mapping spacers (grey: all arrays) is also provided as an additional x-axis. Shaded regions represent the 95% confidence interval of 10 iterations of subsampling and classification. d, Recall accuracies for distinguishing 8 different types of 3-bit digital data profiles with varying proportions of randomly selected training datasets for each 3-bit digital data profile. Shaded regions represent the 95% confidence interval of 100 iterations of subsampling and classification.

Extended Data Fig. 5 Barcoding CRISPR arrays for multiplexed encoding.

a, CRISPR arrays can be barcoded with 8-bp unique sequences either downstream of the first spacer region or within the direct repeat (DR) region. b, CRISPR array expansion rates (relative to the wild-type array) of 69 DR-barcoded CRISPR arrays and 24 spacer-barcoded CRISPR arrays. c, The distribution of array expansion rates of spacer-barcoded CRISPR arrays is much more uniform and consistent than that of DR-barcoded CRISPR arrays. A DR variant (d1) that was more efficient than the wild-type DR sequence in the initial 96-well plate-based test is highlighted. d, The d1 DR variant was tested again in tube culture. Under this condition, however, the DR variant did not show significantly higher activity than the wild-type DR sequence. e, Comparison of CRISPR array expansion rates measured individually or in a pool. Shaded region represents the 95% confidence interval for the linear regression (dashed grey line). Sample sizes (n) and Pearson correlation coefficient (r) are shown. All measurements are based on three biological replicates. Error bars represent s.d. of three biological replicates.

Extended Data Fig. 6 Projections on the scale of DRIVES.

a, Data storage capacity (‘n’ bits of information, or ‘n’ rounds of encoding) per cell population is estimated as a function of Cas1-Cas2 activity (‘X’, the proportion of the cell population that expanded its arrays with a new spacer after a single round of encoding). Here, a proportion ‘Xⁿ’ of the cell population would have expanded its arrays in every round, resulting in ‘n’ new spacers (Ln arrays) after ‘n’ rounds of encoding, and we assumed that the sampling capacity for the Ln array population governs the data storage capacity. We considered various sampling depths ‘D’, where a proportion ‘D’ of the cell population can be sufficiently sampled. This ‘D’ could be affected by many factors, including the sequencing depth and size enrichment efficiency. We assumed that if ‘Xⁿ’ is equal to or higher than the given sampling depth constraint ‘D’, ‘n’ bits can be stored and reliably decoded. For example, when 0.001 of the cell population can be sufficiently sampled (D=0.001), the maximum data storage capacity would be 3 bits (n=3) with the current Cas1-Cas2 activity level (X=0.1), as in our current experimental dataset (highlighted in red in the plot). When 0.0001 of the cell population can be sufficiently sampled (D=0.0001), the maximum data storage capacity would be 4 bits (n=4) with the current Cas1-Cas2 activity level (X=0.1). Although the Illumina MiSeq v2 300 cycles kit used in this study can read only up to 5 new spacers, we assumed that sequencing read length is not the limiting factor in this projection, as other long-read sequencing technologies could be employed. b, Estimated total data storage capacity across barcoded cell populations as a function of Cas1-Cas2 activity and the number of parallel channels in the culture platform at two different sampling depths (D=0.001 and D=0.00001). Storing more data per cell population would require more rounds of encoding, which takes longer, and a larger number of parallel channels would require more barcoded cell populations and a more sophisticated design of the culture platform. The current capacity of the system with 24 channels in the culture platform is highlighted in blue in the plot.
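The projection rule in panel a amounts to finding the largest n with Xⁿ ≥ D, that is, n = ⌊log D / log X⌋. The sketch below reproduces the two worked examples from the caption. The function names are illustrative, and the total-capacity helper assumes one barcoded population per channel, which matches the 24-channel, 72-bit figure but is otherwise an assumption.

```python
def max_bits(x: float, d: float) -> int:
    """Largest n such that x**n >= d.

    x: Cas1-Cas2 activity (fraction of cells gaining a spacer per round).
    d: sampling depth (fraction of the population that can be sampled).
    A small relative tolerance guards against floating-point rounding
    when x**n is meant to equal d exactly.
    """
    if not (0.0 < x < 1.0):
        raise ValueError("x must be strictly between 0 and 1")
    if not (0.0 < d <= 1.0):
        raise ValueError("d must be in (0, 1]")
    n = 0
    while x ** (n + 1) >= d * (1.0 - 1e-9):
        n += 1
    return n


def total_capacity(x: float, d: float, channels: int) -> int:
    """Total bits, assuming one barcoded population per parallel channel."""
    return max_bits(x, d) * channels


print(max_bits(0.1, 0.001))            # caption example: D = 0.001
print(max_bits(0.1, 0.0001))           # caption example: D = 0.0001
print(total_capacity(0.1, 0.001, 24))  # current 24-channel platform
```

Looping over integer n rather than taking a floor of a logarithm ratio avoids an off-by-one when the ratio lands just below an integer in floating point.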

Extended Data Fig. 7 Design of 6-bit encoding tables for text messages.

a, Probability of correct classification for each of the 3-bit digital data profiles by the Random Forest classifier on the newly generated independent datasets, calculated from the result in Fig. 2f. b, DEC and OPT encoding tables with estimated probabilities of correct classification for the 64 characters. The OPT 6-bit encoding table was designed by considering the correct classification probability and the usage frequency of the characters (https://mdickens.me/typing/letter_frequency.html). c, Probability of correct decoding for the 64 characters (ordered by usage) with the DEC and OPT 6-bit encoding tables. d, Comparison of predicted probabilities of correct decoding for various text messages based on the two encoding tables. The predicted probability of correct decoding for each character or text message was calculated by multiplying the correct decoding probabilities of its constituent 3-bit digital data profile units.
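The multiplication rule in panel d, and the idea of pairing frequently used characters with reliably classified codes, can be sketched as follows. Everything numeric here is hypothetical: the per-profile accuracies, the 64-character alphabet and its "usage" ordering are placeholders, the numerically ordered table is only a guess at what a DEC-style table looks like, and the greedy pairing is one plausible reading of how an OPT-style table could be built, not the published tables.

```python
import string
from itertools import product
from math import prod

# Hypothetical probabilities of correct classification per 3-bit profile.
UNIT_ACCURACY = {
    "000": 0.99, "001": 0.90, "010": 0.88, "011": 0.83,
    "100": 0.92, "101": 0.85, "110": 0.83, "111": 0.95,
}

# Hypothetical 64-character alphabet, treated as if sorted by usage frequency.
CHARS_BY_USAGE = (" " + string.ascii_lowercase + string.digits
                  + string.punctuation)[:64]

ALL_CODES = ["".join(bits) for bits in product("01", repeat=6)]


def code_accuracy(code: str) -> float:
    """Product of the accuracies of the 3-bit units making up a code."""
    return prod(UNIT_ACCURACY[code[i:i + 3]] for i in range(0, len(code), 3))


# DEC-style table (assumption): codes assigned in plain numerical order.
DEC_TABLE = dict(zip(CHARS_BY_USAGE, ALL_CODES))
# OPT-style table (assumption): most-used characters get the most reliable codes.
OPT_TABLE = dict(zip(CHARS_BY_USAGE,
                     sorted(ALL_CODES, key=code_accuracy, reverse=True)))


def encode(message: str, table: dict) -> list:
    """Encode a message and split it into 3-bit units, one per population."""
    bits = "".join(table[ch] for ch in message)
    return [bits[i:i + 3] for i in range(0, len(bits), 3)]


def message_accuracy(message: str, table: dict) -> float:
    """Predicted probability that every unit of the message decodes correctly."""
    return prod(UNIT_ACCURACY[u] for u in encode(message, table))


units = encode("hello world!", OPT_TABLE)
print(len(units), "three-bit units,", 3 * len(units), "bits in total")
print("DEC-style:", message_accuracy("hello world!", DEC_TABLE))
print("OPT-style:", message_accuracy("hello world!", OPT_TABLE))
```

The 12-character message maps to 72 bits, or 24 three-bit units, which is why it fits exactly onto 24 barcoded populations.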

Extended Data Fig. 8 Reading ‘hello world!’ from subsampled sequencing reads.

Sequencing reads from each barcode in the ‘hello world!’-encoded cell population using the OPT table were randomly subsampled to various numbers and classifications were performed. Recall accuracies for (a) distinguishing 3-bit digital data profiles for 24 barcoded populations or for (b) calling correct bits out of 72 bits are displayed as a function of the number of expanded arrays with uniquely mapping spacers (grey: all arrays, red: L2/L3 arrays). The number of sequencing reads corresponding to the number of expanded arrays with uniquely mapping spacers (grey: all arrays) is also provided as an additional x-axis. Shaded regions represent the 95% confidence interval of 10 iterations of subsampling and classification.

Extended Data Fig. 9 Improving data reconstruction with error correction.

a, By using every sixth bit as a check point (checksum) for the first 5 bits, errors in data reconstruction can be detected and corrected for the selected 32 combinations of 6-bit digital data profiles based on the classifier’s confusion probability in Fig. 2f and Extended Data Fig. 9b. For example, a digital input ‘011110’ could be classified as ‘011110’, ‘011010’, ‘001110’ or ‘001010’ with probabilities of 69%, 14%, 14% or 3%, respectively. Of these 4 possible initial classifications, the last 3 are wrong; the 2 wrong classifications with a single-bit error can be detected by the check point values and fixed. However, the classification result with a 2-bit error cannot be detected by the check point value and therefore cannot be fixed. For all 32 combinations of 6-bit digital data profiles, the possible classification results, their probabilities, and whether they can be fixed are summarized in Supplementary Table 2. b, Confusion probability for each of the 3-bit digital data profiles based on Fig. 2f. c, The check point values for each combination of eight 3-bit and four 2-bit digital data profiles. d, OPT2 encoding table with the estimated probabilities of correct classification for the 32 characters. e, Probability of correct decoding for the 32 characters (ordered by usage) for the OPT and OPT2 6-bit encoding tables. f, ‘synbio@cu’ encoded in the genomes of barcoded E. coli populations using the OPT2 error correction strategy. Two errors from the initial classification were detected using the check points and successfully corrected as described in the figure. For classification of each barcoded cell population, an average of 492,289 total sequencing reads with 268,066 reads of expanded arrays (or 106,242 of L2/L3 arrays) that uniquely map spacers were used. Bead-based size enrichment was performed to enrich for expanded arrays and deplete unexpanded arrays. Frequencies of array-types are in log₁₀ scale. All measurements are based on a single experimental study.
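The ‘011110’ example in panel a can be checked with a few lines of arithmetic. The sketch below assumes that each 3-bit unit is classified independently and that the per-unit confusion probabilities are 0.83 (correct) and 0.17 (confused with one neighbour); these values are inferred here only because they reproduce the 69%, 14%, 14% and 3% quoted in the caption, and the actual values are those in Extended Data Fig. 9b. The check point computation itself is not reproduced, since the caption does not specify it.

```python
from itertools import product

# Assumed per-unit confusion probabilities (inferred, not taken from the paper).
CONFUSION = {
    "011": {"011": 0.83, "001": 0.17},
    "110": {"110": 0.83, "010": 0.17},
}


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def outcomes(true_code: str) -> dict:
    """Probability of every possible classification of a 6-bit input."""
    units = [true_code[i:i + 3] for i in range(0, len(true_code), 3)]
    results = {}
    for calls in product(*(CONFUSION[u].items() for u in units)):
        called = "".join(code for code, _ in calls)
        p = 1.0
        for _, q in calls:
            p *= q
        results[called] = results.get(called, 0.0) + p
    return results


true_code = "011110"
dist = outcomes(true_code)
# Per the caption, single-bit errors are caught by the check point and fixed.
recoverable = sum(p for code, p in dist.items()
                  if hamming(code, true_code) <= 1)

for code, p in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(code, f"{p:.0%}", "bit errors:", hamming(code, true_code))
print(f"recoverable after single-error correction: {recoverable:.1%}")
```

Under these assumed numbers, only the double-error outcome ‘001010’ remains unrecoverable, which is the case the caption singles out.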

Extended Data Fig. 10 Data stability in replicating cells.

A mixed pool of 24 barcoded cell populations encoded with the 72-bit text message ‘hello world!’ in Fig. 3 was subsequently diluted 1:100 every 24 hours into 3 mL of fresh LB medium with antibiotic for a total of 16 days (~106 generations, ~6.6 generations per day). a, Data stability in the propagating cell population over 100 generations. Accuracy indicates the proportion of bits that are correctly classified. More than 90% of the 72 bits could be correctly retrieved up to ~80 generations. Shaded region represents s.d. of three biological replicates. For classification of each barcoded cell population, an average of 82,860 total sequencing reads with 40,502 reads of expanded arrays (or 17,139 of L2/L3 arrays) that uniquely map spacers were used. Bead-based size enrichment was performed to enrich for expanded arrays and deplete unexpanded arrays. b, Gradual changes in the relative abundance of the 24 barcoded cell populations over time suggest adaptive mutations with fitness effects arising in some of the subpopulations. Samples were collected at the time points indicated by arrows (days 0, 4, 6, 8, 12 and 16). All measurements are based on three biological replicates.
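The generation count quoted above follows from the dilution factor, assuming the culture regrows to roughly the same density before each passage (an assumption that the caption does not state explicitly). A short check:

```python
import math

dilution_factor = 100  # 1:100 dilution at each passage
days = 16              # one passage every 24 hours

# Doublings needed to recover from a 1:100 dilution.
generations_per_day = math.log2(dilution_factor)
total_generations = generations_per_day * days

print(f"{generations_per_day:.1f} generations per day")
print(f"{total_generations:.0f} generations over {days} days")
```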

Supplementary information

Supplementary Information

Supplementary Figs. 1–6 and Tables 1–3.

Reporting Summary

Cite this article

Yim, S.S., McBee, R.M., Song, A.M. et al. Robust direct digital-to-biological data storage in living cells. Nat Chem Biol 17, 246–253 (2021). https://doi.org/10.1038/s41589-020-00711-4

Received: 21 April 2020
Revised: 30 October 2020
Accepted: 12 November 2020
Published: 11 January 2021
Issue Date: March 2021
DOI: https://doi.org/10.1038/s41589-020-00711-4

This article is cited by

A designer synthetic chromosome fragment functions in moss
- Lian-Ge Chen
- Tianlong Lan
- Yuling Jiao
Nature Plants (2024)
DNA as a universal chemical substrate for computing and data storage
- Shuo Yang
- Bas W. A. Bögels
- Tom F. A. de Greef
Nature Reviews Chemistry (2024)
Temporally resolved transcriptional recording in E. coli DNA using a Retro-Cascorder
- Sierra K. Lear
- Santiago C. Lopez
- Seth L. Shipman
Nature Protocols (2023)
A biological camera that captures and stores images directly into DNA
- Cheng Kai Lim
- Jing Wui Yeoh
- Chueh Loo Poh
Nature Communications (2023)
An optogenetic toolkit for light-inducible antibiotic resistance
- Michael B. Sheets
- Nathan Tague
- Mary J. Dunlop
Nature Communications (2023)