Robust direct digital-to-biological data storage in living cells

Abstract

DNA has been the predominant information storage medium for biology and holds great promise as a next-generation high-density data medium in the digital era. Currently, the vast majority of DNA-based data storage approaches rely on in vitro DNA synthesis. As such, there are limited methods to encode digital data into the chromosomes of living cells in a single step. Here, we describe a new electrogenetic framework for direct storage of digital data in living cells. Using an engineered redox-responsive CRISPR adaptation system, we encoded binary data in 3-bit units into CRISPR arrays of bacterial cells by electrical stimulation. We demonstrate multiplex data encoding into barcoded cell populations to yield meaningful information storage and capacity up to 72 bits, which can be maintained over many generations in natural open environments. This work establishes a direct digital-to-biological data storage framework and advances our capacity for information exchange between silicon- and carbon-based entities.

Fig. 1: Direct digital-to-biological data storage into CRISPR arrays.
Fig. 2: Encoding 3-bit binary data into Escherichia coli populations.
Fig. 3: Writing the text message ‘hello world!’ containing 72 bits into barcoded E. coli cells.
Fig. 4: Cell envelope as a physical barrier to protect data.

Data availability

All data supporting the findings of this study are available within the Article and its Supplementary Information or are available from the authors upon request. Sequencing data associated with this study are available at NCBI SRA under PRJNA625964.

Code availability

All of the CRISPR spacer extraction and mapping software is available at https://github.com/ravisheth/trace or from the authors upon request.

References

  1. Church, G. M., Gao, Y. & Kosuri, S. Next-generation digital information storage in DNA. Science 337, 1628 (2012).
  2. Erlich, Y. & Zielinski, D. DNA fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017).
  3. Allentoft, M. E. et al. The half-life of DNA in bone: measuring decay kinetics in 158 dated fossils. Proc. Biol. Sci. 279, 4724–4733 (2012).
  4. Ceze, L., Nivala, J. & Strauss, K. Molecular digital data storage using DNA. Nat. Rev. Genet. 20, 456–466 (2019).
  5. Newman, S. et al. High density DNA data storage library via dehydration with digital microfluidic retrieval. Nat. Commun. 10, 1706 (2019).
  6. Organick, L. et al. Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242–248 (2018).
  7. Anavy, L., Vaknin, I., Atar, O., Amit, R. & Yakhini, Z. Data storage in DNA with fewer synthesis cycles using composite DNA letters. Nat. Biotechnol. 37, 1229–1236 (2019).
  8. Lee, H. H., Kalhor, R., Goela, N., Bolot, J. & Church, G. M. Terminator-free template-independent enzymatic DNA synthesis for digital information storage. Nat. Commun. 10, 2383 (2019).
  9. Farzadfard, F. & Lu, T. K. Emerging applications for DNA writers and molecular recorders. Science 361, 870–875 (2018).
  10. Sheth, R. U. & Wang, H. H. DNA-based memory devices for recording cellular events. Nat. Rev. Genet. 19, 718–732 (2018).
  11. Chan, M. M. et al. Molecular recording of mammalian embryogenesis. Nature 570, 77–82 (2019).
  12. Kalhor, R. et al. Developmental barcoding of whole mouse via homing CRISPR. Science 361, eaat9804 (2018).
  13. McKenna, A. et al. Whole-organism lineage tracing by combinatorial and cumulative genome editing. Science 353, aaf7907 (2016).
  14. Munck, C., Sheth, R. U., Freedberg, D. E. & Wang, H. H. Recording mobile DNA in the gut microbiota using an Escherichia coli CRISPR-Cas spacer acquisition platform. Nat. Commun. 11, 95 (2020).
  15. Farzadfard, F. & Lu, T. K. Genomically encoded analog memory with precise in vivo DNA writing in living cell populations. Science 346, 1256272 (2014).
  16. Loveless, T. B. et al. DNA writing at a single genomic site enables lineage tracing and analog recording in mammalian cells. Preprint at bioRxiv https://doi.org/10.1101/639120 (2019).
  17. Schmidt, F., Cherepkova, M. Y. & Platt, R. J. Transcriptional recording by CRISPR spacer acquisition from RNA. Nature 562, 380–385 (2018).
  18. Tang, W. & Liu, D. R. Rewritable multi-event analog recording in bacterial and mammalian cells. Science 360, eaap8992 (2018).
  19. Yang, L. et al. Permanent genetic memory with >1-byte capacity. Nat. Methods 11, 1261–1266 (2014).
  20. Mimee, M., Tucker, A. C., Voigt, C. A. & Lu, T. K. Programming a human commensal bacterium, Bacteroides thetaiotaomicron, to sense and respond to stimuli in the murine gut microbiota. Cell Syst. 1, 62–71 (2015).
  21. Riglar, D. T. et al. Engineered bacteria can function in the mammalian gut long-term as live diagnostics of inflammation. Nat. Biotechnol. 35, 653–658 (2017).
  22. Sheth, R. U., Yim, S. S., Wu, F. L. & Wang, H. H. Multiplex recording of cellular events over time on CRISPR biological tape. Science 358, 1457–1461 (2017).
  23. Roquet, N., Soleimany, A. P., Ferris, A. C., Aaronson, S. & Lu, T. K. Synthetic recombinase-based state machines in living cells. Science 353, aad8559 (2016).
  24. Farzadfard, F. et al. Single-nucleotide-resolution computing and memory in living cells. Mol. Cell 75, 769–780 (2019).
  25. Akhmetov, A., Ellington, A. D. & Marcotte, E. M. A highly parallel strategy for storage of digital information in living cells. BMC Biotechnol. 18, 64 (2018).
  26. Shipman, S. L., Nivala, J., Macklis, J. D. & Church, G. M. CRISPR-Cas encoding of a digital movie into the genomes of a population of living bacteria. Nature 547, 345–349 (2017).
  27. Liu, Y. et al. Connecting biology to electronics: molecular communication via redox modality. Adv. Healthc. Mater. 6 (2017); https://doi.org/10.1002/adhm.201700789
  28. Mimee, M. et al. An ingestible bacterial-electronic system to monitor gastrointestinal health. Science 360, 915–918 (2018).
  29. Weber, W. et al. A synthetic mammalian electro-genetic transcription circuit. Nucleic Acids Res. 37, e33 (2009).
  30. Bellin, D. L. et al. Electrochemical camera chip for simultaneous imaging of multiple metabolites in biofilms. Nat. Commun. 7, 10535 (2016).
  31. VanArsdale, E. et al. A co-culture based tyrosine-tyrosinase electrochemical gene circuit for connecting cellular communication with electronic networks. ACS Synth. Biol. 9, 1117–1128 (2020).
  32. Gordonov, T. et al. Electronic modulation of biochemical signal generation. Nat. Nanotechnol. 9, 605–610 (2014).
  33. Tschirhart, T. et al. Electronic control of gene expression and cell behaviour in Escherichia coli through redox signalling. Nat. Commun. 8, 14030 (2017).
  34. Bhokisham, N. et al. A redox-based electrogenetic CRISPR system to connect with and control biological information networks. Nat. Commun. 11, 2427 (2020).
  35. Nunez, J. K., Lee, A. S., Engelman, A. & Doudna, J. A. Integrase-mediated spacer acquisition during CRISPR-Cas adaptive immunity. Nature 519, 193–198 (2015).
  36. Michel, J. B. et al. Quantitative analysis of culture using millions of digitized books. Science 331, 176–182 (2011).
  37. Nivala, J., Shipman, S. L. & Church, G. M. Spontaneous CRISPR loci generation in vivo by non-canonical spacer integration. Nat. Microbiol. 3, 310–318 (2018).
  38. Din, M. O., Martin, A., Razinkov, I., Csicsery, N. & Hasty, J. Interfacing gene circuits with microelectronics through engineered population dynamics. Sci. Adv. 6, eaaz8344 (2020).
  39. Fernandez-Rodriguez, J., Moser, F., Song, M. & Voigt, C. A. Engineering RGB color vision into Escherichia coli. Nat. Chem. Biol. 13, 706–708 (2017).
  40. Piraner, D. I., Abedi, M. H., Moser, B. A., Lee-Gosselin, A. & Shapiro, M. G. Tunable thermal bioswitches for in vivo control of microbial therapeutics. Nat. Chem. Biol. 13, 75–80 (2017).
  41. Makarova, K. S. et al. Evolutionary classification of CRISPR-Cas systems: a burst of class 2 and derived variants. Nat. Rev. Microbiol. 18, 67–83 (2020).
  42. Heler, R. et al. Mutations in Cas9 enhance the rate of acquisition of viral spacer sequences during the CRISPR-Cas immune response. Mol. Cell 65, 168–175 (2017).
  43. Shipman, S. L., Nivala, J., Macklis, J. D. & Church, G. M. Molecular recordings by directed CRISPR spacer acquisition. Science 353, aaf1175 (2016).
  44. Wright, A. V. et al. A functional mini-integrase in a two-protein type V-C CRISPR system. Mol. Cell 73, 727–737 (2019).
  45. Blazejewski, T., Ho, H.-I. & Wang, H. H. Synthetic sequence entanglement augments stability and containment of genetic information in cells. Science 365, 595–598 (2019).
  46. Deatherage, D. E., Leon, D., Rodriguez, A. E., Omar, S. K. & Barrick, J. E. Directed evolution of Escherichia coli with lower-than-natural plasmid mutation rates. Nucleic Acids Res. 46, 9236–9250 (2018).
  47. Cox, J. P. Long-term data storage in DNA. Trends Biotechnol. 19, 247–250 (2001).
  48. Li, F. et al. Modular engineering to increase intracellular NAD (H/+) promotes rate of extracellular electron transfer of Shewanella oneidensis. Nat. Commun. 9, 3637 (2018).
  49. Lee, H. H. et al. Functional genomics of the rapidly replicating bacterium Vibrio natriegens by CRISPRi. Nat. Microbiol. 4, 1105–1113 (2019).
  50. Davis, J. et al. In vivo multi-dimensional information-keeping in Halobacterium salinarum. Preprint at bioRxiv https://doi.org/10.1101/2020.02.14.949925 (2020).
  51. Sharan, S. K., Thomason, L. C., Kuznetsov, S. G. & Court, D. L. Recombineering: a homologous recombination-based method of genetic engineering. Nat. Protoc. 4, 206–223 (2009).
  52. St-Pierre, F. et al. One-step cloning and chromosomal integration of DNA. ACS Synth. Biol. 2, 537–541 (2013).
  53. Ji, B. W. et al. Quantifying spatiotemporal variability and noise in absolute microbiota abundances using replicate sampling. Nat. Methods 16, 731–736 (2019).
  54. Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
  55. Wang, Q., Garrity, G. M., Tiedje, J. M. & Cole, J. R. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ. Microbiol. 73, 5261–5267 (2007).
  56. Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res. 47, W256–W259 (2019).

Acknowledgements

We thank A. Kaufman and members of the Wang laboratory for advice and comments on the manuscript. H.H.W. acknowledges funding support from ONR (N00014-17-1-2353), NSF (MCB‐1453219), NIH (1R01AI132403-01) and the Burroughs Wellcome Fund (PATH1016691). S.S.Y. is grateful for support from the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Education (NRF-2017R1A6A3A03003401). R.U.S. was supported by a Fannie and John Hertz Foundation Fellowship and an NSF Graduate Research Fellowship (DGE-1644869).

Author information

Contributions

S.S.Y., R.U.S. and H.H.W. developed the initial concept. S.S.Y. performed experiments and analyzed the results under the supervision of H.H.W. S.S.Y., R.M.M. and A.M.S. designed and constructed the electrochemical redox controller set-up. S.S.Y. and Y.H. designed the error correction pipeline. S.S.Y. and H.H.W. wrote the manuscript, with input from all authors.

Corresponding author

Correspondence to Harris H. Wang.

Ethics declarations

Competing interests

H.H.W. is a scientific advisor to SNIPR Biome. The other authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Development of a redox-sensing DNA-based cellular recorder for direct digital-to-biological data storage.

This system is composed of two distinct modules: (i) a ‘sensing module’ that converts a desired biological signal into a change in copy number of a trigger plasmid (pTrig), and (ii) a ‘writing module’ that overexpresses Cas1-Cas2 from a recording plasmid (pRec) to unidirectionally expand genomic CRISPR arrays with novel ~33 bp spacers acquired from genomic or plasmid DNA sources in the cell. In the presence of the desired signal, cells experience a shift in their intracellular DNA pool, driven by an increase in pTrig copy number, which results in an acquisition bias for pTrig-derived spacers amongst expanding CRISPR arrays. a, The lacI gene in the previous pRec (ref. 22) was replaced with the soxR gene from E. coli, and the lac promoter in the previous pTrig (ref. 22) was replaced with the soxS promoter from E. coli. The P1 replication system is inactive in the absence of oxidative stress, and a mini-F origin keeps the pTrig copy number low. Upon induction with oxidative stress, SoxR detaches from the soxS promoter and activates the P1 replication system to increase the copy number of the plasmid. b, pTrig copy number in the presence of various concentrations of phenazine methosulfate (PMS) under aerobic conditions. pRec, which carries an additional copy of the soxR gene, yields a higher fold change in pTrig copy number through more efficient repression in the absence of the inducer. c, pTrig copy numbers in the presence of pRec and various concentrations of PMS and FCN(R) or FCN(O) under anaerobic conditions. Fold changes of pTrig copy number at the given concentrations of FCN(R) or FCN(O) are plotted. d, Various aTc concentrations and (e) induction times for the expression of the cas1 and cas2 genes were tested for CRISPR array expansion. f, Various FCN(R) and FCN(O) concentrations were tested for pTrig copy number induction and (g) pTrig-derived spacer incorporation. The proportions of pTrig-derived spacers among all newly incorporated spacers are displayed. All measurements are based on three biological replicates. Error bars represent s.d. of three biological replicates.

Extended Data Fig. 2 Construction of a multi-channel electrochemical redox controller.

a, In an anaerobic chamber, a Raspberry Pi controls three 8-channel relay modules (24 relays in total), which switch the electrical signal from a power supply to each chamber pair on or off, based on a Python script running on a wirelessly connected PC. b, A pair of working and counter chambers is connected by an agar salt bridge. In the working chamber, cells are incubated in M9 minimal medium supplemented with antibiotics, aTc, FCN(R) and PMS. The counter chamber is filled with M9 minimal medium supplemented with FCN(O) and PMS. c, A photograph of the multi-channel electrochemical redox controller in an anaerobic chamber. d, Changes in electrochemical redox states of FCN(R) in a working chamber (left) and FCN(O) in a counter chamber (right) measured by absorbance at 420 nm with (0.5 V) and without (0.0 V) electronic signals. All measurements are based on three replicates. Error bars represent s.d. of three replicates.
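The relay layout in panel a (three 8-channel modules, 24 relays) can be sketched as a simple mapping from a channel index to a module and relay, plus a per-round on/off schedule derived from each channel's 3-bit data. This is only an illustrative sketch and not the authors' control script: the channel numbering, the convention that a '1' bit means the signal is applied in that round, and the function names `relay_address` and `round_states` are all assumptions, and no hardware is touched.

```python
from typing import List, Tuple

MODULES = 3            # three relay modules (from the caption)
RELAYS_PER_MODULE = 8  # eight channels per module (from the caption)


def relay_address(channel: int) -> Tuple[int, int]:
    """Map a channel index 0..23 to (module, relay). The numbering is assumed."""
    if not 0 <= channel < MODULES * RELAYS_PER_MODULE:
        raise ValueError("channel must be in 0..23")
    return divmod(channel, RELAYS_PER_MODULE)


def round_states(profiles: List[str], round_index: int) -> List[bool]:
    """Return, per channel, whether the signal is on in the given round.

    Assumption: the first character of each 3-bit profile is the first round,
    and '1' means the electrical signal is applied.
    """
    for p in profiles:
        if len(p) != 3 or set(p) - {"0", "1"}:
            raise ValueError("each profile must be a 3-character string of 0s and 1s")
    return [p[round_index] == "1" for p in profiles]


if __name__ == "__main__":
    profiles = ["101", "010", "111"]  # toy data for three channels
    for r in range(3):
        print("round", r, round_states(profiles, r))
```

On a real set-up, each `True` entry would be translated into closing the relay at `relay_address(channel)` for the duration of the round.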

Extended Data Fig. 3 Encoding of 3-bit binary data profiles.

a, Schematic diagram of experimental steps for multi-round encoding. After each round of electrical stimulation, the cell population was recovered aerobically in rich medium (LB) so that the induced or uninduced plasmid copy number from the previous encoding round could be diluted out and reset to a low level. b, To determine the recovery condition, anaerobic and aerobic conditions were compared. c, Overlaid distributions of the plasmid copy numbers with and without signals at each round over the course of the multi-round encoding (Fig. 2b). d, CRISPR array expansion over the course of the experiment. e, The 3-bit binary data profiles are grouped by the number of electronic signals, and the proportions of pTrig-derived spacers among all newly incorporated spacers are displayed. f, To enrich the sequencing reads for expanded arrays with more new spacers (longer arrays), magnetic bead-based size enrichment was performed. Frequencies of arrays of different lengths (unexpanded and L1-L4) with and without size enrichment are plotted. g, Principal component analysis on the array-type frequency profiles for the 3-bit digital data profiles. All 9 independent biological replicates are shown for each 3-bit digital data profile. The first three independent datasets used for training of the Random Forest classifier are highlighted. All measurements are based on two or more biological replicates. Error bars represent s.d. of three or more biological replicates.

Extended Data Fig. 4 Performance of a Random Forest classifier for data reconstruction.

a, Confusion matrix from cross-validation of the Random Forest classifier, repeated 10 times by training on 2 randomly selected datasets for each 3-bit digital data profile from the 3 independent experiments and testing the trained model on the 1 left-out dataset. b, Importance of features (array-types) for the Random Forest classifier in Fig. 2f. c, Classification performance as a function of the number of CRISPR arrays. CRISPR arrays with new uniquely mapping spacers were randomly subsampled to various numbers for the 3-bit digital data profiles and classifications were performed. Recall accuracies for distinguishing 8 different types of 3-bit digital data profiles are displayed as a function of the number of expanded arrays with uniquely mapping spacers (grey: all arrays, red: L2/L3 arrays). The number of sequencing reads corresponding to the number of expanded arrays with uniquely mapping spacers (grey: all arrays) is also provided as an additional x-axis. Shaded regions represent 95% confidence interval of 10 iterations of subsampling and classification. d, Recall accuracies for distinguishing 8 different types of 3-bit digital data profiles with varying proportions of randomly selected training datasets for each 3-bit digital data profile. Shaded regions represent 95% confidence interval of 100 iterations of subsampling and classification.

Extended Data Fig. 5 Barcoding CRISPR arrays for multiplexed encoding.

a, CRISPR arrays can be barcoded with 8-bp unique sequences either downstream of the 1st spacer region or within the direct repeat (DR) region. b, CRISPR array expansion rates (relative to the wild-type array) of 69 DR-barcoded CRISPR arrays and 24 spacer-barcoded CRISPR arrays. c, The distribution of array expansion rates of spacer-barcoded CRISPR arrays is much more uniform and consistent than that of DR-barcoded CRISPR arrays. A DR variant (d1) that was more efficient than the wild-type DR sequence in the initial 96-well plate-based test is highlighted. d, The d1 DR variant was tested again in tube culture. Under this condition, however, the DR variant did not show significantly higher activity than the wild-type DR sequence. e, Comparison of CRISPR array expansion rates measured individually or in pool. Shaded region represents 95% confidence interval for linear regression (dashed grey line). Sample sizes (n) and Pearson correlation coefficient (r) are shown. All measurements are based on three biological replicates. Error bars represent s.d. of three biological replicates.

Extended Data Fig. 6 Projections on the scale of DRIVES.

a, Data storage capacity (‘n’ bits of information or ‘n’ rounds of encoding) per cell population is estimated as a function of Cas1-Cas2 activity (‘X’, the proportion of the cell population that expanded arrays with a new spacer after a single round of encoding). Here, a proportion ‘X^n’ of the cell population would have expanded arrays in every round, resulting in ‘n’ new spacers (Ln arrays) after ‘n’ rounds of encoding, and we assumed that the sampling capacity for the Ln array population governs the data storage capacity. We considered various sampling depths ‘D’, where a proportion ‘D’ of the cell population can be sufficiently sampled. This ‘D’ could be affected by many factors including the sequencing depth and size enrichment efficiency. We assumed that if ‘X^n’ is equal to or higher than the given sampling depth constraint ‘D’, ‘n’ bits can be stored and reliably decoded. For example, when 0.001 of the cell population can be sufficiently sampled (D=0.001), the maximum data storage capacity would be 3 bits (n=3) with the current Cas1-Cas2 activity level (X=0.1), as in our current experimental dataset (highlighted in red in the plot). When 0.0001 of the cell population can be sufficiently sampled (D=0.0001), the maximum data storage capacity would be 4 bits (n=4) with the current Cas1-Cas2 activity level (X=0.1). Although the Illumina MiSeq v2 300 cycles kit used in this study can read only up to 5 new spacers, we assumed that sequencing read length is not the limiting factor in this projection, as other long-read sequencing technologies could be employed. b, Estimated total data storage capacity across barcoded cell populations as a function of Cas1-Cas2 activity and the number of parallel channels in the culture platform at two different sampling depths (D=0.001 and D=0.00001). Storing more data per cell population would require more rounds of encoding, which takes longer, and a larger number of parallel channels would require more barcoded cell populations and a more sophisticated design of the culture platform. The current capacity of the system with 24 channels in the culture platform is highlighted in blue in the plot.
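The projection rule in panel a (n bits are decodable as long as X raised to the power n is at least D) can be sketched numerically as follows. This is a minimal sketch under the caption's stated assumptions, not the authors' projection code; the function names are hypothetical, and treating the total capacity in panel b as the per-population capacity multiplied by the number of channels is an assumption that happens to reproduce the 72 bits reported for 24 channels.

```python
def max_bits_per_population(x: float, d: float) -> int:
    """Largest n such that x**n >= d, i.e. the Ln arrays are still sampled.

    x: fraction of cells acquiring a new spacer per round (Cas1-Cas2 activity).
    d: smallest fraction of the population that can be sufficiently sampled.
    """
    if not (0.0 < x < 1.0 and 0.0 < d <= 1.0):
        raise ValueError("expected 0 < x < 1 and 0 < d <= 1")
    n = 0
    # A tiny relative tolerance keeps exact cases such as 0.1**3 vs 0.001
    # from being rejected because of floating-point rounding.
    while x ** (n + 1) >= d * (1.0 - 1e-9):
        n += 1
    return n


def total_capacity_bits(x: float, d: float, channels: int) -> int:
    """Assumed total capacity: bits per population times parallel channels."""
    return max_bits_per_population(x, d) * channels


if __name__ == "__main__":
    print(max_bits_per_population(0.1, 0.001))   # caption example, D=0.001
    print(max_bits_per_population(0.1, 0.0001))  # caption example, D=0.0001
    print(total_capacity_bits(0.1, 0.001, 24))   # current 24-channel platform
```

The loop is used instead of a closed-form logarithm so that boundary cases where X^n equals D exactly are counted as decodable, as the caption specifies.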

Extended Data Fig. 7 Design of 6-bit encoding tables for text messages.

a, Probability of correct classification for each of the 3-bit digital data profiles by the Random Forest classifier on the newly generated independent datasets, calculated based on the result in Fig. 2f. b, DEC and OPT encoding tables with estimated probabilities of correct classification for the 64 characters. The OPT 6-bit encoding table was designed by considering the correct classification probability and the usage frequency of the characters (https://mdickens.me/typing/letter_frequency.html). c, Probability of correct decoding for the 64 characters (ordered by usage) with the DEC and OPT 6-bit encoding tables. d, Comparison of predicted probabilities of correct decoding for various text messages based on the two encoding tables. The predicted probability of correct decoding for each character or text message was calculated by multiplying the correct decoding probabilities of its 3-bit digital data profile units.
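The multiplication rule described for panels c and d can be illustrated with a short sketch. The per-profile probabilities in `P_UNIT` and the character codes in `TOY_TABLE` below are hypothetical placeholders, not the values from panel a or the DEC and OPT tables, and the function names are likewise chosen only for illustration.

```python
from math import prod
from typing import Dict

# Hypothetical per-profile probabilities of correct classification
# (placeholders, NOT the values plotted in panel a).
P_UNIT: Dict[str, float] = {
    "000": 0.95, "001": 0.90, "010": 0.90, "100": 0.90,
    "011": 0.85, "101": 0.85, "110": 0.85, "111": 0.80,
}

# Hypothetical 6-bit codes for a few characters (NOT the DEC or OPT table).
TOY_TABLE: Dict[str, str] = {"a": "000000", "b": "000001", "c": "011110"}


def char_probability(code6: str, p_unit: Dict[str, float]) -> float:
    """Probability that both 3-bit units of a 6-bit code are read correctly."""
    if len(code6) != 6 or set(code6) - {"0", "1"}:
        raise ValueError("expected a 6-character string of 0s and 1s")
    return p_unit[code6[:3]] * p_unit[code6[3:]]


def message_probability(message: str, table: Dict[str, str],
                        p_unit: Dict[str, float]) -> float:
    """Product of the per-character probabilities over the whole message."""
    return prod(char_probability(table[ch], p_unit) for ch in message)


if __name__ == "__main__":
    print(char_probability(TOY_TABLE["c"], P_UNIT))
    print(message_probability("abc", TOY_TABLE, P_UNIT))
```

Under this rule, assigning frequently used characters to codes built from reliably classified 3-bit profiles raises the predicted probability for typical messages, which is the stated idea behind the OPT table.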

Extended Data Fig. 8 Reading ‘hello world!’ from subsampled sequencing reads.

Sequencing reads from each barcode in the ‘hello world!’-encoded cell population using the OPT table were randomly subsampled to various numbers and classifications were performed. Recall accuracies for (a) distinguishing 3-bit digital data profiles for 24 barcoded populations or for (b) calling correct bits out of 72 bits are displayed as a function of the number of expanded arrays with uniquely mapping spacers (grey: all arrays, red: L2/L3 arrays). The number of sequencing reads corresponding to the number of expanded arrays with uniquely mapping spacers (grey: all arrays) is also provided as an additional x-axis. Shaded regions represent 95% confidence interval of 10 iterations of subsampling and classification.

Extended Data Fig. 9 Improving data reconstruction with error correction.

a, By using every sixth bit as a check point (checksum) for the first 5 bits, errors in data reconstruction can be detected and corrected for the selected 32 combinations of 6-bit digital data profiles based on the classifier’s confusion probability in Fig. 2f and Extended Data Fig. 9b. For example, a digital input ‘011110’ could be classified as ‘011110’, ‘011010’, ‘001110’, or ‘001010’ with probabilities of 69%, 14%, 14%, or 3%, respectively. Of these 4 possible initial classifications, the last 3 are wrong, and the 2 wrong classifications with a single-bit error can be detected by the check point values and fixed. However, the classification result with a 2-bit error cannot be detected by the check point value and therefore cannot be fixed. For all 32 combinations of 6-bit digital data profiles, possible classification results, their probabilities, and whether they can be fixed or not are summarized in Supplementary Table 2. b, Confusion probability for each of the 3-bit digital data profiles based on Fig. 2f. c, The check point values for each combination of eight 3-bit and four 2-bit digital data profiles. d, OPT2 encoding table with the estimated probabilities of correct classification for the 32 characters. e, Probability of correct decoding for the 32 characters (ordered by usage) for OPT and OPT2 6-bit encoding tables. f, ‘synbio@cu’ encoded in the genomes of barcoded E. coli populations using the OPT2 error correction strategy. Two errors from the initial classification were detected using the check points and successfully corrected as described in the figure. For classification of each barcoded cell population, an average of 492,289 total sequencing reads with 268,066 reads of expanded arrays (or 106,242 of L2/L3 arrays) that uniquely map spacers were used. Bead-based size enrichment was performed to enrich for expanded arrays and deplete unexpanded arrays. Frequencies of array-types are in log10 scale. All measurements are based on a single experimental study.
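The detection behaviour described in panel a can be illustrated with a simple check bit. The sketch below assumes, purely for illustration, that the sixth bit is an even-parity bit over the first five; this choice is consistent with the ‘011110’ example but is not confirmed by the caption, whose actual check point values are given in panel c and not reproduced here. The sketch covers detection only; the correction step, which relies on the classifier's confusion probabilities, is not modelled, and the function names are hypothetical.

```python
def parity_bit(bits5: str) -> str:
    """Even-parity check bit for five data bits (an assumed scheme)."""
    if len(bits5) != 5 or set(bits5) - {"0", "1"}:
        raise ValueError("expected a 5-character string of 0s and 1s")
    return str(bits5.count("1") % 2)


def passes_check(code6: str) -> bool:
    """True if the sixth bit matches the parity of the first five bits."""
    if len(code6) != 6 or set(code6) - {"0", "1"}:
        raise ValueError("expected a 6-character string of 0s and 1s")
    return parity_bit(code6[:5]) == code6[5]


if __name__ == "__main__":
    # The intended word and the three misclassifications listed in panel a.
    for word in ["011110", "011010", "001110", "001010"]:
        print(word, "passes" if passes_check(word) else "flagged")
```

With this scheme the two single-bit errors are flagged while the 2-bit error passes unnoticed, matching the limitation stated in the caption.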

Extended Data Fig. 10 Data stability in replicating cells.

A mixed pool of 24 barcoded cell populations encoded with the 72-bit text message ‘hello world!’ in Fig. 3 was subsequently diluted 1:100 every 24 hours into 3 mL fresh LB media with antibiotic for a total of 16 days (~106 generations, ~6.6 generations per day). a, Data stability in the propagating cell population over 100 generations. Accuracy indicates the proportion of bits that are correctly classified. >90% of the 72 bits could be correctly retrieved up to ~80 generations. Shaded region represents s.d. of three biological replicates. For classification of each barcoded cell population, an average of 82,860 total sequencing reads with 40,502 reads of expanded arrays (or 17,139 of L2/L3 arrays) that uniquely map spacers were used. Bead-based size enrichment was performed to enrich for expanded arrays and deplete unexpanded arrays. b, Gradual changes in the relative abundance of the 24 barcoded cell populations over time suggest adaptive mutations with fitness effects arising in some of the subpopulations. Samples were collected at the time points indicated by arrows (day 0, 4, 6, 8, 12, and 16). All measurements are based on three biological replicates.
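The generation counts quoted above follow from the dilution factor: if a culture regrows to its previous density after each 1:100 dilution, it undergoes log2(100) doublings per passage. A minimal sketch of that arithmetic, assuming full regrowth each day (the function names are hypothetical):

```python
import math


def generations_per_passage(dilution_factor: float) -> float:
    """Doublings needed to regrow after a 1:dilution_factor dilution."""
    if dilution_factor <= 1:
        raise ValueError("dilution_factor must be greater than 1")
    return math.log2(dilution_factor)


def total_generations(dilution_factor: float, passages: int) -> float:
    """Cumulative doublings over a number of daily passages."""
    return passages * generations_per_passage(dilution_factor)


if __name__ == "__main__":
    print(round(generations_per_passage(100), 1))  # per day, 1:100 dilution
    print(round(total_generations(100, 16)))       # over 16 daily passages
```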

Supplementary information

Supplementary Information

Supplementary Figs. 1–6 and Tables 1–3.

Reporting Summary

Cite this article

Yim, S.S., McBee, R.M., Song, A.M. et al. Robust direct digital-to-biological data storage in living cells. Nat Chem Biol (2021). https://doi.org/10.1038/s41589-020-00711-4
