DNA has been the predominant information storage medium for biology and holds great promise as a next-generation high-density data medium in the digital era. Currently, the vast majority of DNA-based data storage approaches rely on in vitro DNA synthesis. As such, there are limited methods to encode digital data into the chromosomes of living cells in a single step. Here, we describe a new electrogenetic framework for direct storage of digital data in living cells. Using an engineered redox-responsive CRISPR adaptation system, we encoded binary data in 3-bit units into CRISPR arrays of bacterial cells by electrical stimulation. We demonstrate multiplex data encoding into barcoded cell populations to yield meaningful information storage and capacity up to 72 bits, which can be maintained over many generations in natural open environments. This work establishes a direct digital-to-biological data storage framework and advances our capacity for information exchange between silicon- and carbon-based entities.
This is a preview of subscription content, access via your institution
Open Access articles citing this article.
Nature Communications Open Access 03 July 2023
Nature Communications Open Access 23 February 2023
CCF Transactions on High Performance Computing Open Access 18 March 2022
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
$29.99 / 30 days
cancel any time
Subscribe to this journal
Receive 12 print issues and online access
$259.00 per year
only $21.58 per issue
Rent or buy this article
Prices vary by article type
Prices may be subject to local taxes which are calculated during checkout
All of the CRISPR spacer extraction and mapping software can be accessed at https://github.com/ravisheth/trace or are available from the authors upon request.
Church, G. M., Gao, Y. & Kosuri, S. Next-generation digital information storage in DNA. Science 337, 1628 (2012).
Erlich, Y. & Zielinski, D. DNA fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017).
Allentoft, M. E. et al. The half-life of DNA in bone: measuring decay kinetics in 158 dated fossils. Proc. Biol. Sci. 279, 4724–4733 (2012).
Ceze, L., Nivala, J. & Strauss, K. Molecular digital data storage using DNA. Nat. Rev. Genet. 20, 456–466 (2019).
Newman, S. et al. High density DNA data storage library via dehydration with digital microfluidic retrieval. Nat. Commun. 10, 1706 (2019).
Organick, L. et al. Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242–248 (2018).
Anavy, L., Vaknin, I., Atar, O., Amit, R. & Yakhini, Z. Data storage in DNA with fewer synthesis cycles using composite DNA letters. Nat. Biotechnol. 37, 1229–1236 (2019).
Lee, H. H., Kalhor, R., Goela, N., Bolot, J. & Church, G. M. Terminator-free template-independent enzymatic DNA synthesis for digital information storage. Nat. Commun. 10, 2383 (2019).
Farzadfard, F. & Lu, T. K. Emerging applications for DNA writers and molecular recorders. Science 361, 870–875 (2018).
Sheth, R. U. & Wang, H. H. DNA-based memory devices for recording cellular events. Nat. Rev. Genet. 19, 718–732 (2018).
Chan, M. M. et al. Molecular recording of mammalian embryogenesis. Nature 570, 77–82 (2019).
Kalhor, R. et al. Developmental barcoding of whole mouse via homing CRISPR. Science 361, eaat9804 (2018).
McKenna, A. et al. Whole-organism lineage tracing by combinatorial and cumulative genome editing. Science 353, aaf7907 (2016).
Munck, C., Sheth, R. U., Freedberg, D. E. & Wang, H. H. Recording mobile DNA in the gut microbiota using an Escherichia coli CRISPR-Cas spacer acquisition platform. Nat. Commun. 11, 95 (2020).
Farzadfard, F. & Lu, T. K. Genomically encoded analog memory with precise in vivo DNA writing in living cell populations. Science 346, 1256272 (2014).
Loveless, T. B. et al. DNA writing at a single genomic site enables lineage tracing and analog recording in mammalian cells. Preprint at bioRxiv https://doi.org/10.1101/639120 (2019).
Schmidt, F., Cherepkova, M. Y. & Platt, R. J. Transcriptional recording by CRISPR spacer acquisition from RNA. Nature 562, 380–385 (2018).
Tang, W. & Liu, D. R. Rewritable multi-event analog recording in bacterial and mammalian cells. Science 360, eaap8992 (2018).
Yang, L. et al. Permanent genetic memory with >1-byte capacity. Nat. Methods 11, 1261–1266 (2014).
Mimee, M., Tucker, A. C., Voigt, C. A. & Lu, T. K. Programming a human commensal bacterium, Bacteroides thetaiotaomicron, to sense and respond to stimuli in the murine gut microbiota. Cell Syst. 1, 62–71 (2015).
Riglar, D. T. et al. Engineered bacteria can function in the mammalian gut long-term as live diagnostics of inflammation. Nat. Biotechnol. 35, 653–658 (2017).
Sheth, R. U., Yim, S. S., Wu, F. L. & Wang, H. H. Multiplex recording of cellular events over time on CRISPR biological tape. Science 358, 1457–1461 (2017).
Roquet, N., Soleimany, A. P., Ferris, A. C., Aaronson, S. & Lu, T. K. Synthetic recombinase-based state machines in living cells. Science 353, aad8559 (2016).
Farzadfard, F. et al. Single-nucleotide-resolution computing and memory in living cells. Mol. Cell 75, 769–780 (2019).
Akhmetov, A., Ellington, A. D. & Marcotte, E. M. A highly parallel strategy for storage of digital information in living cells. BMC Biotechnol. 18, 64 (2018).
Shipman, S. L., Nivala, J., Macklis, J. D. & Church, G. M. CRISPR-Cas encoding of a digital movie into the genomes of a population of living bacteria. Nature 547, 345–349 (2017).
Liu, Y. et al. Connecting biology to electronics: molecular communication via redox modality. Adv. Healthc. Mater. 6 (2017); https://doi.org/10.1002/adhm.201700789
Mimee, M. et al. An ingestible bacterial-electronic system to monitor gastrointestinal health. Science 360, 915–918 (2018).
Weber, W. et al. A synthetic mammalian electro-genetic transcription circuit. Nucleic Acids Res. 37, e33 (2009).
Bellin, D. L. et al. Electrochemical camera chip for simultaneous imaging of multiple metabolites in biofilms. Nat. Commun. 7, 10535 (2016).
VanArsdale, E. et al. A co-culture based tyrosine-tyrosinase electrochemical gene circuit for connecting cellular communication with electronic networks. ACS Synth. Biol 9, 1117–1128 (2020).
Gordonov, T. et al. Electronic modulation of biochemical signal generation. Nat. Nanotechnol. 9, 605–610 (2014).
Tschirhart, T. et al. Electronic control of gene expression and cell behaviour in Escherichia coli through redox signalling. Nat. Commun. 8, 14030 (2017).
Bhokisham, N. et al. A redox-based electrogenetic CRISPR system to connect with and control biological information networks. Nat. Commun. 11, 2427 (2020).
Nunez, J. K., Lee, A. S., Engelman, A. & Doudna, J. A. Integrase-mediated spacer acquisition during CRISPR-Cas adaptive immunity. Nature 519, 193–198 (2015).
Michel, J. B. et al. Quantitative analysis of culture using millions of digitized books. Science 331, 176–182 (2011).
Nivala, J., Shipman, S. L. & Church, G. M. Spontaneous CRISPR loci generation in vivo by non-canonical spacer integration. Nat. Microbiol. 3, 310–318 (2018).
Din, M. O., Martin, A., Razinkov, I., Csicsery, N. & Hasty, J. Interfacing gene circuits with microelectronics through engineered population dynamics. Sci. Adv. 6, eaaz8344 (2020).
Fernandez-Rodriguez, J., Moser, F., Song, M. & Voigt, C. A. Engineering RGB color vision into Escherichia coli. Nat. Chem. Biol. 13, 706–708 (2017).
Piraner, D. I., Abedi, M. H., Moser, B. A., Lee-Gosselin, A. & Shapiro, M. G. Tunable thermal bioswitches for in vivo control of microbial therapeutics. Nat. Chem. Biol. 13, 75–80 (2017).
Makarova, K. S. et al. Evolutionary classification of CRISPR-Cas systems: a burst of class 2 and derived variants. Nat. Rev. Microbiol. 18, 67–83 (2020).
Heler, R. et al. Mutations in Cas9 enhance the rate of acquisition of viral spacer sequences during the CRISPR-Cas immune response. Mol. Cell 65, 168–175 (2017).
Shipman, S. L., Nivala, J., Macklis, J. D. & Church, G. M. Molecular recordings by directed CRISPR spacer acquisition. Science 353, aaf1175 (2016).
Wright, A. V. et al. A functional mini-integrase in a two-protein type V-C CRISPR system. Mol. Cell 73, 727–737 (2019).
Blazejewski, T., Ho, H.-I. & Wang, H. H. Synthetic sequence entanglement augments stability and containment of genetic information in cells. Science 365, 595–598 (2019).
Deatherage, D. E., Leon, D., Rodriguez, A. E., Omar, S. K. & Barrick, J. E. Directed evolution of Escherichia coli with lower-than-natural plasmid mutation rates. Nucleic Acids Res. 46, 9236–9250 (2018).
Cox, J. P. Long-term data storage in DNA. Trends Biotechnol. 19, 247–250 (2001).
Li, F. et al. Modular engineering to increase intracellular NAD (H/+) promotes rate of extracellular electron transfer of Shewanella oneidensis. Nat. Commun. 9, 3637 (2018).
Lee, H. H. et al. Functional genomics of the rapidly replicating bacterium Vibrio natriegens by CRISPRi. Nat. Microbiol. 4, 1105–1113 (2019).
Davis, J. et al. In vivo multi-dimensional information-keeping in Halobacterium salinarum. Preprint at bioRxiv https://doi.org/10.1101/2020.02.14.949925 (2020).
Sharan, S. K., Thomason, L. C., Kuznetsov, S. G. & Court, D. L. Recombineering: a homologous recombination-based method of genetic engineering. Nat. Protoc. 4, 206–223 (2009).
St-Pierre, F. et al. One-step cloning and chromosomal integration of DNA. ACS Synth. Biol. 2, 537–541 (2013).
Ji, B. W. et al. Quantifying spatiotemporal variability and noise in absolute microbiota abundances using replicate sampling. Nat. Methods 16, 731–736 (2019).
Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
Wang, Q., Garrity, G. M., Tiedje, J. M. & Cole, J. R. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ. Microbiol. 73, 5261–5267 (2007).
Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res. 47, W256–W259 (2019).
We thank A. Kaufman and members of the Wang laboratory for advice and comments on the manuscript. H.H.W. acknowledges funding support from ONR (N00014-17-1-2353), NSF (MCB‐1453219), NIH (1R01AI132403-01) and the Burroughs Wellcome Fund (PATH1016691). S.S.Y. is grateful for support from the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Education (NRF-2017R1A6A3A03003401). R.U.S. was supported by a Fannie and John Hertz Foundation Fellowship and an NSF Graduate Research Fellowship (DGE-1644869).
H.H.W. is a scientific advisor to SNIPR Biome. The other authors declare no competing interests.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended Data Fig. 1 Development of a redox-sensing DNA-based cellular recorder for direct digital-to-biological data storage.
This system is composed of two distinct modules: (i) a ‘sensing module’ that converts a desired biological signal into a change in copy number of a trigger plasmid (pTrig), and (ii) a ‘writing module’ that overexpresses Cas1-Cas2 from a recording plasmid (pRec) to unidirectionally expand genomic CRISPR arrays with novel ~33 bp spacers acquired from genomic or plasmid DNA sources in the cell. In the presence of the desired signal, cells experience a shift in their intracellular DNA pool, driven by an increase in pTrig copy number, which results in an acquisition bias for pTrig-derived spacers amongst expanding CRISPR arrays. a, The lacI gene in the previous pRec22 was replaced with soxR gene from E. coli, and the lac promoter in the previous pTrig22 was replaced with soxS promoter from E. coli. P1 replication system is inactive in the absence of oxidative stress, and a mini-F origin keeps the pTrig plasmid copy number low. Upon induction with oxidative stress, SoxR detaches from soxS promoter and activates the P1 replication system to increase the copy number of the plasmid. b, pTrig copy number in the presence of various concentrations of phenazine methosulfate (PMS) in aerobic condition. pRec (with an additional copy of soxR gene) helps get higher fold-change of pTrig copy number by more efficient repression in absence of the inducer. c, pTrig copy numbers in the presence of pRec and various concentrations of PMS, and FCN(R) or FCN(O) in anaerobic condition. Fold change of the pTrig copy numbers at the given concentrations of FCN(R) or FCN(O) were plotted. d, Various aTc concentrations and (e) induction time for the expression of cas1 and cas2 genes were tested for CRISPR array expansion. f, Various FCN(R) and FCN(O) concentrations were tested for pTrig copy number induction and (g) pTrig-derived spacer incorporation. The proportions of pTrig-derived spacers among all newly incorporated spacers are displayed. All measurements are based on three biological replicates. Error bars represent s.d. of three biological replicates.
a, In an anaerobic chamber, a Raspberry Pi controls 3 of 8-channel relay modules (total 24 relays), which turn on or off electrical signals into each chamber pair from a power supply, based on a python script running on a wirelessly connected PC. b, A pair of working and counter chambers is connected by an agar salt bridge. In a working chamber, cells are incubated in M9 minimal medium supplemented with antibiotics, aTc, FCN(R) and PMS. M9 minimal medium supplemented with FCN(O) and PMS is filled in another chamber (counter). c, A photograph of the multi-channel electrochemical redox controller in an anaerobic chamber. d, Changes in electrochemical redox states of FCN(R) in a working chamber (left) and FCN(O) in a counter chamber (right) measured by absorbance at 420 nm with (0.5 V) and without (0.0 V) electronic signals. All measurements are based on three replicates. Error bars represent s.d. of three replicates.
a, Schematic diagram of experimental steps for multi-round encoding. After each round of electrical stimulation, the cell population was recovered in the rich medium (LB) aerobically so that the induced/uninduced plasmid copy number in the previous encoding round can be diluted out and reset low. b, To determine the recovery condition, anaerobic and aerobic conditions were compared. c, Overlaid distributions of the plasmid copy numbers with/without signals at each round over the course of the multi-round encoding (Fig. 2b). d, CRISPR array expansion over the course of the experiment. e, The 3-bit binary data profiles are grouped by the number of electronic signals, and the proportions of pTrig-derived spacers among all newly incorporated spacers are displayed. f, To enrich the sequencing reads for expanded arrays with more new spacers (longer arrays), the magnetic bead-based size enrichment was performed. Frequency of arrays of different lengths (unexpanded and L1-L4) with and without size enrichment are plotted. g, Principal component analysis on the array-type frequency profiles for the 3-bit digital data profiles. All 9 independent biological replicates are shown for each 3-bit digital data profiles. The first three independent datasets used for training of the Random Forest classifier are highlighted. All measurements are based on two or more biological replicates. Error bars represent s.d. of three or more biological replicates.
a, Confusion matrix from cross validation of the Random Forest classifier for 10 times by training on randomly selected 2 datasets for each 3-bit digital data profile from the 3 independent experiments and testing the trained model on the left-out 1 dataset. b, Importance of features (array-types) for the Random Forest classifier in Fig. 2f. c, Classification performance for the number of CRISPR arrays. CRISPR arrays with new uniquely mapping spacers were randomly subsampled to the various numbers for the 3-bit digital data profiles and classifications were performed. Recall accuracies for distinguishing 8 different types of 3-bit digital data profiles were displayed as a function of the number of expanded arrays with uniquely mapping spacers (grey: all arrays, red: L2/L3 arrays). The number of sequencing reads corresponding to the number of expanded arrays with uniquely mapping spacers (grey: all arrays) is also provided as an additional x-axis. Shaded regions represent 95% confidence interval of 10 iterations of subsampling and classification. d, Recall accuracies for distinguishing 8 different types of 3-bit digital data profiles with varying proportions of randomly selected training datasets for each 3-bit digital data profile. Shaded regions represent 95% confidence interval of 100 iterations of subsampling and classification.
a, CRISPR arrays can be barcoded with 8-bp unique sequences either downstream of the 1st spacer region or within direct repeat (DR) region. b, CRISPR array expansion rates (relative to wild-type array) of 69 DR-barcoded CRISPR arrays and 24 spacer-barcoded CRISPR arrays. c, Distribution of array expansion rates of spacer-barcoded CRISPR arrays is much more uniform and consistent than that of DR-barcoded CRISPR arrays. A DR variant (d1) that was more efficient than the wild-type DR sequence in the initial 96-well plate-based test is highlighted. d, The d1 DR variant was tested again in tube culture condition. In tube culture condition, however, the DR variant did not show significantly higher activity than that of the wild-type DR sequence. e, Comparison of CRISPR array expansion rates measured individually or in pool. Shaded region represents 95% confidence interval for linear regression (dashed grey line). Sample sizes (n) and Person correlation coefficient (r) are shown. All measurements are based on three biological replicates. Error bars represent s.d. of three biological replicates.
a, Data storage capacity (‘n’ bits of information or ‘n’ rounds of encoding) per cell population is estimated as a function of Cas1-Cas2 activity (‘X’ proportion of the cell population expanded arrays with a new spacer after a single round of encoding). Here, ‘Xn’ proportion of the cell population would have expanded arrays every round resulting ‘n’ new spacers (Ln arrays) after ‘n’ rounds of encoding, and we assumed that the sampling capacity for the Ln array population governs the data storage capacity. We considered various sampling depths ‘D’, where ‘D’ proportion of the cell population can be sufficiently sampled. This ‘D’ could be affected by many factors including the sequencing depth and size enrichment efficiency. We assumed that if the ‘Xn’ is same or higher than the given sampling depth constraint ‘D’, ‘n’ bits can be stored and reliably decoded. For example, when 0.001 of the cell population can be sufficiently sampled (D=0.001), maximum data storage capacity would be 3 bits (n=3) with the current Cas1-Cas2 activity level (X=0.1) as in our current experimental dataset (highlighted in red in the plot). And when 0.0001 of the cell population can be sufficiently sampled (D=0.0001), maximum data storage capacity would be 4 bits (n=4) with the current Cas1-Cas2 activity level (X=0.1). Although the Illumina MiSeq v2 300 cycles kit used in this study can read only up to 5 new spacers, we assumed that sequencing read length is not the limiting factor in this projection as other long read sequencing technologies could be employed. b, Estimated total data storage capacity across barcoded cell populations as a function of Cas1-Cas2 activity and the number of parallel channels in the culture platform at two different sampling depths (D=0.001 and D=0.00001). A larger data per cell population would require more rounds of encoding which takes longer time, and a larger number of parallel channels would require more barcoded cell populations and more sophisticated design of the culture platform. Current capacity of the system with 24 channels in the culture platform is highlighted in blue in the plot.
a, Probability of correct classification for each of the 3-bit digital data profiles by the Random Forest classifier on the newly generated independent datasets is calculated based on the result in Fig. 2f. b, DEC and OPT encoding tables with estimated probabilities of correct classification for the 64 characters. OPT 6-bit encoding table was designed by considering the correct classification probability and the usage frequency of the characters (https://mdickens.me/typing/letter_frequency.html). c, Probability of correct decoding for the 64 characters (ordered by usage) with DEC and OPT 6-bit encoding tables. d, Comparison of predicted probabilities of correct decoding for various text messages based on the two encoding tables. The predicted probabilities of correct decoding for each character or text message were calculated by multiplying the correct decoding probability values of each 3-bit digital data profile units.
Sequencing reads from each barcode in the ‘hello world!’-encoded cell population using OPT table were randomly subsampled to the various numbers and classifications were performed. Recall accuracies for (a) distinguishing 3-bit digital data profiles for 24 barcoded populations or for (b) calling correct bits out of 72 bits were displayed as a function of the number of expanded arrays with uniquely mapping spacers (grey: all arrays, red: L2/L3 arrays). The number of sequencing reads corresponding to the number of expanded arrays with uniquely mapping spacers (grey: all arrays) is also provided as an additional x-axis. Shaded regions represent 95% confidence interval of 10 iterations of subsampling and classification.
a, By using every sixth bit as a check point (checksum) for the first 5 bits, errors in data reconstruction can be detected and corrected for the selected 32 combinations of 6-bit digital data profiles based on the classifier’s confusion probability in Fig. 2f and Extended Data Fig. 9b. For example, for a digital input ‘011110’ could be classified as ‘011110’, ‘011010’, ‘001110’, or ‘001010’ with the probabilities of 69%, 14%, 14%, or 3%, respectively. Out of these 4 possible initial classifications, the last 3 are wrong and the 2 wrong classifications with a single bit error can be detected by the check point values and fixed. However, the classification result with 2 bits error cannot be detected by the check point value and therefore cannot be fixed. For all 32 combinations of 6-bit digital data profiles, possible classification results, their probabilities, and whether they can be fixed or not are summarized in Supplementary Table 2. b, Confusion probability for each of the 3-bit digital data profiles based on Fig. 2f. c, The check point values for each combination of eight 3-bit and four 2-bit digital data profiles. d, OPT2 encoding table with the estimated probabilities of correct classification for the 32 characters. e, Probability of correct decoding for the 32 characters (ordered by usage) for OPT and OPT2 6-bit encoding tables. f, ‘synbio@cu’ encoded in the genomes of barcoded E. coli populations using the OPT2 error correction strategy. Two errors from the initial classification were detected using the check points and successfully corrected as described in the figure. For classification of each barcoded cell population, an average of 492,289 total sequencing reads with 268,066 reads of expanded arrays (or 106,242 of L2/L3 arrays) that uniquely map spacers were used. Bead-based size enrichment was performed to enrich for expanded arrays and deplete unexpanded arrays. Frequencies of array-types are in log10 scale. All measurements are based on a single experimental study.
A mixed pool of 24 barcoded cell population encoded with a 72-bit text message ‘hello world!’ in Fig. 3 was subsequently diluted 1:100 every 24 hours into 3 mL fresh LB media with antibiotic for a total of 16 days (~106 generation, ~6.6 generations per day). a, Data stability in the propagating cell population over 100 generations. Accuracy indicates the proportion of bits that are correctly classified. >90% of the 72 bits could be correctly retrieved up to ~80 generations. Shaded region represents s.d. of three biological replicates. For classification of each barcoded cell population, an average of 82,860 of total sequencing reads with 40,502 reads of expanded arrays (or 17,139 of L2/L3 arrays) that uniquely map spacers were used. Bead-based size enrichment was performed to enrich for expanded arrays and deplete unexpanded arrays. b, Gradual changes in the relative abundance of 24 barcoded cell population over time suggests adaptive mutations with fitness effects arising in some of the subpopulation. Samples were collected at the time points indicated by arrows (day 0, 4, 6, 8, 12, and 16). All measurements are based on three biological replicates.
About this article
Cite this article
Yim, S.S., McBee, R.M., Song, A.M. et al. Robust direct digital-to-biological data storage in living cells. Nat Chem Biol 17, 246–253 (2021). https://doi.org/10.1038/s41589-020-00711-4
This article is cited by
Nature Protocols (2023)
Nature Communications (2023)
Nature Communications (2023)
Nature Communications (2022)