Abstract
We describe the first DNA-based storage architecture that enables random access to data blocks and rewriting of information stored at arbitrary locations within the blocks. The newly developed architecture overcomes drawbacks of existing read-only methods that require decoding the whole file in order to read one data fragment. Our system is based on new constrained coding techniques and accompanying DNA editing methods that ensure data reliability and specificity and sensitivity of access, while at the same time providing exceptionally high data storage capacity. As a proof of concept, we encoded parts of the Wikipedia pages of six universities in the USA, and selected and edited parts of the text written in DNA corresponding to three of these schools. The results suggest that DNA is a versatile medium suitable for both ultrahigh-density archival and rewritable storage applications.
Introduction
Addressing the emerging demands for massive data repositories, and building upon the rapid development of technologies for DNA synthesis and sequencing, a number of laboratories have recently outlined architectures for archival DNA-based storage^{1,2,3,4,5}. The architecture in^{3} achieved a storage density of 87.5 TB/gram, while the system described in^{4} raised the density to 2.2 PB/gram. The success of the latter method may be largely attributed to three classical coding schemes: Huffman coding, differential coding, and single parity-check coding^{4}. Huffman coding was used for data compression, while differential coding was used for eliminating homopolymers (i.e., repeated consecutive bases) in the DNA strings. Parity-checks were used to add controlled redundancy, which in conjunction with fourfold coverage allows for mitigating assembly errors.
Due to the dynamics of biotechnological systems, none of the three coding schemes represents a suitable solution from the perspective of current DNA sequencer designs: Huffman codes are fixed-to-variable-length compressors that can lead to catastrophic error propagation in the presence of sequencing noise, and the same is true of differential codes. Homopolymers do not represent a significant source of errors in Illumina sequencing platforms^{6}, while single-parity redundancy or Reed-Solomon (RS) codes and differential encoding are inadequate for combating error-inducing sequence patterns such as long substrings with high GC content^{6}. As a result, assembly errors are likely, and were indeed observed during the readout process described in^{4}.
An even more important issue that prohibits the practical widespread use of the schemes described in^{3,4} is that accurate partial and random access to data is impossible, as one has to reconstruct the whole text in order to read or retrieve the information encoded even in a few bases. Furthermore, all current designs support read-only storage. The first limitation represents a significant drawback, as one usually needs to accommodate access to specific data sections; the second limitation prevents the use of current DNA storage methods in architectures that call for moderate data editing, for storing frequently updated information and for memorizing the history of edits. Moving from a read-only to a rewritable DNA storage system requires a major implementation paradigm shift, as:
Editing in the compressive domain may require rewriting almost the whole information content;
Rewriting is complicated by the current DNA data storage format, which involves reads of length 100 bps shifted by 25 bps so as to ensure fourfold coverage of the sequence (see Fig. 1(a) for an illustration and description of the data format used in^{4}). In order to rewrite one base, one needs to selectively access and modify four “consecutive” reads;
Addressing methods used in^{3,4} only allow for determining the position of a read in a file, but cannot ensure precise selection of reads of interest, as undesired cross-hybridization between the primers and parts of the information blocks may occur.
To overcome the aforementioned issues, we developed a new, random-access and rewritable DNA-based storage architecture based on DNA sequences endowed with specialized address strings that may be used for selective information access and encoding with inherent error-correction capabilities. The addresses are designed to be mutually uncorrelated and to satisfy the error-control running digital sum constraint^{7,8}. Given the address sequences, encoding is performed by stringing together properly terminated prefixes of the addresses, as dictated by the information sequence. This encoding method represents a special form of prefix-synchronized coding^{9}. Given that the addresses are chosen to be uncorrelated and at large Hamming distance from each other, it is highly unlikely for one address to be confused with another address or with another section of the encoded blocks. Furthermore, selection of the blocks to be rewritten is made possible by the prefix encoding format, while rewriting is performed via two DNA editing techniques, the gBlock and OE-PCR (overlap extension PCR) methods^{10,11}. With the latter method, rewriting is done in several steps using short and inexpensive primers; the former method is more efficient, but requires synthesizing longer and hence more expensive primers. Both methods were tested on DNA-encoded Wikipedia entries of size 17 KB, corresponding to six universities, where information in one, two and three blocks was rewritten in the DNA-encoded domain. The rewritten blocks were selected, amplified and Sanger-sequenced^{12} to verify that selection and rewriting were performed with 100% accuracy.
Results
The main feature of our storage architecture that enables highly sensitive random access and accurate rewriting is addressing. The rationale behind the proposed approach is that each block in a random-access system must be equipped with an address that allows for unique selection and amplification via DNA sequence primers.
Instead of storing blocks mimicking the structure and length of reads generated during high-throughput sequencing, we synthesized blocks of length 1000 bps tagged at both ends by specially designed address sequences. Adding addresses to short blocks of length 100 bps would incur a large storage overhead, while synthesizing blocks longer than 1000 bps using current technologies is prohibitively costly.
More precisely, each data block of length 1000 bps is flanked at both ends by two unique, yet different, address blocks of length 20 bps each. These addresses are used to provide specificity of access (see Fig. 1(b) and the Supplementary Information for details). Note that different flanking addresses simplify the process of sequence synthesis. The remaining 960 bases in a block are divided into 12 sub-blocks of length 80 bps, with each sub-block encoding six words of the text. The “word-encoding” process may be seen as a specialized compaction scheme suitable for rewriting, and it operates as follows. First, the distinct words in the text are counted and tabulated in a dictionary. Each word in the dictionary is converted into a binary sequence of length sufficiently long to allow for encoding of the whole dictionary; for our current implementation and texts of choice, described in the Supplementary Information, this length was set to 21. Encodings of six consecutive words are subsequently grouped into binary sequences of length 126. These binary sequences are then translated into DNA blocks of length 80 bps using a new family of DNA prefix-synchronized codes described in the Methods section. Our choice of the number of jointly encoded words is governed by the goal of making rewrites as straightforward as possible and avoiding error propagation due to variable code lengths. Furthermore, as most rewrites involve words rather than individual symbols, the word-encoding method represents an efficient means for content update. Details regarding the counting and grouping procedure may be found in the Supplementary Information.
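As a rough illustration of the word-encoding pipeline described above, the following sketch maps six dictionary words to a 126-bit string. The dictionary, helper names, and sample words below are hypothetical; the actual tables appear in the Supplementary Information.

```python
# Sketch of the word-encoding step: six dictionary words -> six 21-bit
# indices -> one 126-bit string, which a DNA code then maps to 80 bps.

def words_to_bits(words, dictionary, bits_per_word=21):
    """Concatenate fixed-length binary indices of six consecutive words."""
    assert len(words) == 6
    out = ""
    for w in words:
        idx = dictionary[w]                       # dictionary: word -> integer index
        assert idx < 2 ** bits_per_word
        out += format(idx, f"0{bits_per_word}b")  # zero-padded, fixed length
    return out                                    # 6 * 21 = 126 bits

# Hypothetical six-word dictionary for illustration:
dictionary = {w: i for i, w in enumerate(
    ["the", "university", "of", "illinois", "was", "founded"])}
bits = words_to_bits(["the", "university", "of", "illinois", "was", "founded"],
                     dictionary)
print(len(bits))  # 126
```

Fixed-length word indices are what make in-place rewrites simple: replacing one word changes exactly one 21-bit field, with no ripple effects on neighboring encodings.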
For three selected access queries, the 1000 bps blocks containing the desired information were identified via primers corresponding to their unique addresses, PCR-amplified, Sanger-sequenced, and subsequently decoded.
Two methods were used for content rewriting. If the region to be rewritten exceeded several hundred base pairs in length, new sequences with unique primers were synthesized, as this represents a less costly alternative to rewriting. When a relatively short substring of the encoded string had to be modified, the corresponding 1000 bps block hosting the string was identified via its address and amplified, and the changes were introduced via DNA editing.
Both the random access and rewriting protocols were tested experimentally on two jointly stored text files. One text file, of size 4 KB, contained the history of the University of Illinois, Urbana-Champaign (UIUC) based on its Wikipedia entry retrieved on 12/15/2013. The other text file, of size 13 KB, contained the introductory Wikipedia entries of Berkeley, Harvard, MIT, Princeton, and Stanford, retrieved on 04/27/2014.
Encoded information was converted into DNA blocks of length 1000 bps synthesized by IDT (Integrated DNA Technologies), at a cost of $149 per 1000 bps (see http://www.idtdna.com/pages/products/genes/gblocksgenefragments). The rewriting experiments encompassed:
PCR selection and amplification of one 1000 bps sequence and simultaneous selection and amplification of three 1000 bps sequences in the pool. All 32 linear 1000 bps fragments were stored in a mixed form, and the mixture was used as a template for PCR amplification and selection. The results of amplification were verified by confirming sequence lengths of 1000 bps via gel electrophoresis (Fig. 2(a)) and by randomly sampling 3–5 sequences from the pools and Sanger sequencing them (Fig. 2(b)).
Experimental content rewriting via synthesis of edits located at various positions in the 1000 bps blocks. For simplicity of notation, we refer to the blocks in the pool on which we performed selection and editing as B1, B2, and B3. Two primers were synthesized for each rewrite in the blocks, one for the forward and one for the reverse direction. In addition, two different editing/mutation techniques were used, gBlock and OE-PCR. gBlocks are double-stranded genomic fragments used as primers or for the purpose of genome editing, while OE-PCR is a variant of PCR used for specific DNA sequence editing via point mutations or splicing. To demonstrate the feasibility of a cost-efficient method for editing, OE-PCR was implemented with general primers (≤60 bps) only. Note that for edits shorter than 40 bps, the mutation sequences were designed as overhangs in primers. The three PCR products were then used as templates for the final PCR reaction involving the entire 1000 bps rewrite. Figure 1(c) illustrates the described rewriting process, and a summary of the experiments performed is provided in Table 1.
Given that each base pair has weight roughly equal to 650 daltons (650 × 1.67 × 10^{−24} grams), and given that 27,000 + 5,000 = 32,000 bps were needed to encode a file of size 13 + 4 = 17 KB in ASCII format, we estimate a potential storage density of 4.9 × 10^{20} B/g for our scheme. This density significantly surpasses the current state-of-the-art storage density of 2.2 × 10^{15} B/g, as we avoid costly multiple coverage and use larger block lengths and specialized word-encoding schemes of large rate. A performance comparison of the three currently known DNA-based storage media is given in Table 2. We observe that the cost of sequence synthesis in our storage model is clearly significantly higher than the corresponding cost of the prototype in^{4}, as blocks of length 1000 bps are still difficult to synthesize. This trend is likely to change dramatically in the near future: within the last seven months, the cost of synthesizing 1000 bps blocks dropped almost 7-fold. Despite its high cost, our system offers exceptionally large storage density and, for the first time, enables random access and content rewriting. Furthermore, although we used Sanger sequencing for our small-scale experiment, Next Generation Sequencing (NGS) technologies will enable significant reductions in readout costs for large-scale storage projects.
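The density estimate above can be verified with simple arithmetic; the script below uses only the figures quoted in the text (650 Da per base pair, 32,000 bps encoding 17 KB).

```python
# Back-of-the-envelope check of the quoted storage density.
DALTON_G = 1.66054e-24        # grams per dalton
bp_weight_g = 650 * DALTON_G  # ~1.08e-21 g per base pair

total_bp = 27_000 + 5_000     # 32,000 bps synthesized
payload_bytes = 17 * 1000     # 17 KB of ASCII text

density = payload_bytes / (total_bp * bp_weight_g)  # bytes per gram
print(f"{density:.1e} B/g")   # ~4.9e+20 B/g, matching the estimate above
```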
Methods
Address Design and Encoding
To encode information on DNA media, we employed a two-step procedure. First, we designed short address sequences which satisfy a number of constraints that make them suitable for highly selective random access^{13}. Constrained coding ensures that DNA patterns prone to sequencing errors are avoided and that DNA blocks are accurately accessed, amplified and selected without perturbing or accidentally selecting other blocks in the DNA pool. The coding constraints apply to address primer design, but also indirectly govern the properties of the fully encoded DNA information blocks. The design procedure used is semi-analytical, insofar as it combines combinatorial methods with limited computer search techniques. A unifying and more technical coding-theoretic treatment will be reported elsewhere.
We required the address sequences to satisfy the following constraints:
(C1) Constant GC content (close to 50%) of all sufficiently long prefixes of the addresses. DNA strands with 50% GC content are more stable than DNA strands with lower or higher GC content and have better coverage during sequencing. Since encoding of user information is accomplished via prefix-synchronization, it is important to impose the GC content constraint on the addresses as well as on their prefixes, as the latter requirement ensures that all fragments of the encoded data blocks have balanced GC content.
(C2) Large mutual Hamming distance, as it reduces the probability of erroneous address selection. Recall that the Hamming distance between two strings of equal length equals the number of positions at which the corresponding symbols disagree. An appropriate choice for the minimum Hamming distance is half of the address sequence length (10 bps in our current implementation, which uses length-20 address primers). It is worth pointing out that rather than the Hamming distance, one could use the Levenshtein (edit) distance, which captures the smallest number of deletions, insertions and substitutions needed to convert one string into another. Unfortunately, many address design problems become hard to analyze under this distance measure, and are hence not addressed in this manuscript.
(C3) Uncorrelatedness of the addresses, which imposes the restriction that prefixes of one address do not appear as suffixes of the same or another address, and vice versa. The motivation for this new constraint comes from the fact that addresses are used to provide unique identities for the blocks, so that their substrings should not appear in “similar form” within other addresses. Here, “similarity” is assessed in terms of hybridization affinity. Furthermore, long undesired prefix-suffix matches may lead to read assembly errors in blocks during joint information retrieval and sequencing.
(C4) Absence of secondary (folding) structures, as such structures may cause errors in the process of PCR amplification and fragment rewriting.
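As an illustration, candidates can be screened against C1 and C2 with a few lines of code. The prefix-length threshold and tolerance below are illustrative parameters, not the paper's exact values; the first address is the example address used later in the text, and the second candidate is made up.

```python
def gc_balanced_prefixes(seq: str, min_prefix: int = 8, tol: float = 0.2) -> bool:
    """C1 sketch: every prefix of length >= min_prefix has a GC fraction
    within tol of 0.5."""
    for l in range(min_prefix, len(seq) + 1):
        gc = sum(c in "GC" for c in seq[:l]) / l
        if abs(gc - 0.5) > tol:
            return False
    return True

def hamming(x: str, y: str) -> int:
    """C2: number of positions at which two equal-length strings disagree."""
    return sum(a != b for a, b in zip(x, y))

a1 = "AGTAAGTCTCGCAGTCATCG"   # example address from the text
a2 = "CTGACCATGGTGTCAGCGAT"   # made-up second candidate
print(gc_balanced_prefixes(a1), hamming(a1, a2) >= 10)  # True True
```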
Addresses satisfying constraints C1–C2 may be constructed via error-correcting codes with small running digital sum^{7} adapted for the new storage system. Properties of these codes are discussed in the section Balanced Codes and Running Digital Sums. The notion of mutually uncorrelated sequences is described in the section Sequence Correlation; it was studied in an unrelated context under the name cross-bifix-free coding. We also introduce a new and more suitable version of cross-bifix-free codes, termed weakly mutually uncorrelated sequences. Constructing addresses that simultaneously satisfy the constraints C1–C4, and determining bounds on the largest number of such sequences, is prohibitively complex^{14,15}. To mitigate this problem, we resort to a semi-constructive address design approach, in which balanced error-correcting codes are designed independently and subsequently expurgated so as to identify a large set of mutually uncorrelated sequences. The resulting sequences are then tested for secondary structure using mfold and Vienna^{16}. In an upcoming paper, we show that the number of sequences simultaneously satisfying C1–C3 is exponentially large. We conjecture that the number of sequences satisfying all constraints, C1–C4, also grows exponentially with their length.
Given two uncorrelated sequences as flanking addresses of one block, one of the sequences is selected to encode user information via a new implementation of prefix-synchronized encoding^{16,17}, described in the section Prefix-Synchronized DNA Codes. The asymptotic rate of an optimal single-sequence prefix-synchronized code is one. Hence, there is no asymptotic coding loss for avoiding prefixes of one sequence; we only observe a minor coding loss for each finite-length block. For multiple sequences of arbitrary structure, the problem of determining the optimal code rate is significantly more complicated, and the rates have to be evaluated numerically by solving systems of linear equations^{17}, as described in that section and in the Supplementary Information. This system of equations leads to a particularly simple form for the generating function of mutually uncorrelated sequences.
Balanced Codes and Running Digital Sums
One criterion for selecting block addresses is to ensure that the corresponding DNA primer sequences have prefixes with a GC content approximately equal to 50%, and that the sequences are at large pairwise Hamming distance. Due to their applications in optical storage, codes addressing related issues have been studied in a different form under the name of bounded running digital sum (BRDS) codes^{7,8}. A detailed overview of this coding technique may be found in^{7}.
Consider a sequence a = a_{0}, a_{1}, a_{2}, …, a_{l}, …, a_{n} over the alphabet {−1, 1}. We refer to S_{l}(a) = a_{0} + a_{1} + ⋯ + a_{l} as the running digital sum (RDS) of the sequence a up to length l, l ≥ 0, and let D_{a} = max_{l} |S_{l}(a)| denote the largest absolute value of the running digital sum of the sequence a. For some predetermined value D > 0, a set of sequences {a^{(1)}, …, a^{(M)}} is termed a BRDS code with parameter D if D_{a^{(i)}} ≤ D for all i = 1, …, M. Note that one can define nonbinary BRDS codes in an equivalent manner, with the alphabet usually assumed to be symmetric, {−q, −q + 1, …, −1, 1, …, q − 1, q}, where q ≥ 1. A set of DNA sequences over {A, T, G, C} may be constructed in a straightforward manner by mapping each +1 symbol into one of the bases {A, T}, and each −1 into one of the bases {G, C}, or vice versa. Alternatively, one can use BRDS codes over an alphabet of size four directly.
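The definitions above can be sketched in a few lines over the alphabet {−1, +1}; recall that mapping +1 to {A, T} and −1 to {G, C} turns the BRDS condition into the prefix GC-balance constraint.

```python
def max_rds(seq):
    """Largest |running digital sum| over all prefixes of a ±1 sequence."""
    s, peak = 0, 0
    for x in seq:
        s += x
        peak = max(peak, abs(s))
    return peak

a = [1, -1, 1, -1, 1, 1, -1, -1]   # a well-balanced sequence
print(max_rds(a))                   # 2, so a satisfies the BRDS constraint with D = 2
```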
To address the constraints C1–C2, one needs to construct a large set of BRDS codewords at sufficiently large Hamming distance from each other. Via the mapping described above, these codewords may be subsequently translated to DNA sequences with a GC content approximately equal to 50% for all sequence prefixes, and at the same Hamming distance as the original sequences.
Let (n, C, d; D) denote the parameters of a BRDS error-correcting code, where C denotes the number of codewords of length n, d denotes the minimum Hamming distance of the code, and log_{2}(C)/n equals the code rate. For D = 1, 2 and d = 1, 2, the best known BRDS codes^{8} each contain an exponentially large number of codewords, among which an exponentially large number of sequences satisfy the required correlation property C3, discussed next. Codewords satisfying constraint C4 were found by expurgating the BRDS codes via computer search.
Sequence Correlation
We next introduce the notion of the autocorrelation of a sequence and describe mutually uncorrelated sequences (i.e., cross-bifix-free codes) as well as the new class of weakly mutually uncorrelated sequences. Mutually uncorrelated sequences, cross-bifix-free and non-overlapping codes were introduced and rediscovered many times, as witnessed by the publications^{18,19,20,21}.
It was shown in^{17} that the autocorrelation function is the crucial mathematical concept for studying sequences avoiding forbidden strings and substrings. In the storage context, forbidden strings correspond to the addresses of the blocks in the pool. In order to accommodate the need for selective retrieval of a DNA block without accidentally selecting any undesirable blocks, we find it necessary to also introduce the notion of mutually uncorrelated sequences.
Let X and Y be two words, possibly of different lengths, over some alphabet of size q > 1. The correlation of X and Y, denoted by C(X, Y), is a binary string of the same length as X. The ith bit (from the left) of C(X, Y) is determined by placing Y under X so that the leftmost character of Y is under the ith character (from the left) of X, and checking whether the characters in the overlapping segments of X and Y are identical. If they are identical, the ith bit of C(X, Y) is set to 1; otherwise, it is set to 0. For example, for X = CATCATC and Y = ATCATCGG, one has C(X, Y) = 0100100.
Note that in general C(X, Y) ≠ C(Y, X), and that the two correlation vectors may be of different lengths. In the example above, we have C(Y, X) = 00000000. The autocorrelation of a word X equals C(X, X).
For the word X = CATCATC in the example above, the autocorrelation equals C(X, X) = 1001001.
Definition 1. A sequence X is self-uncorrelated if C(X, X) = 10⋯0. A set of sequences {X_{1}, X_{2}, …, X_{M}} is mutually uncorrelated (cross-bifix-free) if each sequence is self-uncorrelated and if all pairs of distinct sequences X_{i}, X_{j} satisfy C(X_{i}, X_{j}) = 00⋯0 and C(X_{j}, X_{i}) = 00⋯0.
Intuitively, correlation captures the extent to which prefixes of sequences overlap with suffixes of the same or other sequences. Furthermore, the notion of mutual uncorrelatedness may be relaxed by requiring only that sufficiently long prefixes do not match sufficiently long suffixes of other sequences. Sequences with this property that are also at sufficiently large Hamming distance eliminate undesired address cross-hybridization during selection as well as cross-sequence assembly errors.
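The correlation vector and both conditions of Definition 1 are straightforward to compute directly; the sketch below reproduces the worked example X = CATCATC, Y = ATCATCGG from above.

```python
def corr(X: str, Y: str) -> str:
    """Correlation vector C(X, Y): the i-th bit is 1 iff Y, placed with its
    first symbol under position i of X, agrees with X on the overlap."""
    bits = []
    for i in range(len(X)):
        m = min(len(X) - i, len(Y))
        bits.append("1" if X[i:i + m] == Y[:m] else "0")
    return "".join(bits)

print(corr("CATCATC", "ATCATCGG"))  # 0100100, as in the text
print(corr("CATCATC", "CATCATC"))   # autocorrelation 1001001
print(corr("ATCATCGG", "CATCATC"))  # C(Y, X) = 00000000
```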
We provide the following simple and easy-to-prove bounds on the size of the largest mutually uncorrelated set of sequences of length n over an alphabet of size q = 4. The bounds show that there exist exponentially many mutually uncorrelated sequences for any choice of n, and the lower bound is constructive. Furthermore, the construction used in the bound “preserves” the Hamming distance and GC content, which distinguishes it from known results in classical coding theory.
Theorem 2. Suppose that {X_{1}, …, X_{M}} is a set of M pairwise mutually uncorrelated sequences of length n, and let u(n) denote the largest possible value of M for a given n. Then u(n) is bounded above and below by functions that grow exponentially in n.
As an illustration, for n = 20, the lower bound equals 972. The proof of the theorem is given in the Supplementary Information.
It remains an open problem to determine the largest number of address sequences that jointly satisfy the constraints C1–C4. We conjecture that the number of such sequences is exponential in n, as the numbers of words that satisfy C1–C3 and C4^{15} are exponential. Exponentially large families of address sequences are important indicators of the scalability of the system and they also influence the rate of information encoding in DNA.
By casting the address sequence design problem as a simple and efficient greedy search, we were able to identify 1149 sequences of length n = 20 that satisfy constraints C1–C4, out of which 32 pairs were used for block addressing. Another means of generating large sets of sequences satisfying the constraints is via approximate solvers for the largest independent set problem^{22}. Examples of sequences constructed in the aforementioned manner and used in our experiments are listed in the Supplementary Information.
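The greedy expurgation can be sketched as follows. For tractability, the example below uses short length-6 sequences and a minimum distance of 3 (illustrative parameters, not the paper's), and screens only for self- and mutual uncorrelatedness plus distance (C2–C3); C1 and C4 screening would be applied to the candidate pool separately.

```python
from itertools import product

def corr(X, Y):
    """Correlation vector C(X, Y) as defined in the text."""
    bits = []
    for i in range(len(X)):
        m = min(len(X) - i, len(Y))
        bits.append("1" if X[i:i + m] == Y[:m] else "0")
    return "".join(bits)

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def compatible(X, Y, dmin):
    """Pairwise test: large distance plus mutual uncorrelatedness."""
    return (hamming(X, Y) >= dmin
            and corr(X, Y) == "0" * len(X)
            and corr(Y, X) == "0" * len(Y))

kept = []
for cand in ("".join(t) for t in product("ACGT", repeat=6)):
    if corr(cand, cand) != "1" + "0" * 5:      # keep self-uncorrelated words only
        continue
    if all(compatible(cand, k, 3) for k in kept):
        kept.append(cand)

print(len(kept), kept[:3])
```

The scan keeps a candidate only if it is pairwise-acceptable with everything kept so far, which is exactly the independent-set view mentioned above: candidates are vertices, conflicts are edges, and the greedy pass extracts a maximal independent set.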
PrefixSynchronized DNA Codes
In the previous sections, we described how to construct address sequences that can serve as unique identifiers of the blocks they are associated with. We also pointed out that once such address sequences are identified, user information has to be encoded so as to avoid the appearance of any of the addresses, of sufficiently long substrings of the addresses, or of substrings similar to the addresses in the resulting DNA codeword blocks. For this purpose, we developed new prefix-synchronized encoding schemes based on ideas presented in^{14}, but generalized to accommodate multiple sequence avoidance.
To address the problem at hand, we start by introducing comma-free and prefix-synchronized codes, which allow for constructing codewords that avoid address patterns. A block code comprising a set of codewords of length N over an alphabet of size q is called comma-free if and only if for any pair of not necessarily distinct codewords a_{1}a_{2} … a_{N} and b_{1}b_{2} … b_{N} in the code, none of the N − 1 concatenations a_{2}a_{3} … a_{N}b_{1}, a_{3}a_{4} … a_{N}b_{1}b_{2}, …, a_{N}b_{1} … b_{N−1} belongs to the code^{17}. Comma-free codes enable efficient synchronization protocols, as one is able to determine the starting positions of codewords without ambiguity. A major drawback of comma-free codes is the need to implement an exhaustive search procedure over sequence sets to decide whether or not a given string of length n should be used as a codeword. This difficulty can be overcome by using a special family of comma-free codes introduced by Gilbert^{9} under the name prefix-synchronized codes. Prefix-synchronized codes have the property that every codeword starts with a prefix p = p_{1}p_{2} … p_{n}, followed by a constrained sequence; moreover, the prefix p does not appear as a substring anywhere else in the codeword. In other words, the constrained sequences of prefix-synchronized codes avoid the pattern p, which is used as the address. First, we point out that in our work no consideration is given to concatenations of codewords, as DNA blocks are stored unattached. Furthermore, due to the choice of mutually uncorrelated addresses at large Hamming distance, we can encode each information block by avoiding only one of the address sequences, namely the one used for that particular block. Avoidance of all other address sequences is automatically guaranteed by the lack of correlation between the sequences, as demonstrated in the proof of our encoding method.
Specifically, for a fixed set of address sequences of length n, we consider the set of sequences of a given length l that do not contain any of the addresses as a substring. By definition, for l < n this is simply the set of all strings of length l. Our objective is then to design an efficient encoding algorithm (a one-to-one mapping) from a set of messages into this address-avoiding set.
In this scheme, we assume that the set of addresses is mutually uncorrelated and that all of its sequences end with the same base, which we take without loss of generality to be G. We then pick an address a = a_{1}a_{2} … a_{n} and associate with it, for 1 ≤ i ≤ n, auxiliary sets of symbols whose elements are assumed to be arranged in increasing lexicographic order; the encoding procedure operates on the jth smallest element of each such set.
Next, we define a sequence of integers S_{n,1}, S_{n,2}, … through a recursive counting formula; S_{n,l} equals the number of valid encodings of length l (cf. Theorem 3 below).
For an integer y, let θ(y) denote its ternary representation of a prescribed length; conversely, for a ternary string W, let θ^{−1}(W) be the integer y such that θ(y) = W. We proceed to describe how to map every integer in {0, 1, 2, …, S_{n,l} − 1} into an address-avoiding sequence of length l and vice versa. We denote these functions by Encode_{a,l} and Decode_{a}, respectively.
The steps of the encoding and decoding procedures are listed in Algorithm 1.
The following theorem is proved in the Supplementary Information.
Theorem 3. Let the addresses form a set of mutually uncorrelated sequences all of which end with the same base, and pick one address a of length n. Then Encode_{a,l} is a one-to-one mapping from {0, 1, 2, …, S_{n,l} − 1} to the set of address-avoiding sequences of length l. Moreover, for all x ∈ {0, 1, 2, …, S_{n,l} − 1}, Decode_{a}(Encode_{a,l}(x)) = x.
A simple example describing the encoding and decoding procedure for the short address string a = AGCTG, which can easily be verified to be self-uncorrelated, is provided in the Supplementary Information.
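To convey the flavor of Encode/Decode without reproducing the paper's exact algorithm (which relies on the recursion for S_{n,l} and ternary representations), the sketch below builds an analogous one-to-one map between integers and address-avoiding DNA strings by lexicographic ranking over a pattern-avoidance (KMP) automaton; all function names here are ours.

```python
ALPHA = "ACGT"

def build_automaton(p):
    """KMP automaton for pattern p; reaching state len(p) means p occurred."""
    n = len(p)
    fail = [0] * n
    for i in range(1, n):
        j = fail[i - 1]
        while j and p[i] != p[j]:
            j = fail[j - 1]
        fail[i] = j + 1 if p[i] == p[j] else 0
    delta = []
    for s in range(n):
        row = {}
        for c in ALPHA:
            j = s
            while j and p[j] != c:
                j = fail[j - 1]
            row[c] = j + 1 if p[j] == c else 0
        delta.append(row)
    return delta

def counts(p, l):
    """cnt[m][s]: number of length-m continuations from state s avoiding p."""
    delta, n = build_automaton(p), len(p)
    cnt = [[1] * n] + [[0] * n for _ in range(l)]
    for m in range(1, l + 1):
        for s in range(n):
            cnt[m][s] = sum(cnt[m - 1][delta[s][c]]
                            for c in ALPHA if delta[s][c] < n)
    return delta, cnt

def encode(x, p, l):
    """The x-th (lexicographically, 0-indexed) length-l string avoiding p."""
    delta, cnt = counts(p, l)
    out, s = [], 0
    for m in range(l, 0, -1):
        for c in ALPHA:
            t = delta[s][c]
            if t == len(p):
                continue                     # this symbol would complete p
            if x < cnt[m - 1][t]:
                out.append(c)
                s = t
                break
            x -= cnt[m - 1][t]
    return "".join(out)

def decode(w, p):
    """Inverse of encode: recover the rank of an address-avoiding string."""
    delta, cnt = counts(p, len(w))
    x, s = 0, 0
    for m in range(len(w), 0, -1):
        target = w[len(w) - m]
        for c in ALPHA:
            t = delta[s][c]
            if t == len(p):
                continue
            if c == target:
                s = t
                break
            x += cnt[m - 1][t]
    return x

w = encode(1000, "AGCTG", 10)
print(w, decode(w, "AGCTG"))  # a 10-mer avoiding AGCTG; decodes back to 1000
```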
The previously described Encode_{a,l}(x) algorithm imposes no limitations on the length of a prefix used for encoding. This feature may lead to unwanted cross-hybridization between the address primers used for selection and the prefixes of addresses encoding the information. One approach to mitigating this problem is to “perturb” long prefixes in the encoded information in a controlled manner. For small-scale random access/rewriting experiments, the recommended approach is to first select all prefixes of length greater than some predefined threshold. Afterwards, the first and last quarter of the bases of each such long prefix are left unchanged, while the central portion of the prefix string is cyclically shifted by half of its length. For example, for the address a = AGTAAGTCTCGCAGTCATCG, if the prefix a^{(16)} = AGTAAGTCTCGCAGTC appears as a subword, say V, in X = Encode_{a,l}(x), then X is modified to X′ by mapping V to V′ = AGTAATCGGTCCAGTC.
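A hedged sketch of the perturbation idea follows: keep the first and last quarters of the prefix and cyclically shift the middle by half its length. The paper's exact transform may include further rules (its worked example is not reproduced by this naive shift), but the key property, invertibility, can be checked directly.

```python
def perturb(v: str) -> str:
    """Keep first/last quarters; cyclically shift the middle by half its length."""
    q = len(v) // 4
    head, mid, tail = v[:q], v[q:len(v) - q], v[len(v) - q:]
    k = len(mid) // 2
    return head + mid[k:] + mid[:k] + tail

def unperturb(v: str) -> str:
    """Invert perturb by shifting the middle back by the complementary amount."""
    q = len(v) // 4
    head, mid, tail = v[:q], v[q:len(v) - q], v[len(v) - q:]
    k = len(mid) - len(mid) // 2
    return head + mid[k:] + mid[:k] + tail

v = "AGTAAGTCTCGCAGTC"             # the 16-bp prefix from the text
print(unperturb(perturb(v)) == v)  # True: the map is invertible
```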
For an arbitrary choice of the addresses, this scheme may not allow for unique decoding of Encode_{a,l}. However, there exist simple conditions that can be checked in order to eliminate primers for which the transform is not uniquely invertible. Given the address primers created for our random access/rewriting experiments, we were able to uniquely map each modified prefix to its original prefix and therefore uniquely decode the readouts.
As a final remark, we point out that prefix-synchronized coding also supports error-detection and limited error-correction. Error-correction is achieved by checking whether each substring of the sequence represents a prefix or “shifted” prefix of the given address sequence, and by making proper changes when needed.
Discussion
We described a new DNA-based storage architecture that enables accurate random access and cost-efficient rewriting. The key components of our implementation are a new collection of coding schemes and the adaptation of random-access enabling codes from classical storage systems. In particular, we encoded information within blocks equipped with unique addresses that are prohibited from appearing anywhere else in the encoded information, thereby removing undesirable cross-hybridization problems during the process of selection and amplification. We also performed four access and rewriting experiments without readout errors, as confirmed by post-selection and post-rewriting Sanger sequencing. The current drawback of our scheme is its high cost, as synthesizing long DNA blocks is expensive. Cost considerations also limited the scope of our experiments and the size of the prototype, as we aimed to stay within a budget comparable to that used for other existing architectures. Nevertheless, the benefits of random access and other unique features of the proposed system compensate for this high cost, which we predict will decrease rapidly in the near future.
Additional Information
How to cite this article: Tabatabaei Yazdi, S. M. H. et al. A Rewritable, Random-Access DNA-Based Storage System. Sci. Rep. 5, 14138; doi: 10.1038/srep14138 (2015).
References
 1.
Bancroft, C., Bowler, T., Bloom, B. & Clelland, C. T. Longterm storage of information in dna. Science (New York, NY) 293, 1763–1765 (2001).
 2.
Davis, J. Microvenus. Art Journal 55, 70–74 (1996).
 3.
Church, G. M., Gao, Y. & Kosuri, S. Nextgeneration digital information storage in dna. Science 337, 1628–1628 (2012).
 4.
Goldman, N. et al. Towards practical, highcapacity, lowmaintenance information storage in synthesized dna. Nature (2013).
 5.
Grass, R. N., Heckel, R., Puddu, M., Paunescu, D. & Stark, W. J. Robust chemical preservation of digital information on dna in silica with errorcorrecting codes. Angewandte Chemie International Edition 54, 2552–2555 (2015).
 6.
Ross, M. G. et al. Characterizing and measuring bias in sequence data. Genome Biol 14, R51 (2013).
 7.
Cohen, G. D. & Litsyn, S. DC-constrained error-correcting codes with small running digital sum. IEEE Transactions on Information Theory 37, 949–955 (1991).
 8.
Blaum, M., Litsyn, S., Buskens, V. & van Tilborg, H. C. Error-correcting codes with bounded running digital sum. IEEE Transactions on Information Theory 39, 216–227 (1993).
 9.
Gilbert, E. Synchronization of binary messages. IRE Transactions on Information Theory 6, 470–477 (1960).
 10.
Packer, H. CRISPR and Cas9 for flexible genome editing. Technical report. (2014) Available at: www.idtdna.com/pages/products/genes/gblocksgenefragments/decodedarticles/decoded/2013/12/13/crisprandcas9forflexiblegenomeediting. (Accessed: 1st January 2015).
 11.
Bryksin, A. V. & Matsumura, I. Overlap extension PCR cloning: a simple and reliable way to create recombinant plasmids. Biotechniques 48, 463 (2010).
 12.
Schuster, S. C. Next-generation sequencing transforms today’s biology. Nature Methods 5, 16–18 (2008).
 13.
Immink, K. A. S. Codes for mass data storage systems (Shannon Foundation Publisher, 2004).
 14.
Morita, H., van Wijngaarden, A. J. & Han Vinck, A. On the construction of maximal prefix-synchronized codes. IEEE Transactions on Information Theory 42, 2158–2166 (1996).
 15.
Milenkovic, O. & Kashyap, N. On the design of codes for DNA computing. In Coding and Cryptography, 100–119 (Springer, 2006).
 16.
Rouillard, J.-M., Zuker, M. & Gulari, E. OligoArray 2.0: design of oligonucleotide probes for DNA microarrays using a thermodynamic approach. Nucleic Acids Research 31, 3057–3062 (2003).
 17.
Guibas, L. J. & Odlyzko, A. M. Maximal prefix-synchronized codes. SIAM Journal on Applied Mathematics 35, 401–418 (1978).
 18.
Massey, J. L. Optimum frame synchronization. IEEE Transactions on Communications 20, 115–119 (1972).
 19.
Bajic, D. On construction of cross-bifix-free kernel sets. Cryptography and Communications 6, 27–37 (2014).
 20.
Chee, Y. M., Kiah, H. M., Purkayastha, P. & Wang, C. Cross-bifix-free codes within a constant factor of optimality. IEEE Transactions on Information Theory 59, 4668–4674 (2013).
 21.
Blackburn, S. R. Non-overlapping codes. arXiv preprint arXiv:1303.1026 (2013).
 22.
Berman, P. & Fürer, M. Approximating maximum independent set in bounded degree graphs. In SODA, vol. 94, 365–371 (1994).
Acknowledgements
This work was partially supported by the Strategic Research Initiative of the University of Illinois, Urbana-Champaign, and the NSF STC on Science of Information, Purdue University. A provisional patent for rewritable, random-access DNA-based storage was filed with the University of Illinois in November 2014. The authors would like to thank Dr. Han Mao Kiah for his helpful feedback, comments and insights regarding the work and its presentation.
Author information
Author notes
 S. M. Hossein Tabatabaei Yazdi
 , Yongbo Yuan
 & Olgica Milenkovic
These authors contributed equally to this work.
Affiliations
University of Illinois, Department of Electrical and Computer Engineering, Urbana, 61801, US
 S. M. Hossein Tabatabaei Yazdi
 & Olgica Milenkovic
University of Illinois, Department of Chemical and Biomolecular Engineering, Urbana, 61801, US
 Yongbo Yuan
 & Huimin Zhao
University of Illinois, Department of Bioengineering, Urbana, 61801, US
 Jian Ma
University of Illinois, Institute for Genomic Biology, Urbana, 61801, US
 Jian Ma
 & Huimin Zhao
Contributions
S.Y. conducted the research and wrote the main manuscript text. Y.Y. conducted the experiments and wrote the main manuscript text. J.M., H.Z. and O.M. came up with the main idea, conducted the research and wrote the main manuscript text. All authors reviewed the manuscript.
Competing interests
The authors declare no competing financial interests.
Corresponding author
Correspondence to Olgica Milenkovic.
Supplementary information
Rights and permissions
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/