Towards practical and robust DNA-based data archiving using the yin–yang codec system

DNA is a promising data storage medium due to its remarkable durability and space-efficient storage. Early bit-to-base transcoding schemes have primarily pursued information density, at the expense of introducing biocompatibility challenges or decoding failure. Here we propose a robust transcoding algorithm named the yin–yang codec, which uses two rules to encode two binary bits into one nucleotide and generates DNA sequences that are highly compatible with synthesis and sequencing technologies. We encoded two representative file formats and stored them in vitro as 200 nt oligo pools and in vivo as a ~54 kbp DNA fragment in yeast cells. Sequencing results show that the yin–yang codec exhibits high robustness and reliability for a wide variety of data types, with an average recovery rate of 99.9% at 10⁴ molecule copies and above, and an achieved recovery rate of 87.53% at ≤10² copies. Additionally, the in vivo storage demonstration achieved an experimentally measured physical density close to the theoretical maximum.

The pipeline for the YYC system includes three main steps: segmentation, incorporation, and validity screening (Figure 1b). For segmentation, the binary information is extracted from a source file and partitioned into multiple segments according to requirements. A binary index is assigned and added to each segment to form a new segment (information + index), and a pool of binary segments of identical length is thereby obtained. In the incorporation step, two segments are selected randomly and incorporated into one DNA sequence according to the YYC algorithm.
For the validity screening in the last step, each incorporated sequence is screened against pre-set constraints, including GC content, maximal homopolymer length, secondary-structure free energy, etc., and only sequences that meet the set criteria are considered valid. If a DNA sequence fails to meet these criteria, it is discarded. The second selected binary segment (encoded under the "Yin" rule) is put back into the pool and the iteration process is activated: another randomly selected binary segment from the pool is incorporated with the first selected segment (encoded under the "Yang" rule) until a valid DNA sequence is generated. For digital information with a particular data pattern (i.e., an extreme 0/1 ratio), generating a valid DNA sequence with biocompatible features might be challenging, since multiple runs of binary segment incorporation would be required. We therefore evaluated the possibility of generating a valid sequence with the YYC coding schemes over a range of 0/1 ratios from 0% to 50%. By simulating the incorporation of all binary sequences of one byte (8 bits), we found that the possibility of generating a valid sequence (i.e., the fraction of the 1,536 coding schemes that can generate a valid sequence) drops significantly, from 100% to 49.7%, when the 0/1 ratio of a binary sequence falls below 20% (Supplementary Table 2). This implies that for binary segments with an extreme 0/1 ratio, the encoding process will be significantly time-consuming. To avoid this circumstance, we established a pre-screening process to identify 0/1-biased binary segments (0/1 ratio ≤ 20%), for which a "firewall" sets the upper limit of incorporation iterations at 100 (Supplementary Table 3). For segments that fail to pass this process, a "pseudo" binary segment with random but balanced 0/1 content is introduced to allow the generation of a valid DNA sequence.

By analyzing the number of iteration runs when encoding files, we found that 65.04% of all segments can be successfully paired and pass the screening at the first attempt. Fewer than 0.002% of segments fail to generate a valid sequence after 100 iterations. Even in the worst case, with the most unbalanced data pattern we observed, the additional information added to the source file for successful transcoding accounts for only 19.25% of the original file size, with the average number of trials at ~7 and an information density of ~1.45 bits/nt (Supplementary Table 3). This suggests that YYC does not incur a large encoding-time overhead (Supplementary Figure 4). Since very few "pseudo" binary segments are added to the source files, the information density of YYC is maintained at a relatively high level.

Supplementary Section 2: Quantitative analysis
According to Shannon information entropy [1], the information density (ρ_THEORY) of DNA-based data storage can be defined as:

ρ_THEORY = log2(N_DNA) / L_DNA,     (1)

where L_DNA is the number of nucleotides of a DNA sequence and N_DNA refers to the number of available DNA sequences of length L_DNA. In the condition without any constraints introduced, N_DNA equals 4^(L_DNA) and ρ_THEORY is 2.
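This definition can be written directly in code as a quick check of the unconstrained case; the function name is illustrative:

```python
from math import log2

def information_density(n_sequences, seq_length):
    """rho = log2(N_DNA) / L_DNA: bits encodable per nucleotide when
    n_sequences distinct DNA sequences of length seq_length are available."""
    return log2(n_sequences) / seq_length

# Unconstrained case: all 4**L quaternary strings are allowed, so rho = 2.
density = information_density(4 ** 200, 200)
```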

Constraint coding in DNA-based data storage
In practice, considering compatibility with DNA synthesis and sequencing technologies, maximum homopolymer runs (HOMO) and GC content bias (GCBIAS) are the two most critical biochemical constraints for transcoding.
Let N'_DNA be the number of valid DNA sequences of length L_DNA under the constraints of HOMO and GCBIAS. The influence of HOMO on N'_DNA can be represented as:

N'_DNA ≈ λ^(L_DNA),     (2)

where λ is the largest real root [2] of the equation:

x^(m+1) − 4x^m + 3 = 0,     (3)

in which m denotes the maximum allowed homopolymer run length. Simply, combining equations (1) and (2), the constraint of maximum homopolymer runs (ρ_HOMO) is:

ρ_HOMO = log2 λ.     (4)

Similarly, the constraint of GCBIAS on N'_DNA can be represented as:

N'_DNA = Σ_{k=⌈L_DNA(1/2−b)⌉}^{⌊L_DNA(1/2+b)⌋} C(L_DNA, k) · 2^(L_DNA),     (5)

where C(L_DNA, k) represents the number of combinations of k positions selected from L_DNA, the factor 2^(L_DNA) refers to the two sets of nucleotides ({G, C} and {A, T}) which affect the GC ratio, and the GC content of a valid sequence lies within the interval [1/2 − b, 1/2 + b]. Combining equations (1) and (5), the constraint of GC content bias (ρ_GCBIAS) is:

ρ_GCBIAS = log2(N'_DNA) / L_DNA.     (6)

Considering the index used in DNA-based data storage because of the unordered nature of DNA sequences, let N_BIN be the number of binary segments and L_BIN be the length of a binary segment in practice. Their impact on the theoretical upper bound of information density can be represented as:

N_BIT = N_BIN · (L_BIN − ⌈log2 N_BIN⌉),     (7)

where ⌈log2 N_BIN⌉ represents the number of index bits needed to distinguish all possible indices.
Simply, combining equations (1) and (7), the constraint of index (ρ_INDEX) is:

ρ_INDEX = 2 · (L_BIN − ⌈log2 N_BIN⌉) / L_BIN.     (8)

Summing up the above, the theoretical upper bound of information density (ρ_THEORY) can now be represented. The influence of these constraints (ρ_HOMO, ρ_GCBIAS, and ρ_INDEX) on ρ_THEORY usually overlaps. According to set theory [3], ρ_THEORY is then bounded by the most influential constraint alone, that is:

ρ_THEORY = min{ρ_HOMO, ρ_GCBIAS, ρ_INDEX}.     (9)

Assuming instead that the influences of all constraints on ρ are disjoint, there must exist a coding scheme serving as the baseline, whose information density (ρ_BASELINE) satisfies:

ρ_BASELINE = 2 − (2 − ρ_HOMO) − (2 − ρ_GCBIAS) − (2 − ρ_INDEX).     (10)
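The homopolymer and GC-content bounds can be evaluated numerically. The sketch below assumes the standard run-length recurrence for counting homopolymer-limited quaternary strings and a binomial count for the GC window; function names are illustrative:

```python
from math import comb, log2, ceil, floor

def rho_homo(m):
    """Density bound under a maximum homopolymer run of m nucleotides:
    log2(lambda), where lambda is the largest real root of
    x**(m+1) - 4*x**m + 3 = 0, found here by bisection on [3, 4]."""
    lo, hi = 3.0, 4.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid ** (m + 1) - 4 * mid ** m + 3 > 0:
            hi = mid
        else:
            lo = mid
    return log2((lo + hi) / 2)

def rho_gcbias(L, b):
    """Density bound when GC content must stay within [0.5 - b, 0.5 + b]:
    choose which k positions hold G/C (binomial term), then which of the
    two bases in each set fills every position (2**L term)."""
    k_lo, k_hi = ceil(L * (0.5 - b)), floor(L * (0.5 + b))
    n_valid = sum(comb(L, k) for k in range(k_lo, k_hi + 1)) * 2 ** L
    return log2(n_valid) / L
```

For instance, a maximum homopolymer run of 4 nt alone reduces the bound only slightly below 2 bits/nt, so in practice the GC window and indexing tend to dominate the loss.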

Calculation of theoretical information density of YYC
By applying the above equations, it is straightforward to evaluate the difference between the actual information density interval and the theoretical upper bound for a transcoding algorithm.
Equivalently, the actual information density of YYC (ρ_ACTUAL) can be represented as:

ρ_ACTUAL = 2 · (L_BIN − ⌈log2 N'_BIN⌉) / L_BIN · (N_BIN / N'_BIN),

where N'_BIN = N_BIN + N_PSEUDO is the total number of binary segments after any "pseudo" segments have been appended.
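As a back-of-the-envelope check of how indexing and pseudo segments erode the 2 bits/nt ideal, net density can be computed from segment counts alone. The accounting below (two segments fused per sequence, index bits included in each segment, pseudo segments carrying no payload) and the function name are illustrative assumptions:

```python
def yyc_net_density(n_segments, seg_bits, index_bits, n_pseudo=0):
    """Payload bits per nucleotide: each DNA sequence fuses two
    seg_bits-bit segments into seg_bits nucleotides; index bits and
    appended 'pseudo' segments contribute no payload."""
    total_segments = n_segments + n_pseudo
    n_sequences = (total_segments + 1) // 2      # two segments per sequence
    payload_bits = n_segments * (seg_bits - index_bits)
    return payload_bits / (n_sequences * seg_bits)

# With no index and no pseudo segments the full 2 bits/nt is reached.
ideal = yyc_net_density(10_000, 256, 0)
```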

Table 3. The estimation of iteration runs and corresponding information density when transcoding 10 different types/formats of files.
N_BIN represents the original number of binary segments obtained from the segmentation step.

Table 4. The estimation of information density and logical redundancy of DNA Fountain and YYC for varying file types.
Net information density is the number of input information bits divided by the number of nucleotides generated by the coding scheme (excluding flanking primers). Cases where the minimum logical redundancy required for successful encoding and decoding exceeds 100% are marked as 'N/A'. Bold values indicate the best result among the tests.