Mass spectrometry sequencing of long digital polymers facilitated by programmed inter-byte fragmentation

In the context of data storage miniaturization, it was recently shown that digital information can be stored in the monomer sequences of non-natural macromolecules. However, the sequencing of such digital polymers is currently limited to short chains. Here, we report that intact multi-byte digital polymers can be sequenced in a moderate resolution mass spectrometer and that full sequence coverage can be attained without requiring pre-analysis digestion or the help of sequence databases. In order to do so, the polymers are designed to undergo controlled fragmentations in collision-induced dissociation conditions. Each byte of the sequence is labeled by an identification tag and a weak alkoxyamine group is placed between 2 bytes. As a consequence of this design, the NO-C bonds break first upon collisional activation, thus leading to a pattern of mass tag-shifted intact bytes. Afterwards, each byte is individually sequenced in pseudo-MS3 conditions and the whole sequence is found.

It has been demonstrated in recent years that digital information can be stored in biological 1 and synthetic macromolecules 2 . In such digital polymers, the monomer units that constitute the chains are used as molecular bits and assembled through controlled synthesis into readable digital sequences 3 . For example, it has been reported that ordered oligonucleotide sequences enable storage of several kilobytes of data in DNA chains [4][5][6][7] . Alternatively, our group has demonstrated that binary messages can also efficiently be stored in different types of synthetic macromolecules [8][9][10] . Overall, molecular encryption in polymers opens up interesting avenues for massive information storage 1 as well as long-term data storage 11 . Importantly, the use of digitally encoded polymers allows room temperature storage and already gives access to substantial storage capacities. However, the development of practical technologies is currently still limited by relatively slow writing and reading speeds 1 . Various sequencing methods, including tandem mass spectrometry (MS/MS), enzyme-assisted approaches, and nanopore threading, can be used to decipher the coded sequences of biopolymers and man-made macromolecules [12][13][14][15] . For biopolymer sequencing, however, the molecular structure of the analyte is fixed by biology and therefore quicker analysis can only be attained through the development of advanced analytical methods. The use of synthetic digital polymers offers an alternative scenario, which is that the molecular structure of the polymer can be tuned to facilitate sequencing using routine analytical instruments 16 .
Here, we report that MS sequencing of long coded polymer chains can be achieved through careful macromolecular design. Poly(phosphodiester) chains containing several bytes of information were synthesized and sequenced. Two important molecular features are implemented in the analyte design: alkoxyamine groups are placed between the bytes and each byte is labeled by a mass tag. Under collision-induced dissociation (CID) conditions, the weak NO-C bonds are selectively cleaved, thus leading to a series of mass tag-shifted intact bytes. Afterwards, each byte can be individually activated and easily deciphered by MS 3 . Consequently, full sequence coverage can be obtained in a single measurement performed in a moderate resolution mass spectrometer.

Results
Design and synthesis of the digital polymers. We have recently reported that non-natural digital oligomers constructed with repeating units containing alkoxyamine 9,17 or carbamate 10 linkages are very easy to sequence by MS/MS due to the low-energy fragmentation of these bonds. However, in these macromolecules, bond cleavages occur between each bit, which is suitable for single-byte oligomers 17,18 but might become much more challenging for multi-byte polymers. To favor the sequencing of long digital chains while avoiding the complexity issues commonly associated with MS/MS of highly charged chains, we propose in the present work a two-stage controlled fragmentation strategy, in which weak bonds are placed between each byte (Fig. 1a) instead of between each bit. This concept was applied here to digital poly(phosphodiester)s 8,19 that are synthesized by automated phosphoramidite chemistry [20][21][22] . In such polymers, digital information is written using two monomers that contain either a propyl phosphate (0) or a 2,2-dimethylpropyl phosphate (1) synthon, as shown in Fig. 1b 8,19 . Although multi-byte chains can be synthesized relatively easily 19 , MS/MS sequencing of long poly(phosphodiester)s containing only repeating phosphate linkages is tedious, as illustrated in Supplementary Fig. 1 by the spectrum of a 4-byte polymer (Supplementary Table 1, Entry 1). However, this spectral complexity is mainly due to an efficient cleavage of all phosphate bonds, leading to eight fragment series that can all be usefully employed for full sequence coverage, in great contrast to MS/MS of DNA, which mostly generates w-type ions as well as numerous useless secondary fragments (Supplementary Fig. 2). To simplify this situation, cleavable NO-C bonds were incorporated between each byte in the present work. The concept relies on the fact that a NO-C bond requires less energy than a phosphate bond to be broken in CID conditions. Thus, as schematized in Fig. 1d, inter-byte NO-C bonds shall break first during the first activation stage and lead to a MS 2 spectrum containing all intact individual bytes. Then, subjecting each byte to a second activation stage should yield MS 3 spectra that allow an easier sequencing task, thanks to the small size and low charge state of the dissociating species. However, since bytes may be isobaric (i.e., contain the same number of 0 and 1 units), each byte shall first be labeled with a tag, which allows its identification in MS 2 and reveals its location in the initial sequence. Although a wide variety of molecules could potentially be used as byte tags, natural (noted A, T, G, C) and non-natural (noted B, I, F) nucleotides were selected in the present work (Fig. 1c), because the corresponding phosphoramidite monomers are commercially available. As shown in Fig. 1c, the molecular structure of each byte tag was selected in order to create an unequivocal mass and isotopic signature for each byte. In particular, two simple criteria shall be fulfilled: the molar mass of a given byte tag shall not be a multiple of 28, which is the mass difference between a 0 and a 1 synthon, and the mass difference between two byte tags shall not be a multiple of 28 and shall also not be smaller than 3 Da, since triply charged species (vide infra) are studied in MS 3 . The byte tag sequence, T, C, A, G, B, I, F, no tag, was chosen to label the bytes and reading was arbitrarily started from the non-marked byte (i.e., opposite to the sense of synthesis). This means, for example, that the first, penultimate, and last bytes of a sequence always contain no tag, a C-tag, and a T-tag, respectively. Hence, after performing MS, MS 2 , and MS 3 steps, the whole sequence of the original polymer can be comprehensively reconstructed.
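The two tag-selection criteria stated above can be expressed as a short check. The following Python sketch is illustrative only: the function names, the 0.5 Da tolerance, and the tag masses in the usage note are hypothetical placeholders, not the values of the nucleotide tags shown in Fig. 1c. Only the 28 Da bit spacing and the 3 Da minimum difference come from the text.

```python
from itertools import combinations

BIT_MASS_STEP = 28    # Da, mass difference between a 0 and a 1 synthon (from the text)
MIN_TAG_SPACING = 3   # Da, minimum difference between two tags (from the text)


def is_multiple(value, step=BIT_MASS_STEP, tol=0.5):
    """Return True if value lies within tol of an integer multiple of step.

    The tolerance is an assumption made for this sketch.
    """
    remainder = value % step
    return min(remainder, step - remainder) <= tol


def check_tags(tag_masses):
    """Return a list of violations of the two tag criteria (empty if none)."""
    problems = []
    # Criterion 1: a tag mass must not be a multiple of the bit mass step.
    for name, mass in tag_masses.items():
        if is_multiple(mass):
            problems.append(f"{name}: mass is a multiple of {BIT_MASS_STEP}")
    # Criterion 2: tag mass differences must be at least 3 Da
    # and must not be a multiple of the bit mass step.
    for (n1, m1), (n2, m2) in combinations(tag_masses.items(), 2):
        diff = abs(m1 - m2)
        if diff < MIN_TAG_SPACING:
            problems.append(f"{n1}/{n2}: difference below {MIN_TAG_SPACING} Da")
        elif is_multiple(diff):
            problems.append(f"{n1}/{n2}: difference is a multiple of {BIT_MASS_STEP}")
    return problems
```

With the hypothetical masses `{"X1": 300, "X2": 310, "X3": 328}`, the check flags only the X1/X3 pair, whose 28 Da difference could be mistaken for one additional 1 synthon in a byte.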
In order to verify the feasibility of this concept, model polymers containing only 2 bytes of information separated by a single cleavable NO-C site were first studied (Supplementary Table 1, Entries 3, 4). In particular, two different alkoxyamines containing either a mono- or a di-methylated carbon were considered. These cleavable groups were incorporated between the bytes during polymer synthesis using phosphoramidite monomers a1 and a2 (Supplementary Fig. 3). It was found that the mono-methylated alkoxyamine a1 is not optimal because the energy threshold for NO-C bond homolysis appears to be close to that of inner-byte phosphate bond cleavage, thus resulting in a polluted MS 2 spectrum (Supplementary Fig. 4a). On the other hand, the more labile NO-C linkage in dimethylated a2 leads to preferential fragmentation in MS 2 conditions, where phosphate bonds do not break (Supplementary Fig. 4b). Yet, it should be remarked that a2 leads to some slight in-source byte fragmentation in the initial MS analysis of the complete polymer. However, these ions can be easily tracked and identified based on their much lower charge state compared to the intact macromolecule. Thus, alkoxyamine a2 was selected for the synthesis of digital poly(phosphodiester)s containing 4, 5, 6, or 8 bytes of information (Supplementary Table 1). All these polymers were synthesized by automated phosphoramidite chemistry. Each monomer attachment involved a cycle of four steps: dimethoxytrityl (DMT) deprotection of the reactive alcohol sites; phosphoramidite coupling; oxidation of the formed phosphite into a phosphate; and capping of the unreacted alcohol sites by reaction with acetic anhydride. Although the latter capping step is not mandatory for the synthesis of short oligomers, it is crucial for the synthesis and purification of long sequence-defined polymers 19 . After synthesis, the DMT-terminated digital sequences were cleaved from the support and purified on a reverse-phase cartridge.
This purification process allows separation of the desired DMT-terminated sequences from failure sequences capped by acetic anhydride 19 . Afterwards, the terminal DMT group is removed. As shown in Supplementary Table 1, all polymers were obtained in good yields after reverse-phase column purification.
Fig. 1 General concept studied herein for the sequencing of long digital polymer chains. a Molecular structure of the sequence-coded polymers prepared by automated phosphoramidite chemistry. These digital polymers contain n + 1 coded bytes noted in red. A byte is a sequence of eight coded monomers that represent 8 bits. Two consecutive bytes are separated by a linker noted in black, which contains a NO-C bond that can be preferentially cleaved during MS/MS analysis. In order to sort out the bytes after MS/MS cleavage, n bytes of the sequence are labeled with a mass tag noted in blue. b Molecular structure and mass of the two coded synthons that define the binary code in the polymers. c Molecular structure and mass of the mass tags that are used as byte labels. In order to induce identifiable mass shifts after MS/MS cleavage, the mass of a byte tag (noted in blue) shall not be a multiple of 28, which is the mass difference between a 0 and a 1 coded unit. In addition, the mass difference between two tags (noted in grey) shall not be a multiple of 28. d Schematic representation of the mass spectrometry sequencing of a digital polymer containing 4 bytes of information. The polymer is first analyzed in MS/MS conditions, which lead to the favored cleavage of the weak NO-C bonds (depicted in yellow inside the grey spacers). Since they carry mass tags, the resulting cleaved bytes are sorted out by mass in the MS/MS spectrum (the displayed MS 2 cartoon is idealized for clarity). Afterwards, each byte can be easily sequenced in MS 3 conditions and the whole binary sequence can be deciphered.

Mass spectrometry sequencing of the multi-byte polymers. Figure 2 shows the three-stage analysis of a polymer containing 4 bytes of information (Supplementary Table 1, Entry 5). This polymer was first analyzed by negative mode electrospray ionization (ESI)-MS, which revealed a dominant charge state distribution (Fig. 2a). Among all detected polyanions, the dodecaanion [M-12H] 12− was selected for MS 2 : because it contains, on average, three deprotonated phosphate groups per byte, its dissociation leads to byte fragments at a preferential −3 charge state. It should be specified that MS 2 experiments were not performed here for sequencing purposes, sequencing being traditionally defined as the production of fragments that differ in mass by a single building unit and hence allow the original chain to be reconstructed. Instead, activation of precursor ions aims here at producing fragments that all contain a single byte (to be further sequenced in MS 3 ). However, since dissociation of a precursor with n bytes proceeds by competitive NO-C bond cleavages, primary product ions contain from 1 byte (either the first or the last one) to n−1 (3) bytes. Collision energy was hence raised to promote consecutive dissociations of large primary product ions, in order to form secondary fragments that each contain one inner-chain byte. The energy must, however, remain below the dissociation threshold of phosphate bonds to prevent inner-byte fragmentation. In such conditions, a very clear byte pattern can be observed in the resulting MS 2 spectrum (Fig. 2b). Each byte appears predominantly as a trianion (while remaining as minor signals at −2 or −4 charge states) and the byte tags lead to unambiguous mass shifts that allow identification of their initial location in the chain (Supplementary Table 3). Importantly, the concept also works for polymers containing similar or isomeric bytes. For instance, Supplementary Fig. 5 shows the sequencing of a polymer containing four times the same byte (Supplementary Table 1, Entry 6). Even in such a case, the byte tags allow unequivocal detection of each byte in MS 2 .
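The charge-state reasoning above (about three deprotonated phosphates per byte, bytes detected mainly as trianions) can be illustrated with a little arithmetic. The Python sketch below uses hypothetical function names and, in the usage note, a hypothetical neutral mass; the proton mass is the standard value. One plausible reading of the 3 Da minimum tag difference, offered here as an interpretation rather than a statement from the authors, is that it keeps trianion signals at least one m/z unit apart.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton


def precursor_charge(n_bytes, charges_per_byte=3):
    """Charge state (absolute value) giving about three charges per byte."""
    return n_bytes * charges_per_byte


def mz_deprotonated(neutral_mass, z):
    """m/z of the anion [M - zH]^z- for a neutral mass M (in Da)."""
    if z <= 0:
        raise ValueError("z must be a positive number of removed protons")
    return (neutral_mass - z * PROTON_MASS) / z


def mz_spacing(mass_difference, z=3):
    """Separation on the m/z axis of two ions whose masses differ by mass_difference."""
    return mass_difference / z
```

For example, `precursor_charge(4)` gives 12 and `precursor_charge(8)` gives 24, matching the precursor ions selected for the 4-byte and 8-byte polymers. For byte trianions, `mz_spacing(28)` shows that one 0-to-1 substitution shifts the signal by about 9.33 m/z units, while `mz_spacing(3)` gives 1.0. A hypothetical byte of neutral mass 3000.0 Da would appear at `mz_deprotonated(3000.0, 3)`, roughly 998.99.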
Fragments resulting from partially cleaved chains, hence containing either two or three bytes and, respectively, detected at −6 and −9 charge states, were also observed in MS/MS spectra, as exemplified in Fig. 2b. After MS 2 byte cleavage, each byte was individually sequenced by MS 3 . In order to take advantage of the resolving capabilities offered by orthogonal acceleration time-of-flight (oa-TOF) mass analyzers to safely assign fragments, the Q-oa-TOF instrument used here for MS and MS 2 stages was also employed to perform pseudo-MS 3 experiments. Typically, deprotonated polymers were first activated in the instrument interface by raising the cone (or skimmer) voltage to perform in-source CID; then, the byte fragments released in this way were mass selected in the quadrupole for further excitation in the collision cell and fragment measurement in the oa-TOF. Figure 2c shows, for example, the sequencing of the T-tagged last byte of polymer 5, and the sequencing of bytes 1-3 is shown in Supplementary Fig. 6. Hence, the complete digital sequence of the polymer was easily deciphered. In order to evidence the universality of this concept, polymers with other 4-byte sequences (Supplementary Table 1 [...] Table 1, Entries 8-10) and confirm that messages encoded in longer polymers can also be easily deciphered using the proposed multi-step sequencing. It is, however, interesting to point out that 5-byte polymers were only labeled with natural byte tags A, T, G, C, whereas the 6-byte polymer also contains a non-natural byte tag B. The latter contains a bromine atom and, therefore, allows byte identification not only by mass but also by isotopic pattern. Ultimately, an 8-byte polymer (Supplementary Table 1, Entry 11) was synthesized and analyzed. Figure 3 shows the MS 2 spectrum obtained from the precursor anion containing 24 negative charges (i.e., three charges on average per byte).
After promoting multiple in-chain NO-C fragmentations, a clear pattern of byte trianions was measured in MS 2 , thus allowing individual byte sequencing by pseudo-MS 3 ( Supplementary Figs. 21-25).
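The final reconstruction step, in which bytes decoded individually by MS 3 are put back in order using their tags, can be sketched as follows. This Python sketch assumes that each byte has already been read as an 8-character string of 0s and 1s and is keyed by its tag letter, with `None` standing for the untagged byte; the function names and this data layout are hypothetical. The tag order and the reading direction follow the convention described in the text.

```python
# Tag order in the sense of synthesis, as given in the text; the untagged byte comes last.
SYNTHESIS_TAG_ORDER = ["T", "C", "A", "G", "B", "I", "F"]


def reading_order(n_bytes):
    """Tags in reading order: untagged byte first, T-tagged byte last."""
    if not 1 <= n_bytes <= len(SYNTHESIS_TAG_ORDER) + 1:
        raise ValueError("between 1 and 8 bytes are supported by this tag set")
    used = SYNTHESIS_TAG_ORDER[: n_bytes - 1]
    return [None] + used[::-1]


def reconstruct(bytes_by_tag):
    """Concatenate decoded bytes (dict: tag -> 8-bit string) in reading order."""
    order = reading_order(len(bytes_by_tag))
    for tag in order:
        if tag not in bytes_by_tag:
            raise KeyError(f"no decoded byte for tag {tag!r}")
        bits = bytes_by_tag[tag]
        if len(bits) != 8 or set(bits) - {"0", "1"}:
            raise ValueError(f"byte for tag {tag!r} is not an 8-bit string")
    return "".join(bytes_by_tag[tag] for tag in order)
```

For a 4-byte polymer, `reading_order(4)` returns `[None, "A", "C", "T"]`, which is consistent with the statement that the first, penultimate, and last bytes carry no tag, a C-tag, and a T-tag, respectively.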

Discussion
The development of practical polymer-based digital memories requires libraries of macromolecules that contain at least a few bytes of data in each chain 1 . Sequence-coded polymer chains containing more than a hundred bits can be synthesized using automated solid-phase chemistry 19 or inkjet technologies 23 . However, MS sequencing of such long chains is very challenging, in particular, when targeting full sequence coverage as mandatory for digital polymers. Indeed, assuming an efficient collisional activation, the total ion current is spread over a large number of fragments. Moreover, these fragments are most often produced at different charge states because they are formed from highly charged macromolecules generated by ESI, a soft technique used to ensure their structural integrity. For instance, when full sequence coverage is required, it is known that MS/MS is most efficient for the sequencing of peptides containing less than 20 residues and that the analysis of longer proteins requires enzymatic digestion and HPLC separation of complex mixtures 24 . Similarly, MS/MS sequencing of intact nucleic acids is not trivial 25 . Such a situation can hardly be improved, as the molecular structure of biopolymers is set by biology.
In contrast, as demonstrated in this article, synthetic polymer chemistry allows design of digital polymers that may undergo controlled fragmentations in MS/MS conditions. Indeed, the experimental data highlighted in this paper indicate that intact long sequence-coded chains can be fully sequenced in a routine mass spectrometer operating at low collision energy, without requiring digestion, purification, or separation steps prior to analysis. The use of databases is also not mandatory in this approach to decipher the coded sequences. To attain such a MS readability, the molecular structure of the polymers was carefully engineered and, in particular, two key features were implemented: the use of cleavable inter-bytes spacer that promotes programmed MS 2 fragmentation and the use of mass tags that allow identification of byte original location. The former feature allows the actual MS sequencing task to be limited to very short (8 bits) oligo(phosphodiester)s, while the latter one permits reliable reconstruction of the byte sequence. It should be noted that the reported concept is not limited to poly(phosphodiester)s and could be extended to other types of digital polymers. For instance, alkoxyamine groups are most probably not the only type of interbyte links that can be used in this approach. Depending on the type of digital polymer that shall be deciphered by MS, other cleavable linkers may be imagined. In fact, the general rule is that  the inter-byte linkers require less energy for being broken in CID conditions than the intra-byte bonds that connect the bits. Furthermore, byte tags shall not necessarily be nucleotides. As mentioned in the results section, these markers have been selected in the present work because their phosphoramidite derivatives are commercially available. Yet, other byte tags can be envisioned.
Here, the general rule is that the molar mass of a byte tag and the mass difference between two byte tags shall not be a multiple of the mass difference between 2 bits. In addition, markers with markedly different isotopic signatures can also be imagined.
Overall, this work opens up interesting perspectives for the design of polymer-based molecular memories 1 . For such technological applications, it is important to specify that the synthesis of very long digital polymers (i.e., linear chains containing several hundreds of coded bits) is not an objective. Indeed, polymer-based memory devices will most probably rely on libraries of coded chains, as already done in the field of DNA storage 4,5 . In such libraries, individual chains containing about 100 coded residues and a short localization address sequence are typically used and make it possible to store large quantities of information 4 . In addition, coding theory 7 and enhanced monomer alphabets 16 can be used to increase storage density in short segments. In this context, the results of the present paper underline that an important milestone in terms of chain length has now been reached for synthetic digital polymers. Indeed, it is now possible to encode and comprehensively decode long non-natural chains, as demonstrated in this work with a chain containing 78 residues (64 bits, 7 tags, and 7 spacers). The next important challenge in the field of polymer-based data storage will, therefore, probably be the development of organized and accessible digital polymer libraries.
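The residue count quoted above follows directly from the chain design: a polymer of n bytes carries 8 bits per byte, plus one tag and one spacer for every byte except the untagged one. A minimal Python sketch of this bookkeeping is given below; the function name and the returned dictionary layout are hypothetical.

```python
def residue_count(n_bytes, bits_per_byte=8):
    """Count coded bits, byte tags, and inter-byte spacers in an n-byte chain."""
    if n_bytes < 1:
        raise ValueError("a chain needs at least one byte")
    bits = n_bytes * bits_per_byte
    tags = n_bytes - 1      # every byte except the untagged first one
    spacers = n_bytes - 1   # one cleavable linker between consecutive bytes
    return {"bits": bits, "tags": tags, "spacers": spacers,
            "total": bits + tags + spacers}
```

For the 8-byte polymer, `residue_count(8)` returns 64 bits, 7 tags, and 7 spacers, for a total of 78 residues, in agreement with the figure given in the text.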
Monomer synthesis. Supplementary Fig. 3 shows the molecular structures of the phosphoramidite monomers that were used in this work to synthesize digital polymers. The phosphoramidite monomers 0 and 1 were used as binary coding units and were synthesized as described in a previous publication 8 . The byte tags A, T, C, G, B, F, and I are commercial nucleotides, as described in the previous paragraph. The alkoxyamine-containing monomers a1 and a2 were synthesized following the strategy depicted in Supplementary Fig. 26. [...] 6 Hz, 12H, -N-(C-(CH3)2)2). 13 [...]

N-(3-(bis(4-methoxyphenyl)(phenyl)methoxy)propyl)-2-((4-hydroxy-2,2,6,6-tetramethyl-piperidin-1-yl)oxy)propanamide (b1). The compound c1 (1.96 g, 6.5 mmol) was co-evaporated with 5 ml of anhydrous pyridine. Afterwards, 10 ml of anhydrous pyridine and 12 ml of anhydrous THF were introduced and reacted with 4,4′-dimethoxytrityl chloride (DMTCl) (2.22 g, 6.5 mmol). The DMTCl was added in three portions at a rate of one portion every 45 min. After the three additions, the mixture was stirred at room temperature for 2 h. The reaction was stopped with the addition of 5 ml of methanol and the mixture was evaporated to dryness. The residue was dissolved in ethyl acetate (40 ml) and washed with ice-cold 5% sodium bicarbonate solution. The aqueous layer was extracted with 40 ml of ethyl acetate. The combined organic layers were washed with water and brine, dried with anhydrous Na 2 SO 4 and evaporated. The resulting product was chromatographed on silica gel (50% ethyl acetate in cyclohexane with 1% triethylamine) yielding 3.