Multicomponent molecular memory

Multicomponent reactions enable the synthesis of large molecular libraries from relatively few inputs. This scalability has led to the broad adoption of these reactions by the pharmaceutical industry. Here, we employ the four-component Ugi reaction to demonstrate that multicomponent reactions can provide a basis for large-scale molecular data storage. Using this combinatorial chemistry we encode more than 1.8 million bits of art historical images, including a Cubist drawing by Picasso. Digital data is written using robotically synthesized libraries of Ugi products, and the files are read back using mass spectrometry. We combine sparse mixture mapping with supervised learning to achieve bit error rates as low as 0.11% for single reads, without library purification. In addition to improved scaling of non-biological molecular data storage, these demonstrations offer an information-centric perspective on the high-throughput synthesis and screening of small-molecule libraries.


Library Synthesis
To make a library, the reagents, shown in Figure 1, are manually prepared as 500 mM solutions in dimethyl sulfoxide (DMSO) and 60 µL of each is pipetted into a 384-well plate. To level the fluid menisci, we spin the plate in a centrifuge for five minutes at 2,500 rpm. The reagent plate is then placed into an Echo 550 acoustic liquid dispenser, surveyed to check volume levels, and finally used to prepare the reaction wells according to a pre-generated list of transfers. Maps indicating where each reagent was placed on an example library plate are shown in Figure 2e. The reagents are added by class, in the following order: amines, aldehydes, carboxylic acids, and lastly isocyanides. The resulting reaction wells, filled with 200 nL of each reagent solution (800 nL in total), are left to react. After an incubation period of about 24-48 hours, solvent (DMSO) is added to bring the final product volumes to 4 µL. Images of the reagent, product, and MALDI plates for the Ugi library used in this work are provided in Figure 2. Additionally, the molecular structures of all 1,500 expected products are shown in Figure 3. We generally use these library elements for data storage without further purification. Over the course of this work, we have run over 10,000 Ugi reactions, up to 1,500 at a time.
Supplementary Figure 3: The molecular structures of all 1,500 expected compounds in the Ugi library.

Library Subset Selection
The data storage examples described here all use a subset of the compound library, due in large part to the limited write speed. To select a subset, we begin with the list of library products having confirmed M+Na peaks. In the example below, 90% (1346/1500) were detected. We then sort these compounds by sodiated peak intensity. Starting from the strongest peak, we include each successive compound only if its monoisotopic mass is at least 0.008 Da away from that of every previously selected compound. This is repeated until the desired number of compounds is obtained. An overview of the selection process is given in Figure 4. For the 32 products used in Figure 4, the average mass separation was 3.970 Da and the average SNR was approximately 10,000. Although we have not focused on such scenarios in this work, we note that by using multi-peak detection we can distinguish library elements with overlapping monoisotopic masses.
Figure 4: The mass intensities (SNR, sodiated) of the selected products, alongside that of the rest of the library. c A histogram of all library product signals (SNR, sodiated) and the optimal threshold (31.44) used to assess product presence. d A histogram of the differences between adjacent compound masses which illustrates the degree of mass diversity among library elements. Some products had non-unique masses and are not displayed on this log-scale chart. e A flow chart of the library refinement process.
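The greedy selection described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `select_subset`, the tuple layout, and the toy compounds are hypothetical, and "confirmed M+Na peak" is approximated here as an SNR at or above the 31.44 threshold quoted in the Figure 4 caption.

```python
from typing import List, Tuple

Compound = Tuple[str, float, float]  # (name, monoisotopic mass in Da, sodiated SNR)


def select_subset(compounds: List[Compound], n_wanted: int,
                  min_sep_da: float = 0.008,
                  snr_threshold: float = 31.44) -> List[Compound]:
    """Greedily pick the strongest compounds whose masses are well separated."""
    # Keep only products treated as detected (assumption: SNR >= threshold).
    detected = [c for c in compounds if c[2] >= snr_threshold]
    # Strongest sodiated peak first.
    ranked = sorted(detected, key=lambda c: c[2], reverse=True)
    selected: List[Compound] = []
    for name, mass, snr in ranked:
        if len(selected) == n_wanted:
            break
        # Accept only if at least min_sep_da from every compound already chosen.
        if all(abs(mass - m) >= min_sep_da for _, m, _ in selected):
            selected.append((name, mass, snr))
    return selected


# Toy library (hypothetical values): U2 sits 0.005 Da from U1 and is skipped,
# U5 falls below the detection threshold.
toy_library = [
    ("U1", 500.000, 900.0),
    ("U2", 500.005, 800.0),
    ("U3", 503.100, 700.0),
    ("U4", 510.250, 50.0),
    ("U5", 520.000, 5.0),
]
chosen = select_subset(toy_library, n_wanted=3)
print([name for name, _, _ in chosen])
```

Because the list is sorted by intensity before filtering by mass, a weak compound never displaces a stronger one with a nearly identical mass.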

Other Synthesized Ugi Products
For preliminary testing and calibration, several Ugi products were synthesized on a larger scale and purified using a silica column. A table of these products and the reagents used to make them is shown in Figure 5. One of these purified products (F13) is included in all samples throughout this work as a reference mass to correct for sample-to-sample m/z drift. The others were used to test experimental conditions, such as laser power, sample and matrix concentration, and sample wear-out.
To write a dataset, compounds from a library plate are used to create mixtures corresponding to the data to be stored, by dispensing droplets with an acoustic liquid handler. The time required to write a dataset depends on several factors, including the size of the data, the encoding scheme, the rate of droplet dispensing, and the time that the liquid handler takes to move between library wells. Figure 6a illustrates the write process. The liquid transfer lists are sorted by library source well and then by destination well, in order to save time. Figures 6b and 6c plot the instantaneous and cumulative transfers from a library well to many locations on the Anubis data plate. The transfer rate was about 4 compounds/sec, which corresponded to a write speed of 18 bits/sec for this particular dataset.
Figure 6: a Compounds from a library plate are used to generate mixtures corresponding to the digital data, by dispensing droplets with an acoustic liquid handler. b The library compound and data mixture wells addressed for each transfer, within a hundred-second time frame. c The cumulative number of compound transfers performed over time. Inset: An image of the produced data plate.
Figure 7: A detailed summary of the data written into Ugi product mixtures. Experimental parameters are provided along with the original and recovered digital images. The images were adapted from artwork from The Metropolitan Museum of Art (Anubis [1], Dimna [2], Angels [3]) and © Estate of Pablo Picasso/Artists Rights Society (ARS), New York (Violin [4]).

Long-Term Storage
During the normal course of our experiments, we measure data plates within two weeks of writing them. However, the datasets are stable for much longer than that. For example, one of our earliest data plates (Ibex, Figure 8) has been read multiple times over the course of its year-long existence with no significant change in readout quality. Because the solvent has been evaporated and the molecules are trapped within matrix crystals, the stored data should remain intact as long as the samples are kept under reasonable environmental conditions. Some parameters which may affect the lifetime of stored data include:
• Temperature: The melting points of the matrix and encoding compounds represent a limit to the temperature a data plate can be exposed to before crystal integrity becomes compromised. In our experiments, the matrix (HCCA) melting point is 252 °C [5]. The Ugi products are highly stable, due to their peptide-like amide bonds [6], and are reported to have melting points around 200 °C [7,8]. Simulations using OCHEM models [9,10] predict the average melting and boiling points of the synthesized Ugi library to be around 113 °C and 810 °C, respectively.
• Humidity: Since both HCCA and the Ugi products are soluble and stable in water, individual spots should be able to take on water for a short period without affecting the data they store, though the samples should be dried prior to reading. The amide bonds in the Ugi products will slowly degrade by hydrolysis over time [11,12]. This process can be drastically accelerated in the presence of certain enzymes [6]. As such, keeping the samples in a dry and sterile environment is recommended.
• Light and radiation: The Ugi products, specifically their amide bonds, are to some extent susceptible to dissociation under UV and X-ray radiation [13], and hence should be stored in a dark or LED-lit room, without direct sunlight.
While a formal study of the storage lifetime is beyond the scope of this work, accelerated environmental testing of crystallized samples [14] could be performed at or below 100 °C, with around 50% relative humidity, and under mild UV illumination to extrapolate dataset lifetimes.
Supplementary Figure 8: Recovery of data after long-term storage. The binary image stored here was adapted from a 13th-century Egyptian block print of an ibex [15]. In between measurements, the plate was stored in a transparent, unsealed cabinet in our mass spectrometry room at around 20 °C. Note that the lower accuracy of the earlier read was due to calibration issues with our mass spectrometer that were later resolved.

Comparison to Related Work
In this work, we store the largest amount of data among small-molecule information demonstrations to date. Even compared to more mature DNA memory, this work falls near the median of recent data storage examples. Figure 9 plots the total data encoded in over a dozen relevant demonstrations from the past ten years.

Spectrum Acquisition
To analyze the chemical makeup of many samples per day, we use a Fourier-transform ion cyclotron resonance (FT-ICR) mass spectrometer with matrix-assisted laser desorption ionization (MALDI). During a measurement, a small fraction of each spot's material is removed by a laser, and ions are excited into orbit in a high vacuum and a strong magnetic field. The orbital frequencies are a function of the mass and charge of each ion: $f = \frac{Bz}{2\pi m}$, where B is the magnetic field strength, z is the ion charge, and m is the ion mass [32]. The mass-to-charge ratio (m/z) of the ions can be found by taking the Fourier transform of the detected time-domain signal. Because the ions can be kept in orbit for several seconds, corresponding to millions of orbital cycles, FT-ICR mass spectra have exceptionally high resolution, often reaching peak widths of 0.001 Da or less. Here we typically acquire spectra for 1.5 seconds, which results in a resolving power of $1.3 \times 10^5$ near 600 Da (Figure 15).
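As a worked illustration of the relation f = Bz/(2πm) and of what a given resolving power means for peak width, the short Python sketch below may help. The 7 T field strength is an assumption chosen only for illustration (the magnet strength is not stated in this section), and the function names are hypothetical.

```python
import math

ELEMENTARY_CHARGE_C = 1.602176634e-19   # coulombs per elementary charge
DALTON_KG = 1.66053906660e-27           # kilograms per dalton


def cyclotron_frequency_hz(mz_da: float, b_tesla: float) -> float:
    """Cyclotron frequency f = B*z / (2*pi*m) for an ion of the given m/z.

    mz_da is the mass-to-charge ratio in Da per elementary charge.
    """
    mass_per_charge = mz_da * DALTON_KG / ELEMENTARY_CHARGE_C  # kg per coulomb
    return b_tesla / (2.0 * math.pi * mass_per_charge)


def peak_width_da(mz_da: float, resolving_power: float) -> float:
    """Peak width (FWHM) implied by resolving power = m / FWHM."""
    return mz_da / resolving_power


B_ASSUMED_TESLA = 7.0  # ASSUMPTION: illustrative field strength, not from the text
f_600 = cyclotron_frequency_hz(600.0, B_ASSUMED_TESLA)
cycles_in_read = f_600 * 1.5                      # 1.5 s acquisition, as in the text
fwhm_600 = peak_width_da(600.0, 1.3e5)            # resolving power quoted in the text

print(f"f = {f_600:.0f} Hz, cycles = {cycles_in_read:.0f}, FWHM = {fwhm_600:.4f} Da")
```

Heavier ions orbit more slowly, so for a fixed acquisition time they complete fewer cycles, which is consistent with resolving power falling as m/z rises.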

Mass Calibration
To ensure the mapping from cyclotron frequency to mass-to-charge ratio is as accurate as possible, the mass spectrometer is calibrated with a sample having several known masses across the measurable range of 150-3000 Da. We perform this calibration prior to each MALDI plate measurement, using a solution of sodium trifluoroacetate (NaTFA): 0.05 g L⁻¹ NaTFA dissolved in a 1:1 water:methanol mixture. The solution is injected into the instrument via electrospray ionization (ESI) and peaks are observed across the spectrum which correspond to $\mathrm{Na}_x(\mathrm{CF_3CO_2})_y$. The measured positions of these peaks are then adjusted to their known values with a quadratic model, as are the rest of the m/z values. The results of an example m/z calibration are shown in Figure 10.
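A quadratic m/z correction of this kind can be sketched with a least-squares polynomial fit. The example below is only an illustration: the reference masses are evenly spaced placeholders rather than real NaTFA cluster masses, the simulated distortion is invented, and the instrument software may fit in the frequency domain instead of directly on m/z.

```python
import numpy as np


def fit_quadratic_calibration(measured_mz, reference_mz):
    """Least-squares fit of reference = c2*m**2 + c1*m + c0; returns a callable."""
    coeffs = np.polyfit(np.asarray(measured_mz, dtype=float),
                        np.asarray(reference_mz, dtype=float), deg=2)
    return np.poly1d(coeffs)


def error_ppb(calibrated_mz, reference_mz):
    """Relative m/z error in parts per billion: residual / reference * 1e9."""
    calibrated = np.asarray(calibrated_mz, dtype=float)
    reference = np.asarray(reference_mz, dtype=float)
    return (calibrated - reference) / reference * 1e9


# Placeholder reference masses (hypothetical, NOT actual NaTFA cluster masses).
reference = np.array([200.0, 600.0, 1000.0, 1400.0, 1800.0, 2200.0, 2600.0, 3000.0])
# Simulated instrument distortion (invented for this sketch).
measured = reference + 1e-3 + 2e-6 * reference + 1e-9 * reference**2

calibrate = fit_quadratic_calibration(measured, reference)
residual_before = measured - reference
residual_after = calibrate(measured) - reference

print("max |residual| before:", np.max(np.abs(residual_before)))
print("max |residual| after: ", np.max(np.abs(residual_after)))
print("max |error| after (ppb):", np.max(np.abs(error_ppb(calibrate(measured), reference))))
```

The same fitted function is then applied to every m/z value in the spectrum, not only to the calibrant peaks.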
Additionally, a small offset calibration is performed within each spectrum using a common known mass. We spike the HCCA matrix solution with a manually purified Ugi product (F13) at a concentration of about 8 mM. This peak, commonly referred to as a mass lock, appears in all measured spectra, and is used to correct for fine offsets in each measured spectrum, of ±0.01 Da or less.
Figure 10: A typical m/z calibration based on NaTFA and performed in electrospray ionization (ESI) mode. a The mass spectrum of NaTFA after calibration. The peaks used for calibration are marked with a triangle above them. The inset shows one of these peaks, both before and after calibration. b A plot of the m/z residuals, the difference between measured and expected, before and after quadratic calibration. c A plot of the absolute m/z error, in parts per billion (ppb): residual/reference $\times 10^9$, post-calibration. A linear trend-line, y = 0.02248 · x + 2.473, is shown to emphasize the rise in error with increasing mass. d A plot, post-calibration, showing how peak width (FWHM) and resolving power (peak center over width) vary with m/z.

Spot Alignment
Instead of recording mass spectra over a perfectly rectangular grid, we minimize positioning errors in MALDI plate readout by imaging each plate with a flatbed scanner. We process the scanned images to automatically detect the precise location of spots using the Hough transform [33] (Figure 11). Standard spot addresses, such as "X01Y03", are assigned by proximity to an expected grid. The resulting position map is loaded onto the mass spectrometer prior to running MALDI measurements.
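The address assignment step can be illustrated with a small Python sketch that snaps detected spot centers to the nearest node of an expected grid. Everything specific here is an assumption made for illustration: the function name, the grid origin and pitch, the distance cutoff, and the reading of "X01Y03" as column 1, row 3. The spot detection itself (the Hough transform) is not shown.

```python
import math
from typing import Dict, List, Tuple


def assign_addresses(detected_xy: List[Tuple[float, float]],
                     origin_xy: Tuple[float, float],
                     pitch_mm: float, n_cols: int, n_rows: int,
                     max_dist_mm: float) -> Dict[str, Tuple[float, float]]:
    """Map detected spot centers to grid addresses such as 'X01Y03'."""
    best: Dict[str, Tuple[float, float, float]] = {}
    for x, y in detected_xy:
        # Nearest grid node (1-based column and row indices).
        col = round((x - origin_xy[0]) / pitch_mm) + 1
        row = round((y - origin_xy[1]) / pitch_mm) + 1
        if not (1 <= col <= n_cols and 1 <= row <= n_rows):
            continue  # outside the expected grid
        grid_x = origin_xy[0] + (col - 1) * pitch_mm
        grid_y = origin_xy[1] + (row - 1) * pitch_mm
        dist = math.hypot(x - grid_x, y - grid_y)
        if dist > max_dist_mm:
            continue  # too far from any expected position
        key = f"X{col:02d}Y{row:02d}"
        # If two detections claim the same address, keep the closer one.
        if key not in best or dist < best[key][2]:
            best[key] = (x, y, dist)
    return {key: (x, y) for key, (x, y, _) in best.items()}


# Hypothetical detections on a 3 x 3 grid with a 4.5 mm pitch.
spots = [(0.10, -0.05), (4.45, 9.10), (50.0, 50.0)]
position_map = assign_addresses(spots, origin_xy=(0.0, 0.0), pitch_mm=4.5,
                                n_cols=3, n_rows=3, max_dist_mm=1.0)
print(position_map)
```

The returned map keeps the measured coordinates rather than the ideal grid positions, so the laser is aimed at where each spot actually dried.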

Repeated Reads
Each read from a MALDI spot removes a portion of the original sample. To quantify how a spot degrades after repeated measurement, we performed a series of spot burnout tests at various laser powers (Figure 12). The Ugi product (purified H6) and matrix concentrations in these samples were 1.56 mM and 88 mM, respectively. Each measurement was performed with 500 laser shots at medium focus and a firing rate of 1000 Hz, while the scan region width was set to 500 µm. For most of the tested laser powers, the product could still be detected after 100 repeated reads.
Using the burned-out region, we can estimate the amount of matrix material ionized per spectral recording. From the Anubis data plate, the height of a dried spot was coarsely measured, using in-plane microscope imaging, to be about 10 µm. Assuming the spots here are roughly the same height and that the majority of this was irradiated during the burn tests, the volume of the burned-out region is 0.00239 mm³. Since this volume was lost over the course of 100 reads, each measurement uses 23,900 µm³ of the sample, which amounts to 0.24% of the total spot volume. Based on this ablation rate, the amount of Ugi product and matrix ionized per read is about 10 fmol (6 pg) and 13 pmol (2 ng), respectively. This approach could readily scale down to spatial dimensions of tens of microns without a loss in spectrum quality.
Figure 12: A series of MALDI spot burnout tests performed at various laser powers. a Images of one of the spots before and after all 100 repeat reads at 12% power. The final ablated area, the dark region within the bright elliptical spot, is 0.239 mm², which is 24% of the total spot area. b The gradually weakening Ugi product peak (sodiated) as its spectrum is repeatedly measured at 9% laser power. c Ugi product peak intensity over the course of 100 reads. These repeated read tests were performed at differing laser powers, each with their own separate MALDI spot of identical composition. The dashed gray line, at 12-σ, represents a conservative bound for distinguishing signal peaks from background noise.
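The volume figures in this section follow from a few lines of arithmetic, reproduced here as a small Python sketch. The inputs are the values quoted in the text and in the Figure 12 caption; the function name is hypothetical, and the calculation assumes a uniform spot height, as the text does.

```python
def ablation_per_read(ablated_area_mm2: float, spot_height_um: float,
                      n_reads: int, ablated_area_fraction: float):
    """Return (burn volume in mm^3, volume per read in um^3, fraction of spot per read)."""
    spot_height_mm = spot_height_um / 1000.0
    burn_volume_mm3 = ablated_area_mm2 * spot_height_mm
    # 1 mm^3 = 1e9 um^3
    per_read_um3 = burn_volume_mm3 / n_reads * 1e9
    # Total spot area inferred from the ablated fraction of the spot area.
    spot_area_mm2 = ablated_area_mm2 / ablated_area_fraction
    spot_volume_mm3 = spot_area_mm2 * spot_height_mm
    fraction_per_read = (burn_volume_mm3 / n_reads) / spot_volume_mm3
    return burn_volume_mm3, per_read_um3, fraction_per_read


burn_volume, per_read, fraction = ablation_per_read(
    ablated_area_mm2=0.239,      # ablated area after 100 reads
    spot_height_um=10.0,         # coarse height estimate of a dried spot
    n_reads=100,
    ablated_area_fraction=0.24,  # ablated area as a fraction of the spot area
)
print(f"burn volume: {burn_volume:.5f} mm^3")
print(f"per read:    {per_read:.0f} um^3 ({fraction * 100:.2f}% of the spot)")
```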

Signal-to-Noise Estimation
Spectral offset (µ) and background noise level (σ) are estimated as the mean and standard deviation, respectively, of the spectrum points below a certain intensity threshold. The signal cutoff is chosen to exclude strong peaks, which are effectively outliers, from the rest of a spectrum's points. We define this threshold ($I_T$), per spectrum, as the upper Tukey fence for extreme outliers [34]:
$$I_T = Q_3 + 3\,(Q_3 - Q_1),$$
where $Q_1$ and $Q_3$ are the first and third quartiles of spectrum intensity, respectively. The signal-to-noise ratio (SNR) is defined for each spectrum intensity $I$ as:
$$\mathrm{SNR} = \frac{I - \mu}{\sigma},$$
where µ is the mean and σ is the standard deviation of background intensities, which are defined as those below the threshold: $I < I_T$. Using SNR instead of raw intensity tends to help normalize spectral recordings across measurement runs and plates, especially when different acquisition settings or sample concentrations are used. We generally find that peaks with SNR values at or above 12-σ can be confidently identified against the background.
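The estimation procedure in this section can be sketched with Python's standard library. Two details are assumptions rather than statements from the text: the fence is taken as the conventional outer Tukey fence, Q3 + 3(Q3 − Q1), and the population standard deviation is used for σ. The function name and the synthetic spectrum are hypothetical.

```python
import statistics
from typing import List, Tuple


def estimate_snr(intensities: List[float]) -> Tuple[List[float], float, float, float]:
    """Return (snr per point, mu, sigma, threshold) for one spectrum."""
    q1, _, q3 = statistics.quantiles(intensities, n=4)
    # ASSUMPTION: outer Tukey fence for extreme outliers, Q3 + 3 * IQR.
    threshold = q3 + 3.0 * (q3 - q1)
    background = [i for i in intensities if i < threshold]
    mu = statistics.fmean(background)
    sigma = statistics.pstdev(background)  # ASSUMPTION: population std. dev.
    if sigma == 0.0:
        raise ValueError("background has zero variance; SNR is undefined")
    snr = [(i - mu) / sigma for i in intensities]
    return snr, mu, sigma, threshold


# Synthetic spectrum: flat-ish background plus one strong peak at the end.
spectrum = [1.0, 2.0, 3.0, 4.0, 5.0] * 20 + [500.0]
snr, mu, sigma, threshold = estimate_snr(spectrum)
peaks = [idx for idx, value in enumerate(snr) if value >= 12.0]
print(f"mu={mu:.2f}, sigma={sigma:.3f}, threshold={threshold:.1f}, peaks at {peaks}")
```

Because the quartiles are insensitive to a handful of strong peaks, the background estimate stays stable even when the number of peaks varies between spectra.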

Adduct Ions
Measured spectra frequently contain the sodiated (M+Na), protonated (M+H), and potassiated (M+K) variants of our Ugi products (see Figure 3 of the main text), but these make up a small fraction of the overall list of peaks in a MALDI mass spectrum. In general, it is difficult to predict the composition of adduct ions that will arise in a mixed sample. However, we can work backwards to try to identify some of the statistically useful features found with our supervised learning approaches.
To do so, we assembled a list of potential ions that may appear in our spectra based on adducts commonly seen in soft ionization [35], as well as those we have observed or suspect may form, such as adducts containing multiple molecular ions (nM), protons (H), common alkali metal ions (Na/K), matrix molecules (nHCCA), solvent (nDMSO), leftover reagents (AmiX/AldX/CarX/IsoX), and other library compounds (nUgiX). In Figure 13 we plot the m/z features found with random forest regression, for three Ugi library elements, and tentatively label some of their prominent features. For example, the M+Na peak is useful for identifying the presence of Ugi 188, while Ugi 1439 is better identified by ion complexes containing unreacted starting materials. In all three plots, we find features that seem to correspond to other library elements, which we suspect is due to the competitive ionization of compounds in our mixed samples.
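For the three simplest adducts named at the start of this section, the expected m/z values follow directly from the neutral monoisotopic mass. The sketch below is a minimal illustration for singly charged ions only; the function names, the example mass of 500 Da, and the 0.005 Da matching tolerance are hypothetical, and the atomic masses are standard monoisotopic values rounded to about 1 µDa.

```python
ELECTRON_MASS_DA = 0.00054858

# Monoisotopic atomic masses (Da) of the attached species.
ATOM_MASS_DA = {"H": 1.00782503, "Na": 22.98976928, "K": 38.96370649}


def adduct_mz(neutral_mass_da: float) -> dict:
    """Expected m/z of singly charged M+H, M+Na and M+K ions."""
    # A cation adduct adds the atom's mass minus one electron.
    return {f"M+{atom}": neutral_mass_da + mass - ELECTRON_MASS_DA
            for atom, mass in ATOM_MASS_DA.items()}


def match_adducts(neutral_mass_da: float, peak_mz_list, tol_da: float = 0.005) -> dict:
    """Label observed peaks that fall within tol_da of an expected adduct m/z."""
    expected = adduct_mz(neutral_mass_da)
    return {name: peak for name, mz in expected.items()
            for peak in peak_mz_list if abs(peak - mz) <= tol_da}


expected = adduct_mz(500.0)              # hypothetical neutral mass
observed = [501.0071, 522.9893, 610.2]   # hypothetical peak list
labels = match_adducts(500.0, observed)
print(expected)
print(labels)
```

More complex candidates from the list above (matrix, solvent, or reagent adducts) could be handled the same way by adding their masses to the table, though predicting which of them actually form remains difficult, as the text notes.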

Parametric Sweeps
There are numerous experimental parameters that can be adjusted in these MALDI mass spectrometry measurements, such as acquisition duration and mass range, laser settings (power, diameter, frequency, shot count), scan settings (walk type, width, grid increment), and spot composition (sample and matrix concentrations, masses, spot size). Some of the tests that were done to select key parameters are shown here. Fortunately, the Ugi products in our library tend to respond similarly, and the majority of these parameters can be kept constant once a satisfactory configuration is found. However, to account for run-to-run and plate-to-plate variability, and to ensure high signal-to-noise ratios, a few parameters, such as laser shot count and power, are often manually tuned before measuring an entire plate, by selecting a single spot on the plate and tuning its total ion current (TIC) to the $10^8$ to $10^9$ range. If the TIC is made much higher than this, spectra begin to degrade due to increased ion-ion interactions [36].
Figure 15: b The spectra resulting from transforming the above raw signals into the mass-to-charge domain. An inset with intensity on a log-scale is shown to highlight the widening of peaks with decreasing measurement time. c A narrow region from the spectra shown in b, which depicts how runtime affects the shape of a compound's peak. d Peak width (FWHM) and resolving power for each acquisition duration, as measured from the traces shown in c.
Figure 16: A study sweeping laser power and Ugi product (H6) concentration. a A collage of the tested samples showing, from left to right, an image of the crystallized MALDI spots, a total ion current (TIC) map for each spot's spectra, and an intensity map for the sodiated peak of the swept Ugi product. Each tested laser power and Ugi product concentration can be read from the given axis. b A 3-D surface visualization of the Ugi product's sodiated peak intensity across the swept parameter space. c A plot of peak intensity as a function of laser power for each tested Ugi concentration. d A plot of peak intensity as a function of Ugi concentration for each tested laser power.
Supplementary Figure 17: A study sweeping matrix (HCCA) and Ugi product (H6) concentration. a A collage of the tested samples showing, from left to right, an image of the crystallized MALDI spots, a total ion current (TIC) map for each spot's spectra, and an intensity map for the sodiated peak of the swept Ugi product. Each tested matrix and Ugi product concentration can be read from the given axis. b A plot of peak intensity as a function of Ugi concentration for each tested matrix concentration. c A plot of peak intensity as a function of matrix concentration for each tested Ugi concentration. d A 3-D surface visualization of the sodiated Ugi product peak's intensity across the swept parameter space. e The critical matrix concentrations and ratios for each Ugi compound concentration. The critical values were found as the log-scale midpoint between the high and low intensity regimes of the traces shown in c.