Identifying molecules as biosignatures with assembly theory and mass spectrometry

The search for alien life is hard because we do not know what signatures are unique to life. We show why complex molecules found in high abundance are universal biosignatures and demonstrate the first intrinsic experimentally tractable measure of molecular complexity, called the molecular assembly index (MA). To do this we calculate the complexity of several million molecules and validate that their complexity can be experimentally determined by mass spectrometry. This approach allows us to identify molecular biosignatures from a set of diverse samples from around the world, outer space, and the laboratory, demonstrating it is possible to build a life detection experiment based on MA that could be deployed to extraterrestrial locations, and used as a complexity scale to quantify constraints needed to direct prebiotically plausible processes in the laboratory. Such an approach is vital for finding life elsewhere in the universe or creating de-novo life in the lab.


Complexity as a Biosignature
Biosignatures are defined as "object[s], substance[s], and/or pattern[s] whose origin specifically requires a biological agent" (1, 2). Here we focus on evaluating the plausibility of molecular artefacts (or objects) as biosignatures. For reviews relevant to evaluating atmospheric patterns as candidate biosignatures we refer the reader to recent work by Walker et al. (3) and Schwieterman et al. (4). Previous authors have suggested using specific biomolecules, such as lipids (5) or nucleic acids (6), to identify living systems. Unfortunately, relying on specific organic molecules prohibits us from detecting life based on non-terran biochemistry. Others have suggested using homochiral polymers as a universal biosignature (7); however, abiotic processes are known to produce enantiomeric excesses that could result in false positives (8). Finally, isotopic fractionation has been posited to distinguish biologically generated material from abiotic material (9). However, as pointed out by Neveu (10), isotopic fractionation can also be generated by abiotic processes, and effective evaluation of samples requires prior knowledge of metabolic pathways, restricting its applicability to known life.
Living systems are able to generate complex molecules in a way that is not possible for abiotic systems, and so a complexity-based model is a promising prospect for use as a molecular biosignature. We propose that a good molecular complexity measure for the purpose of life detection should satisfy three criteria. Firstly, the model needs to reflect the pathway of formation of a molecule, providing a correlation with, or a bound on, the likelihood of overcoming the combinatorial explosion of diversity that results from random interactions, thereby distinguishing potentially abiotic molecules from those that required a biological influence to form. Secondly, a good complexity measure needs to be conceptually simple and intrinsic, with minimal external choices required. We cannot take into account all of the rules of chemistry, environmental conditions, and multifaceted interactions that molecules can undergo without generating a complexity model that is too convoluted to use. Additionally, we want to avoid imposing external weightings that do not necessarily correlate consistently with the likelihood of abiotic formation, such as ring counts, or the presence of specific functional groups or heteroatoms. Finally, for use in life detection, there must be a consistent experimental predictor, so that we can analyse unknown molecular samples and determine their complexity.
The determination of molecular complexity has been extensively explored theoretically, with many metrics devised based on structural, topological, or graph-theoretical complexity. These include measures based on specific graph features, such as counts of atoms/bonds (11), distances between atoms in the molecular graph (12, 13), paths through the molecular graph (14) and total walk counts (15), connectivity of atoms (16, 17), number of subgraphs (18), fractal dimensions (19), and information theory based on molecular symmetry (12, 20, 21). Other complexity measures rely on weighting for specific molecular features such as the number of rings, heteroatoms, and properties such as electronegativity (22, 23). Complexity measures have also been proposed which use machine learning (24, 25) and crowdsourcing (26).
No molecular complexity measure proposed to date fully fulfils the criteria we propose above.
There is no complexity measure that we are aware of that captures the potential history of formation of molecules. While some measures incorporate intrinsic features that intuitively add complexity, such as the number of molecular fragments (subgraphs) and atom connectivity, or lower it, such as symmetry, these measures do not track how likely such features are to form by bringing fragments together one step at a time. This allows for a potentially large increase in complexity at low combinatoric cost; for example, connectivity indices could increase dramatically in a single step, or symmetry could quite easily be broken in a single step, resulting in a discontinuity between the complexity measure and the effort required to increase the molecule's complexity. We can also disregard any measures that count specific features, or include current synthetic difficulty, as these are externally weighted and cannot be shown to be useful for life detection in an agnostic sense. This is also true for machine learning and crowdsourced measures, as those models are restricted by the chemistry of life observed so far on Earth, and we have no way of telling if they could be used to threshold life detection in general. Finally, we do not know of any molecular complexity model published to date that has a strong correlation with experimental data. The model proposed in this paper fulfils these three criteria, allowing experimental analysis to indicate a complexity value based on a simple, intrinsic, agnostic measure that can be shown in theory to bound the biological threshold.

Computing Molecular Assembly: algorithm implementation details
In a previous paper (27), we introduced the concept of Pathway Complexity (now renamed Object Assembly) as a model by which we could determine if an object had a biological origin.
Here we review the concept of Object Assembly, with a focus on its application to molecules, and introduce an algorithm used to calculate the Object Assembly Indices of molecular graphs.

Theory of Object Assembly
The Object Assembly Index (OA) of an object is defined in the context of an assembly space (28), which defines how objects can be made from a set of basic building blocks through combination operations. Each point in the assembly space is an object, and an arrow from object A to object C is labelled by another object B, with the implication that A and B can be combined in some predetermined way to make object C (see Supplementary Figure 1). The space is also required to contain the symmetric arrow from B to C, labelled with object A. Traversal along arrows in the assembly space represents a series of joining operations. An Assembly Subspace is a subset of objects and arrows that itself constitutes an assembly space. A subspace that contains the irreducible building blocks of the parent assembly space (is "rooted"), and that contains a target object X, can be thought of as containing a recipe to create X using joining operations. The OA of X is defined as the size of the smallest rooted Assembly Subspace containing X. The OA can be thought of as the minimum number of joining operations required to create X, starting from basic objects, where objects created in the initial steps can be "re-used" in subsequent joining operations. Concepts related to Assembly Spaces and the Assembly Index are formalised in Ref. (28).
Supplementary Figure 1: An assembly space for objects that can be created from blue and white blocks. Some arrows have been omitted for clarity. The label on an arrow represents the object in the space that needs to be combined with the source to make the target. The dashed region represents an assembly subspace, which is the smallest subspace that contains the object made of a row of 4 blue blocks. The assembly index of that object is the number of objects in that subspace, not including basic objects.
Intuitively, the OA of an object is correlated positively with its size, and negatively with the number of repeated and non-overlapping substructures along the minimal pathways. Any such substructures could themselves contain repeated substructures, further reducing the OA recursively. Objects with low OA are those which are small and/or contain internal symmetries, while objects with high OA tend to be large and heterogeneous. An upper bound for the OA of an object of size s is s - 1, based on the fact that it is always possible to construct an object by adding a single basic object at each step. A lower bound can be found by considering that at each step it is possible to join the object created in the previous step to itself; an object created this way in n steps will have size s = 2^n, with n being the minimum possible OA of an object of that size. Therefore, log2(s) is a lower bound for the OA of any object of size s.
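These bounds, and the assembly index itself, can be illustrated with a toy analogue in which the objects are strings and the basic objects are single characters. The following sketch (not the molecular algorithm described later) finds the exact minimum number of joining operations by breadth-first search over sets of built fragments; it is exponential and only practical for short targets:

```python
from itertools import product

def string_assembly_index(target: str) -> int:
    """Exact assembly index of a string: the minimum number of joining
    operations needed to build it from single characters, where any
    previously built fragment can be re-used. BFS over sets of built
    fragments; each BFS level corresponds to one joining operation."""
    if len(target) < 2:
        return 0
    # only fragments that occur as substrings of the target can be useful
    subs = {target[i:j] for i in range(len(target))
            for j in range(i + 1, len(target) + 1)}
    basics = set(target)            # irreducible building blocks
    frontier = {frozenset()}        # states = sets of fragments built so far
    steps = 0
    while frontier:
        steps += 1
        nxt = set()
        for state in frontier:
            avail = state | basics
            for a, b in product(avail, repeat=2):
                s = a + b
                if s == target:
                    return steps    # first hit is minimal (BFS)
                if s in subs:       # prune fragments that cannot occur
                    nxt.add(frozenset(state | {s}))
        frontier = nxt
    raise ValueError("target unreachable")

# heterogeneous string of size 4 hits the s - 1 upper bound
assert string_assembly_index("ABCD") == 3
# repeated doubling hits the log2(s) lower bound: AA, AAAA, AAAAAAAA
assert string_assembly_index("AAAAAAAA") == 3
# internal repetition lowers the index: AB, ABAB, ABABABAB
assert string_assembly_index("ABABABAB") == 3
```

The "re-use" of fragments is what lets "ABABABAB" reach 3 steps rather than the 7 required by character-by-character construction.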
Construction of an object using the object assembly model is designed to mimic the construction of objects through random collisions starting from basic building blocks; determining the minimal pathway indicates, in principle, how many steps this could be accomplished in. This allows us to set a lower bound on how likely the object is to be found in abundance in the context of all the other objects that could have been created through undirected/abiotic interactions.

Molecular Assembly Index
Object Assembly theory has a natural application to molecules, and the model was devised with molecules in mind. Either atoms or bonds can be considered as the irreducible objects, and typically hydrogen depleted representations would be used to reduce computational cost.
In computing the Molecular Assembly (MA), we use a graph-theory based model, considering only the connectivity of atoms and bonds within molecules, restricted by valence rules. A more complex assembly space could be considered whereby atoms and bonds are represented in three dimensions, and a joining step between two structures is only permitted if the resulting structure is chemically feasible. This would be computationally prohibitive with current algorithmic implementations although it may be explored in a future study.
Assembly pathways in the model described here are not representative of molecular synthesis but rather represent what synthesis would be if the complexities of chemistry, other than valence rules, were ignored. If we consider syntheses where all steps are of the form A+B→C, where C is the only product, then the space of synthetic pathways of this type is an Assembly Subspace of the Assembly Space used in our model, containing a subset of the structures and connections between them. In our previous work (28), we have shown that the MA in an Assembly Subspace is an upper bound for the MA in the original space, and hence such synthetic pathways cannot be shorter. In cases where A+B→C+D, or A→C+D, the most complex product will tend to have lower MA than in the case of A+B→C, and so there will be a tendency for steps of these kinds to result in longer synthetic pathways rather than shorter ones. Since most steps in our model will not represent real synthetic steps, with synthesis being significantly more difficult than in our model, we consider the MA of a molecule to be a reasonable lower bound on the shortest synthetic pathway with atoms as starting materials. This opens the possibility that a molecule will have a relatively low MA but could only realistically be synthesised in a much larger number of steps, and hence from a life detection point of view this could result in false negatives. Our intention, however, is that the measure is robust against false positives, allowing us greater confidence in the biological origin of molecules that we find to be above the threshold. Future MA models may incorporate decomposition steps and reactions with multiple products.
The molecular assembly algorithm calculates the split-branch object assembly index, a variant of the Object Assembly Index. This variant was chosen for algorithmic simplicity (28). The split-branch object assembly index of a molecule is an upper bound for the MA of the molecule (28), although there is an offset of 1 between the variants, as the initial step of the assembly index is a joining of two basic objects, whereas the initial step of the split-branch process can be thought of as laying down a single basic object. The split-branch variant can be considered intuitively as forming structures in their own separate environments before bringing them together, and so a substructure used to create one object cannot be used in the creation of a separate object without rebuilding it. In the conventional MA measure, one can think of all the structures forming in the same environment, and such reuse would be permitted. For simplicity, in subsequent paragraphs "MA" should be understood to mean the split-branch molecular assembly index.
We calculate the MA on hydrogen-depleted graphs, and use bonds as basic objects, to reduce computational complexity and allow for simpler representation of molecular fragments. The algorithm takes a molecular graph and calculates all possible connected substructures, grouping together fragments that are identical. This grouping is done by associating fragments with their InChI string, using the InChI API (29), as the InChI string is a canonical representation of a chemical structure (i.e. each chemical structure is represented by a single unique InChI string). Following this, the algorithm searches through partitions of the molecule into non-overlapping substructures, with each unique substructure in a molecule contributing its own MA to the MA of the target, plus 1 for each time it is duplicated. The MA of the substructures is calculated recursively using the MA algorithm, unless it can be determined implicitly due to the substructure size being 3 bonds or fewer (substructures of size 1, 2, and 3 bonds have MA 1, 2, and 3 respectively).
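The recursive partition-and-duplicate scheme can be sketched on strings, with characters standing in for basic objects and direct string equality standing in for InChI-based canonical grouping. This is a simplified analogue of the molecular algorithm (the real implementation operates on molecular graphs in C++, with different base-case conventions), brute-forcing all contiguous partitions; each distinct fragment is built once and every extra copy costs a single join:

```python
def split_branch_ma(s: str, memo=None) -> int:
    """String analogue of the split-branch assembly calculation: minimal
    joins to build s from single characters, where joining k fragments
    costs k - 1 and each *distinct* multi-character fragment is built
    once (duplicates only pay the join). Brute force over partitions."""
    if memo is None:
        memo = {}
    n = len(s)
    if n <= 1:
        return 0
    if s in memo:
        return memo[s]
    best = n - 1                      # upper bound: one character at a time
    # enumerate contiguous partitions via a bitmask over the n - 1 cut points
    for mask in range(1, 1 << (n - 1)):
        frags, start = [], 0
        for i in range(n - 1):
            if mask & (1 << i):
                frags.append(s[start:i + 1])
                start = i + 1
        frags.append(s[start:])
        cost = len(frags) - 1         # joins to chain the fragments together
        for f in set(frags):          # build each distinct fragment once
            if len(f) > 1:
                cost += split_branch_ma(f, memo)
        best = min(best, cost)
    memo[s] = best
    return best

assert split_branch_ma("ABCD") == 3        # no reuse possible
assert split_branch_ma("AAAA") == 2        # build AA once, then AA + AA
assert split_branch_ma("ABABABAB") == 3    # nested reuse: AB, ABAB, target
assert split_branch_ma("AAAACAAAA") == 4   # AAAA (2 joins) + C + duplicate
```

On these examples the split-branch result coincides with the exact assembly index; in general it is only an upper bound, since fragments cannot be shared across separate branches.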
The algorithm uses several methods to reduce the computational expense of the calculation.
Only substructures duplicated at least once are considered in the partitions, with bonds not in those substructures contributing 1 each to the MA. The order of searching through partitions is based on the size and multiplicity of the repeated substructures (e.g. three substructures of size 2, and two of size 4), and minimum/maximum MA values for such partitions can be calculated based on size alone. In this way, substructures can be searched in order of increasing minimum MA, which allows the algorithm to terminate when the minimum MA of a partition based on size/multiplicity is greater than or equal to the best MA value found so far. The algorithm to calculate the MA for molecules was written in C++, compiled with the Boost library and InChI API (29), and currently runs natively on Windows, or on Linux if using WINE. It takes an MDL Mol file as input, and outputs a single integer for the MA, as well as some details of the minimal-length pathway found. The algorithm terminates when a single minimal-length pathway is found, so no information is provided on the number of minimal-length pathways.
Supplementary Figure 5: Schematic examples of the Split-Branch calculation of Molecular Assembly Index for penicillin (top) and tryptophan (bottom).

MA in Chemical Space
In exploring chemical space, we distinguish between theoretical chemical space, being the space of all possible molecules, and extant chemical space, being the space of molecules known to have been discovered or synthesised (and documented). To explore extant (known) chemical space we utilize a subset of the Reaxys® database (30) with molecular weight up to 1000 Daltons, which contains approximately 25 million substances. The distribution of molecular weights from this subset is shown in Supplementary Figure 6. To explore possible (or theoretical) chemical space we generated possible chemical structures using MOLGEN (31).

Possible Chemical Space using MOLGEN
In order to estimate the size of a constrained subset of theoretical chemical space, we used the commercial software MOLGEN (31), which enumerates all structures for a given molecular formula, or formula range. The MOLGEN commands we used took the form "mgen C6H6 -v", which enumerates all structural isomers with molecular formula C6H6. Next, we calculated the total number of possible structures containing only C, N, O, S, and H. We calculated up to 9 non-H atoms due to computational constraints, with the total number of structures being approximately 120 million, shown in Supplementary Figure 7 (right). The number of structures for n non-H atoms was approximately (n + 3)!/6, and assuming an increase at this rate would imply that the number of possible structures for 70 non-H atoms would be approximately 10^100, significantly higher than the estimated number of atoms in the observable universe. Conversely, the number of possible molecules in known chemical space initially increases with size from small molecules before dropping off as molecules become larger, less likely to be found in nature, and more difficult to synthesize. Using Reaxys as a proxy for known chemical space, the number of molecules containing only C, N, O, S and H peaks at about 530k molecules for 24 non-H atoms, before reducing to ~12k for 70 non-H atoms, with no substances having over 82 non-H atoms.
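The (n + 3)!/6 scaling quoted above can be checked directly. The sketch below treats that expression as the stated approximation, not an exact count; the resulting order of magnitude for 70 non-H atoms depends on extrapolating the factorial fit far beyond the enumerated range:

```python
import math

def est_structures(n_heavy_atoms: int) -> int:
    """Approximate count of C/N/O/S/H structures with n non-H atoms,
    using the (n + 3)!/6 scaling fitted to the MOLGEN enumeration."""
    return math.factorial(n_heavy_atoms + 3) // 6

# ~8e7 structures at 9 heavy atoms, consistent with the ~1e8 enumerated total
assert est_structures(9) == 79_833_600

# extrapolated to 70 heavy atoms the estimate exceeds 10^100, far beyond
# the ~10^80 atoms estimated to exist in the observable universe
assert math.log10(est_structures(70)) > 100
```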

Known Chemical Space in Reaxys
A subset of the 25 million molecules in the Reaxys database (30) was analysed using the MA algorithm. The computational complexity of the algorithm prevented analysis of molecules with MA greater than approximately 27, and there is a bias towards successfully calculating molecules with lower MA value at any given molecular weight. The distribution of molecular weights for molecules with successfully computed MA is shown in Supplementary Figure 6. However, it can still be seen that the MA of molecules tends to be more broadly distributed for each molecular weight above an MA of 10-15 (see Supplementary Figure 8). We interpret this spread to indicate that for small (low mass) molecules, the range of MAs for each molecular weight is tightly constrained, due to the limited number of ways small molecules can be constructed; however, this effect is removed for heavier molecules. This means that above a certain mass range, the molecular weight of a molecule and its MA effectively decouple. To confirm this effect was not an artefact of the relatively low representation of high MA molecules in the data, we subsampled the data in Supplementary Figure 8, such that the molecular weight range was sampled uniformly. This subsampled data was used to generate Figure 2A in the main text, which shows that MA is highly constrained by molecular weight for molecules with masses between ~1-250 Daltons. The disparity between the number of known molecules and the number of possible molecules suggests that novel ways of exploring chemical space are required to identify important molecules and processes amongst the chemical noise. Exploring the structure of chemical space using molecular assembly could help identify processes that increase chemical complexity and generate molecules for material design, drug discovery and processes critical to artificial life.
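The uniform-by-weight subsampling step can be sketched as follows. This is a hypothetical helper illustrating the idea (bin width, per-bin count, and the record format are our own choices, not the values used to produce Figure 2A):

```python
import random
from collections import defaultdict

def subsample_uniform_by_weight(records, bin_width=10.0, per_bin=100, seed=0):
    """Resample (molecular_weight, ma) records so that molecular weight is
    approximately uniformly represented: bin by weight, then draw at most
    `per_bin` records from each occupied bin."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for mw, ma in records:
        bins[int(mw // bin_width)].append((mw, ma))
    out = []
    for recs in bins.values():
        out.extend(rng.sample(recs, min(per_bin, len(recs))))
    return out

# toy data: an over-represented low-mass bin and a sparse high-mass bin
records = ([(100.0 + 0.1 * i, 5) for i in range(50)]
           + [(300.0 + 0.1 * i, 20) for i in range(10)])
out = subsample_uniform_by_weight(records, bin_width=50.0, per_bin=20)
assert len(out) == 30   # 20 from the dense bin, all 10 from the sparse bin
```

This removes the over-representation of low-MA, low-mass molecules before examining the MA-versus-weight trend.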

Random decision tree model of molecular synthesis
Our goal is to use MA to distinguish biological artefacts from abiotic chemical products. To accomplish this, we must be able to determine a threshold MA for biological artefacts above which any reliable synthesis must be due to biological processes. To estimate a range of values for this threshold we explored the statistical properties of assembly pathways by computationally modelling the molecular assembly process as a random walk on directed trees.

Model Description and Parameterization
In this model the root of the tree corresponds to abiotically available precursors, while the number of leaves on the root corresponds to the number of possible combinations of those precursors. Each node in the tree (besides the root) corresponds to a molecule which could be synthesized from the abiotically available precursors. The depth of a given node (the shortest number of steps between it and the root) corresponds to the MA of that compound given those precursors (see Figure 1B in the main text). We label the breadth of the tree at depth d as b_d.
Using this model, we can calculate the likelihood of different assembly pathways and determine an MA which corresponds to a sufficiently low likelihood of spontaneous formation. These likelihoods will depend on the properties of the decision tree used. For example, if we assume that the breadth of the tree is a constant b for all depths, and that for each node all leaves are equally likely, then the likelihood of any molecule with an MA index of m will be P(m) = b^-m. These two assumptions, that the number of leaves (and therefore the number of possible reactions) for any given molecule is constant and that all leaves are equally likely, are unrealistic for most chemical systems. Organic molecules can interact to form many different products. Which product forms depends on their structural and thermodynamic properties, and the likelihoods of those products can vary dramatically, often spanning several orders of magnitude. We relax these assumptions and compute bounds on the likelihood of the most likely assembly pathway in the decision tree. We introduce two parameters, described below, which control the properties of the tree. The first parameter, h, controls the relative likelihood of possible transitions. The other is a function which controls the expected number of possible transitions at each point in the synthesis process; we test a linear function and a power-law function as two examples with different behaviour.
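The constant-breadth, equal-likelihood baseline P(m) = b^-m is a one-liner; b = 10 below is an arbitrary illustrative breadth, not a value used in the simulations:

```python
import math

def uniform_path_prob(b: int, m: int) -> float:
    """Probability of one specific assembly pathway of length m in a tree
    with constant breadth b and equally likely choices: P(m) = b**-m."""
    return b ** -m

assert uniform_path_prob(2, 3) == 0.125
# with 10 equally likely choices per step, a 30-step pathway has
# roughly a one-in-10^30 chance of being followed by an undirected walk
assert math.isclose(uniform_path_prob(10, 30), 1e-30)
```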

Results and Interpretation
Our goal is to determine the probability of the most likely pathway through a decision tree. To provide robust estimates, in what follows we simulated 1 million decision trees under different assumptions, and for each tree we calculated the most likely pathway. Given this distribution of likelihoods for each set of conditions, we report a conservatively high probability by using the 99th percentile of the distribution.
We first relaxed the assumption that all leaves of a given node are equally likely. The simplest way to do this is to assume that the likelihood of each choice is drawn from a uniform random distribution between 0 and 1 and normalized such that the sum of the likelihoods is one. Under these conditions we find that the resulting probability does not change significantly from the case where all choices are equally likely, following the trend of P(m) = b^-m. The uniform random distribution over choices implies that all future choices have likelihoods that are of the same order of magnitude; in this sense the distribution over choices is very homogeneous. We next investigated more heterogeneous distributions.
An obvious way to generate distributions over choices which vary by orders of magnitude is to first draw values x from a uniform distribution between 0 and a value h, and then assign the likelihoods of choices as 10^x (with the likelihoods once again normalized to sum to one). Examples of distributions generated using this method are shown in Supplementary Figure 9. By varying the value of h we can investigate more or less heterogeneous distributions. Under these conditions we find that the more heterogeneous the distribution, the more likely the most likely path through the tree becomes. This is an unsurprising result; the effect of very heterogeneous distributions is to funnel the probability towards a limited subset of all possible paths at each step. Given these findings we continued to use the heterogeneous distributions with h = 4 for the remaining simulations, which we believe captures the appropriate degree of bias.

We can further relax the assumption that the number of leaves at a given depth is constant. First, we considered the case where the number of choices at a given depth is a random integer drawn uniformly between 2 and 26. This did not significantly alter the results of the previous simulations. In general, we expect the number of choices to grow as a function of the depth of the node. We model this growth in two ways: with a linear function of a given slope, and with a power law with an exponent α. For the linear case with a slope of 3 we find that molecules with an MA index of 30 have a likelihood of 1 in a mole (~10^-23), while in the power-law case with an exponent of 3, molecules of MA index 15 have a similar likelihood, as shown in the main text Figure 1C. Given these computational results we suggest that an appropriate threshold for MA is likely to be within the range of 15-30. Implicit in this model is the assumption that all precursors and intermediate structures are not only available but nearly infinite in concentration.
This is a generous approximation. Relaxing this assumption would only serve to universally decrease the probabilities calculated here, which means that our model is overestimating the likelihood of the most likely pathway. We do not make any assumptions about the effect of atomic composition or bond stability in the model presented here. This means that the results generalize beyond known biochemistry to any assembly process that can be represented by the recursive joining of structures.
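The simulation scheme above can be sketched as follows. This is an illustrative re-implementation, not the authors' code: it greedily follows the locally most likely branch at each level (a lower bound on the probability of the globally most likely path), uses far fewer trees than the 1 million simulated above, and the exact numbers will differ from those reported:

```python
import math
import random

def greedy_path_log10prob(depth, h, breadth_fn, rng):
    """Follow the locally most likely branch at each level of a random
    decision tree. Choice likelihoods at each node are proportional to
    10**Uniform(0, h), normalized to sum to one over that node's leaves."""
    logp = 0.0
    for d in range(1, depth + 1):
        k = max(2, breadth_fn(d))                     # leaves at this depth
        weights = [10 ** rng.uniform(0.0, h) for _ in range(k)]
        logp += math.log10(max(weights) / sum(weights))
    return logp

rng = random.Random(0)
linear = lambda d: 3 * d                              # linear growth, slope 3
samples = sorted(greedy_path_log10prob(30, 4.0, linear, rng)
                 for _ in range(2000))
p99 = samples[int(0.99 * len(samples))]               # conservatively high
print(f"99th-percentile log10 P(best path) at depth 30: {p99:.1f}")
```

Setting h near 0 recovers the near-uniform regime, where the best-path probability falls back towards the b^-m trend.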

MS/MS and MA
In order to identify molecules which are produced via biological processes we need a method to experimentally determine the Molecular Assembly Index of any compound. We chose to investigate MS/MS as an analytical method to determine the MA of molecules based on the hypothesis that the fragmentation pattern of molecules should be closely related to the recursive decomposition used to determine the MA of the molecular graphs. Here we describe the analytical procedure we developed to experimentally explore the relationship between fragmentation and MA of molecules.

Ionization source
In the work presented here we chose to analyse all of our samples using an electrospray ionization (ESI) source for mass spectrometry. There are of course different ionization methods that could affect the analysis. ESI was chosen in this case because it is a standard ionization method for most omics-based approaches and in the analysis of complex mixtures (REF). The most significant difference between the ESI used here and other sources is that some molecules would be more readily ionized using different sources. Our analysis of over 100 compounds suggests that many molecules, particularly small organics, ionize well enough for detection using ESI.
In the context of life detection experiments future work could be done to test different ionization methods based on mission objectives and predictions about the extra-terrestrial conditions from which samples will be collected.

Resolution
It is important that we do not make any assumptions about what we are looking for in terms of elemental composition. We are just concerned with the intrinsic complexity of the molecule as determined by the fragmentation. Our technique does not need to identify any elemental composition: the MA is estimated as a function of the number of fragments. Therefore, the resolution is only set to ensure we can select a single nominal mass for fragmentation. This is a key point of our approach, meaning it can search for highly complex and unknown molecules.

Collision Energy
In routine bioanalysis of a particular analyte it would be proper to fully optimize the instrumentation for the analyte of interest. We investigated the effect of collision energy on the known molecules selected for our standard curve, using normalized HCD collision energies of both 35% and 45%. As expected, at higher collision energies more MS2 product ions are produced, although across all the molecules tested the correlation between MA and MS2 product ions was similar. We decided to cautiously use 35% for the environmental samples, because at this energy some molecules are not able to fragment completely. In the environmental samples this would lead to false negatives, which we have already accepted as part of our approach.

Direct Injection and Isomeric Species
Using direct injection makes our experiments more directly comparable to potential space missions. However, it also raises the possibility of co-fragmentation of isomeric species, potentially leading to overestimates of the number of fragments associated with any one ion. In order to test for this, we reanalysed one sample that is notoriously complex from an analytical perspective, the Murchison meteorite sample, using LC-MS. Because this analysis was done for a slightly different set of experiments, it was run with a higher mass range; to compare it to the results presented here, only parent masses in the 300-500 m/z range were kept. The results of this analysis identified multiple isomeric parent masses that would have been observed simultaneously in the direct injection sample; however, the final analysis demonstrated the LC-MS results were consistent with our direct injection results (see SI Figure 20).
The LC-MS procedure is as follows: analyte was infused into the mass spectrometer using a Thermo Heated ESI (H-ESI) source with a +3.8 kV voltage applied. The mass spectrometer was run with a DDA method, with a mass range of 150-1750 m/z. The 20 most intense ions were selected for fragmentation with dynamic exclusion, with ions excluded for 30 seconds if detected twice within 30 seconds. The 30 second period was chosen to match the expected LC peak width. HCD fragmentation was set at a fixed 35%. Separation was achieved using an EC-Poroshell C18 reverse-phase column (dimensions: 150 mm x 4.6 mm, particle size: 2.7 µm). A gradient method was applied at a fixed total flow rate of 0.5 mL min-1. Total run time was 55 minutes. The timetable for the gradient method is below:

Time (min)   %A   %B
0            80   20
15           60   40
25           5    95
52.9         5    95
53.0         80   20

Data collection using Orbitrap
Supplementary Figure 12: Similar preparation procedures were applied to all the samples; slight variations in sample preparation were used due to the nature of different samples, and these differences are documented in the text below. Prepared samples underwent identical analysis with a 15 μL injection from an Advion Nanomate, followed by an MS1 full scan and MS2 fragmentation on a Thermo Fusion Lumos Orbitrap. The MS data was then processed and analysed.
Samples were analysed using a Thermo Fusion Lumos Tribrid Orbitrap, which is capable of multiple rounds of fragmentation at high resolution when ions are scanned in the Orbitrap after HCD fragmentation. By comparing data from fragmented molecules to the calculated MA of the associated molecular graph, we were able to uncover a correlation between MS fragmentation data and MA, allowing us to experimentally determine the MA of unknown environmental samples.
Molecules were solubilised in MeOH:H2O as much as possible and introduced to the Thermo Fusion Lumos Tribrid Orbitrap mass spectrometer via an Advion Nanomate. 15 μL of sample was injected onto an emitter with a +1.2 kV voltage applied. Samples were analysed for 6 mins, during which a Single Ion Monitoring (SIM) scan was performed for the specific molecule's m/z, followed by MS2 fragmentation. Both the SIM and MS2 fragment ions were analysed in the Orbitrap with HCD fragmentation set at 35% and 45%. The isolation window was set at 0.5 Da, the resolution of the SIM scan was 240000, and the MS2 resolution was 30000.
MS data were converted into mzML files using MSConvert (32). In-house scripts were then used to convert the mzML files into JSON peak lists, with all MS1 peaks collected for each m/z over the 6-minute analysis merged together. Spectra with a maximum intensity under 50,000 were discarded, and for those remaining, all peaks within 0.01 Da of one another were merged. All MS2 peaks not present in at least 25% of the MS2 spectra from the corresponding MS1 parent were disregarded. Since our theoretical measure does not include hydrogen atoms, any peak within ±1.0 Da of another peak was merged with it, reducing the overcounting of ions that differ only by one hydrogen atom. The remaining MS2 peaks were counted, and this number was used together with the calculated MA of the molecular graph associated with the MS1 peak to generate the correlation.
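The merging and thresholding steps above can be sketched as follows. This is a minimal illustration, not the in-house scripts themselves; the function names are ours, while the 50,000 intensity cutoff, 0.01 Da merge window, and ±1.0 Da hydrogen merge are the values quoted in the text.

```python
def merge_peaks(peaks, tol):
    """Merge m/z peaks closer than `tol` Da, keeping the summed
    intensity at the intensity-weighted mean m/z.
    `peaks` is a list of (mz, intensity) tuples."""
    merged = []
    for mz, inten in sorted(peaks):
        if merged and mz - merged[-1][0] <= tol:
            pmz, pint = merged[-1]
            total = pint + inten
            merged[-1] = ((pmz * pint + mz * inten) / total, total)
        else:
            merged.append((mz, inten))
    return merged

def process_spectrum(peaks, min_intensity=50_000):
    """Discard weak spectra, merge near-isobaric peaks, then merge
    +/-1.0 Da (one hydrogen) neighbours, mirroring the thresholds
    described above."""
    if not peaks or max(i for _, i in peaks) < min_intensity:
        return None                       # spectrum discarded
    peaks = merge_peaks(peaks, tol=0.01)  # close-peak merge
    peaks = merge_peaks(peaks, tol=1.0)   # hydrogen-difference merge
    return peaks
```

The surviving peaks would then simply be counted (`len(process_spectrum(...))`) to give the MS2 peak number used in the correlation.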
Environmental samples were each analysed under the same ionisation conditions, without chromatography or other separation techniques. The mass spectrometer was run with a Data Dependent Acquisition (DDA) method which fragmented the 15 most intense ions, using a dynamic exclusion of 30 s if the analyte was detected twice within 10 s, over a mass range of 300-500 m/z. Given the number of ions in the complex environmental samples, we also filtered out peaks from the MS2 spectra that were below 10% of the highest peak in that spectrum. All other parameters were as above. During the analysis of the complex environmental samples we noticed that, despite the high resolution of the mass spectrometer, co-fragmentation was producing excessively high numbers of MS2 peaks by effectively merging different MS1 parent ions into the same MS2 spectra. To account for this phenomenon, the analysis method was amended: after counting the number of MS2 peaks for each selected parent ion, the algorithm checked the MS1 spectra for peaks within 0.5 Da of the parent ion, and the total number of MS2 peaks was divided by the number of MS1 peaks found within 0.5 Da of the parent mass. This accounts for co-fragmentation by dividing the number of MS2 peaks across the total number of identified unique ions present in the collision cell during fragmentation. This method was used for all samples and was found not to affect the previous results for single ions.
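The co-fragmentation correction above reduces to a simple division; a sketch (function and variable names are ours, the 0.5 Da window is from the text):

```python
def cofragmentation_corrected_count(parent_mz, ms1_mzs, ms2_peak_count,
                                    window=0.5):
    """Divide the MS2 peak count for a selected parent ion by the
    number of MS1 peaks falling within `window` Da of the parent mass,
    since all of those ions may enter the collision cell together."""
    coisolated = sum(1 for mz in ms1_mzs if abs(mz - parent_mz) <= window)
    # The parent itself always lies in the window, so coisolated >= 1.
    return ms2_peak_count / coisolated
```

For example, a parent at 400.2 m/z with one co-isolated MS1 neighbour at 400.5 m/z would have its 30 MS2 peaks counted as 15.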
In principle, isomeric compounds with the same formula could enter the collision cell simultaneously. This is an important possibility because if they generated two distinct fragmentation spectra that were superimposed, the result would contain more peaks than expected, potentially causing false positives with our method. Our analysis addresses this issue in two ways. First, the chance of two completely different structural isomers having abundances of a similar order of magnitude is very small, so the 10% intensity threshold will remove any fragment ions from isomers whose abundance is approximately one order of magnitude less than that of the dominant ion. Second, different structural isomers will have different bonding energies, and the degree of these differences is critical: molecules with very similar bonding energies are likely to share more structural motifs, which would generate the same fragments, while the fragmentation of two (or more) molecules with very different bonding energies will occur unreliably due to a random distribution of the energy between different ions. By requiring that peaks occur in 25% or more of the scans, we remove those fragments which occur unreliably.
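The consensus filter invoked above (keep only fragments present in at least 25% of the repeat MS2 scans of a parent) can be sketched as follows; the two-decimal m/z binning is our simplification for illustration, not the paper's merging procedure:

```python
from collections import Counter

def consensus_fragments(scans, min_fraction=0.25, ndigits=2):
    """`scans` is a list of MS2 scans, each a list of fragment m/z
    values. Returns the fragment m/z values present in at least
    `min_fraction` of the scans, discarding unreliable fragments."""
    counts = Counter()
    for scan in scans:
        # Deduplicate within a scan so one scan contributes at most once.
        for mz in set(round(m, ndigits) for m in scan):
            counts[mz] += 1
    cutoff = min_fraction * len(scans)
    return sorted(mz for mz, n in counts.items() if n >= cutoff)
```

A fragment appearing in only one of eight scans falls below the 25% cutoff and is dropped.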
Supplementary Figure 13: Correlation of known molecular assembly indices with the number of peaks observed in MS2. Molecules were analysed multiple times and the number of MS2 peaks was averaged for each unique molecule. Results were plotted and a linear relationship was fitted. Top shows a combined plot with all molecules. Bottom left shows the plot and correlation for only peptides, and the bottom right shows the plot and correlation for only small organics.
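The calibration in Supplementary Figure 13 amounts to an ordinary least-squares line through (MS2 peak count, calculated MA) pairs, which can then be inverted to estimate the MA of an unknown. A minimal sketch with purely illustrative numbers (not the paper's fitted coefficients):

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Hypothetical calibration data: (MS2 peak count, calculated MA).
counts = [5, 10, 15, 20, 25]
mas = [8, 13, 18, 23, 28]
a, b = fit_line(counts, mas)

def estimate_ma(peak_count):
    """Estimate MA for an unknown sample from its MS2 peak count."""
    return a * peak_count + b
```

In practice separate fits could be made for different compound classes, as the figure does for peptides versus small organics.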
Supplementary Figure 14: Mass Spec Thresholding Procedure for MS1 and MS2 data using Data Dependent Acquisition. This workflow shows the different thresholding procedures used on the environmental samples. In the MS1 spectra the top 15 most intense peaks were selected for fragmentation. Those peaks were checked for co-fragmentation and the number of ions within the selected window was recorded. For each selected MS1 parent ion the MS2 spectra were filtered to remove bad fragmentation (using the maximum intensity threshold), peaks were merged if they were within 0.01 Dalton of one another, and any MS2 peaks which occurred in less than 25% of spectra were removed.

Sample Prep Details
Having established our ability to experimentally determine the MA of molecules using tandem MS, we next sought to directly test our hypothesis that high-MA molecules can only be produced by living systems. To do this we prepared and analysed several mixtures, including those sourced from abiotic, live biological, and dead/degraded biological sources. Each sample was prepared with a similar procedure, with the only significant differences arising from the different nature of the samples. The details regarding the preparation of those samples are listed below.

Yeast
A solution of sucrose was added to 1 g of commercially available baker's yeast and allowed to activate at room temperature (18 °C) overnight. On observation of carbon dioxide bubbles, the yeast was centrifuged at 15115 x g for 10 mins. The supernatant was discarded, and the pellet was split into 4 samples. One sample was labelled native, and 1 mL of methanol was added followed by 30 mins of sonication. The other three samples were analysed by thermogravimetric analysis (TGA) at three different temperatures: 200 °C, 400 °C and 600 °C (Supplementary Figure 15). The charred samples were then extracted by sonicating for 30 mins in methanol, and all four samples were filtered prior to mass spectrometry analysis.

Bacterial Lysate
The bacterial culture was centrifuged for 10 minutes at 4 °C to form a cell pellet, which was washed twice with 50-100 mL of ice-cold water. The wet pellet was then resuspended in water to a final concentration of ca. 1 g/mL. Mechanical cell lysis by bead beating was used to avoid any chemical or enzymatic interference. In a bead-beating tube, 500 µL of cell suspension was mixed with 500 µL of water and run on the bead beater for 30 seconds, followed by incubation on ice for another 30 seconds; this process was repeated 10 times. Samples were centrifuged at 4 °C for 3 min before removing the supernatant and centrifuging the supernatant again for 60 minutes. The resulting supernatant was collected and stored at −80 °C for further analysis; we were able to use a fraction of this lysate originally intended for other studies.

Urinary Peptides
Pooled human urine was mixed 50:50 with 2 M urea, 10 mM NH4OH and 0.02% sodium dodecyl sulfate. Samples were filtered with a Centrisart filter (20 kDa cutoff; Sartorius, Göttingen, Germany). The filtrate was desalted on a PD-10 column (GE Healthcare Bio-Sciences, Uppsala, Sweden). The processed urine was dried and stored at 4 °C before use. The urinary peptides sample was reconstituted in 500 µL H2O before injection into the mass spectrometer.

Rock and Soil Samples
Coal, serpentine, sandstone, limestone, granite, quartz and clay were separately crushed in a rock crusher and sieved to <0.25 mm. Rocks were supplied by Richard Tayler Minerals (Surrey, UK). 1 mg of rock dust was submerged in 1 mL of MeOH overnight at room temperature, then centrifuged at 15115 x g for 10 mins; the resultant supernatant was removed and filtered through Whatman paper. The eluent was loaded onto a 96-well plate and analysed by mass spectrometry.

Beer
Home brewed beer courtesy of Dr James Ward Taylor was mixed 50:50 with MeOH. Samples were then loaded onto a 96 well plate and injected into the mass spectrometer.

Dipeptides
1 mg of Alanine/Arginine (Dipeptide 1) and Glycine/Arginine (Dipeptide 2) were weighed and reconstituted in equal parts MeOH:H2O. These dipeptides were loaded onto a 96 well plate and injected into the mass spectrometer.

Formose Reaction Mixtures
The formose reaction (33) was carried out by adding formaldehyde (0.5 mL), glycolaldehyde (0.0126 g), water (4.5 mL) and calcium hydroxide (0.0705 g) to a 22 mL borosilicate glass vial. The mixture was stirred at 1200 rpm with a magnetic stirrer and heated at 50 °C for 48 hours. Two types of experiments were done: 1) a typical one-pot control, and 2) the reaction in the presence of mineral samples (see ref (33)).

Miller-Urey Spark-Discharge mixture
A typical Miller-Urey spark-discharge experiment (34) was carried out in the following fashion: 400 mL of HPLC-grade water was placed in the reaction flask, which was degassed, evacuated and then pressurised to 1 atm with a gas mixture of 40% methane, 40% ammonia and 20% hydrogen. The reactor was heated and a 24 kV spark discharge was applied in an alternating 10-second on/off duty cycle. The experiment was continued for 7 days, after which the system was flushed with nitrogen gas and the product mixture was removed.

Whisky
Whisky was kindly donated by group members and The Jar Troon Whisky Specialists. Samples of a 10-year-old Ardbeg and a 25-year-old Glengoyne were diluted 1:50 with LC-MS grade H2O before being loaded onto a 96-well plate, injected, and analysed using the same methods as previously described.

Taxol
Taxol (paclitaxel) was purchased from Sigma (Cat. #T7402, Lot #MKBZ4464V) and solubilised in MeOH to a concentration of 1.5 mg/mL. This solution was injected into the mass spectrometer.

Carbonaceous chondrite (Murchison Extract)
In order to test the MA of an extraterrestrial sample, we used a portion of the Murchison meteorite originally from the Chicago Field Museum. This meteorite had been stored in a 600 mL parafilm-covered, sealed glass beaker for an unknown period of time (many years) inside a glass desiccator containing both P4O10 and CaCl2 desiccants at the University of Chicago. Recent analyses (35) have revealed contamination, so these results are not pristine; however, this sample offers a unique natural specimen for the assessment of novel analytical procedures.
For the purposes of method development, a ~4 g portion of the Chicago Murchison sample was powdered in an ashed ceramic mortar and then extracted, first with methanol (three times, 10 mL each) and then with dichloromethane (twice, 10 mL each), by ultrasonication for 30 min at ambient temperature (36). To remove excess sulphur species, the solvent extracts were combined over copper pellets that had been freshly treated with 0.1 M HCl (to remove CuO) and then washed with water, methanol, and dichloromethane. The total volume was gently reduced by 80% on a spinning band column at 40 °C before Orbitrap analysis by direct infusion.

Marine Sediment
A standard reference material (SRM 1941b) for Organics in Marine Sediment was obtained from the National Institute of Standards and Technology (U.S. Department of Commerce, Gaithersburg, MD). These sediments were collected from the Chesapeake Bay (39º 12.3'N and 76º 31.4'W) and freeze-dried, sieved to 150 µm, homogenised, and then radiation-sterilised with 60Co before dispersal. This SRM is intended for the evaluation of methods for the analysis of polycyclic aromatic hydrocarbons, polychlorinated biphenyl congeners and chlorinated pesticides, among other similar contaminants, and has been thoroughly characterised, with analyses published widely ((37, 38) among others). Extraction of this sample followed as above (SI-6.10), except that solvent reduction was achieved by gentle nitrogen blow-down at ambient temperature.

Holocene Paleomat
This sample was collected in 2016 from the upshore sediments of Lake Vanda (77º 31.2'S, 161º 38.3'E) in Antarctica (39). The paleomat was excavated from beneath ~5 cm of sediment using ashed copper utensils, collected into sterile cryotubes and placed directly into a charged liquid nitrogen cryoshipper. Paleomats of this type are thin, desiccated remnants of ancient benthic microbial mats. The paleomat record in the Lake Vanda valley dates to millennial time-scales, spanning ~30,000 years, and represents one of the few sources of organic carbon in local soils. Extraction of 2.0 mg of this sample followed as above (SI-6.10), except that the sulphur capture step was omitted.

Mid-Miocene Lakebed Sediment
This sample was collected in 2016 from a small basin near Mount Boreas in the western Olympus Range (77º 28.42'S, 161º 10.2'E) in Antarctica (40). These unconsolidated, planar lacustrine beds contain mixed fossiliferous mat, and individual samples were excavated from fine sands just beneath the surface of a rocky terrain. Sediments contained obvious fragments of fossilised moss material and have also been found to contain benthic diatoms and ostracodes, as well as biomarkers from typical lacustrine microbial communities. Samples were excavated using ashed copper utensils, collected into sterile cryotubes and placed directly into a charged liquid nitrogen cryoshipper. 40Ar/39Ar dating of in situ ashfall layers indicates an age of 14.07 ± 0.05 Ma. Diatom stratigraphy indicates that this lake persisted for perhaps thousands of years before burial by washed material from nearby Mount Boreas. Extraction of 2.0 mg of this sample followed as above (SI-6.10), except that the sulphur capture step was omitted.
Seawater Extract
100 mL of deep-sea water was evaporated in a round-bottom flask. The remaining 'salt crust' was resolubilised in 5 mL of 100% methanol. This was left on the bench for 2 hours to settle, after which 250 µL of the supernatant was centrifuged at 15115 x g for 15 mins. The supernatant was then collected and 20 µL was injected onto a Vanquish UPLC connected to the Thermo Fusion Lumos.

Aeromonas veronii (External Data)
Mass spectra from a sample of Aeromonas veronii were downloaded from MetaboLights (41). This data was acquired on a TripleTOF 6600 (Sciex). The downloaded .wiff files were converted to mzML using MSConvert, with an index written and 64-bit binary encoding precision selected. In addition, charge states 1-4 were included and the MS1 and MS2 levels were selected. We then processed the data through our analysis pipeline, with the additional step of only selecting ions in the 300-500 m/z mass range. This data was collected via LC-MS/MS and is therefore slightly different from the data collected on our instrument. Accordingly, we selected only the top 15 highest-MA peaks from this data for inclusion in the analysis shown in the main text. All the data is shown below.
Supplementary Figure 16: All samples, including all the data from the Aeromonas veronii sample collected from an online repository (41) and the seawater sample, which was run with a column attached.
Because this was a preliminary investigation into the use of MS spectra with neural networks to predict MA, the test dataset was also used for validation. In general this is bad practice, as there is a significant risk of selecting hyperparameters that exclusively fit the test dataset, but for an initial investigation into whether the CNN could predict with any degree of accuracy, omitting a separate validation set to allow more data to be used for training seemed the best option. The test/validation dataset consisted of 624 images, with 78 images per MA value. There was insufficient data to train the network at a higher MA range than this. Training was carried out until the test/validation MAE stopped decreasing, 67 epochs in total. The training dataset was updated every three epochs with new MS spectra not contained in the test/validation dataset.

Inferring MA with Convolutional Neural Networks
The confusion matrix and error histogram in Supplementary Figure 29 show that the MS CNN predicted MA with some degree of accuracy. The maximum absolute error was 6 (out of a maximum possible of 7), and the model's percentage of perfect predictions on the test set was 42.8%. Other models based on SMILES strings and images of molecular structures performed somewhat better; the MS-based model was limited in accuracy and MA range by the limited amount of training data available, and improved performance should be possible with more training data.
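For orientation, the overall shape of such a classifier can be sketched as an untrained single-convolution forward pass in NumPy. The 32x32 input size and four 3x3 filters are our assumptions for illustration only, not the architecture used here; the eight output classes correspond to the eight MA values implied by the 624 = 8 x 78 test images.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(img, kernel):
    """Naive 'valid'-mode 2D cross-correlation of an HxW image
    with a kxk kernel (the CNN convention for convolution)."""
    k = kernel.shape[0]
    h, w = img.shape[0] - k + 1, img.shape[1] - k + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + k, j:j + k] * kernel)
    return out

def forward(img, kernels, dense_w):
    """Conv -> ReLU -> 2x2 max-pool -> flatten -> dense class scores."""
    feats = []
    for ker in kernels:
        fm = np.maximum(conv2d_valid(img, ker), 0.0)   # ReLU
        fm = fm[: fm.shape[0] // 2 * 2, : fm.shape[1] // 2 * 2]
        pooled = fm.reshape(fm.shape[0] // 2, 2,
                            fm.shape[1] // 2, 2).max(axis=(1, 3))
        feats.append(pooled.ravel())
    x = np.concatenate(feats)
    return dense_w @ x                                  # 8 class logits

img = rng.random((32, 32))                 # stand-in spectrum image
kernels = [rng.standard_normal((3, 3)) for _ in range(4)]
dense_w = rng.standard_normal((8, 4 * 15 * 15))
logits = forward(img, kernels, dense_w)
predicted_ma_class = int(np.argmax(logits))
```

A trained model would learn `kernels` and `dense_w` by minimising a classification loss over the rendered spectra, as described above.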