Untargeted high-resolution paired mass distance data mining for retrieving general chemical relationships

Untargeted metabolomics analysis captures chemical reactions among small molecules. Common mass spectrometry-based metabolomics workflows first identify the small molecules significantly associated with the outcome of interest, then begin exploring their biochemical relationships to understand biological fate or impact. We suggest an alternative by which general chemical relationships including abiotic reactions can be directly retrieved through untargeted high-resolution paired mass distance (PMD) analysis without a priori knowledge of the identities of participating compounds. PMDs calculated from the mass spectrometry data are linked to chemical reactions obtained via data mining of small molecule and reaction databases, i.e. ‘PMD-based reactomics’. We demonstrate applications of PMD-based reactomics including PMD network analysis, source appointment of unknown compounds, and biomarker reaction discovery as complements to compound discovery analyses used in traditional untargeted workflows. An R implementation of reactomics analysis and the reaction/PMD databases is available as the pmd package.


Response to Reviewers' Comments
We thank the reviewers for their insightful comments. We have substantially revised the manuscript on their basis, and have detailed the changes made in response to each comment below. We have also adjusted the manuscript and supplementary information structure in line with the requirement for Communications Chemistry.

Reviewer #1:
The manuscript "Reactomics: using mass spectrometry as a reaction detector" by Miao Yu and Lauren Petrick describes the concept of mining specific m/z related to biotransformations of specific molecules in high resolution full mass scan data. As the authors point out, the knowledge of these existing relationships is not novel, but they were able to conceptualize it well and apply it by compiling bioreactions in KEGG, by re-analyzing data and by constructing networks using their proposed workflow. This research is a valuable and important approach considering that the current methods of matching ions at databases are deprived of any chemical intelligence.
Even though the authors offer a R package which is accessible for researchers with bioinformatics skills, overall the manuscript is missing details that would allow colleagues to use it as a reference and use this approach to analyze their data. To improve the communication and understanding of this manuscript here are comments for the authors to consider.
1. The supplementary files should include the data mentioned in the manuscript and used for the figures and calculations. 2. There is a whole field that uses mass spectrometry to monitor medicinal chemistry reactions for synthetic purposes. It seems to me that it would be more appropriate to use terms such as bioareactions and bioreactome. Even though reactions in bulk or inside organisms share mechanisms, the proposed approach and applications is focused on living organisms.

Reply:
We agree that reactomics can be used in many fields. As such, we have clarified the name to 'PMD-based reactomics' to reflect that its use is not limited to biological applications. For example, an important application for PMD-based reactomics is in the field of environmental sciences. Environmental chemical reactions can be either biologically related (e.g. methylation or hydroxylation) or driven by environmental factors such as photochemical or pyrolysis reactions. Unlike KEGG, databases for those reactions are not readily available, which limits our ability to demonstrate the potential of reactomics in environmental processes. However, the concept can be readily applied to all possible chemical reactions as long as they can be detected using mass spectrometry. We have added or modified related sentences to clarify this in the manuscript including at the end of introduction (lines 86-88), in the discussion section of TBBPA metabolites (lines 118-120), as well as in the abstract.
3. Line 72 "However, these calculations of PMD were used to identify compounds or pathways in ultimately facilitate interpretations…" Did the authors mean "…and ultimately facilitate interpretations …"?
Reply: Thanks, this has been corrected (line 72).
4. Matrix between lines 118 and 119: please add the monoisotopic masses of the compounds mentioned so that the reader can follow the matrix calculations.

Reply: We have added the monoisotopic masses to matrix 2 [M2]
5. Lines 126-127: "In this case, only one value will be kept as reaction PMD.". Isn't it "reaction minimum PMD"?
Reply: Yes, this was clarified as suggested (line 251).
6. Lines 130-131: "Here, we describe it as an elemental composition instead of chemical formula, because it also describes the gain and loss of elements, …" Adding this information to the matrix (lines 118 and 119) as a supplementary materials would be useful for understanding.
Reply: This has been added to the supplementary information (Supplementary Notes,MS1) 7. Overall, figure captions are too short and uncomplete. They should be self-explaining.

Reply:
We agree, and have gone through and added additional detail to each figure caption.

We plotted the number of unique chemical formulas as a function of PMD frequency cut-off for endogenous (red), exogenous (blue), and both endogenous and exogenous compounds (black) from the 617 T3DB compounds. As can be seen, the endogenous compounds show a different increasing rate of unique formula with increasing PMD cut-off (N) than the exogenous compounds. In order to select a cut-off, a balance must be met between too large of a cut-off, where all compounds will be connected to one network without separation and result in the inclusion of random PMDs among compounds, or too small of a cut-off, where the connections among compounds with chemical reaction meanings are lost. In the figure below, after the top 13 PMDs, the number of unique formulas for endogenous compounds begin to plateau.
Similarly, the first plateau for the number of unique formulas for exogenous compounds occurs after the top 12 PMDs. Combining the endogenous and exogenous compounds together, the total unique formula numbers start flattening after 12 PMDs. These first plateaus suggest a cut-off estimate of ~10 PMDs is appropriate, which is also consistent with the top 10 KEGG PMDs used in the related discussion.
We have also supplied the code that allows generation of the network from 1 up to 50 top PMDs for independent validation (see line 120-141 in RSI.r).
As recommended, the new caption for figure 2 has been changed to:

Figure 2. PMD network of endogenous and exogenous compounds from T3DB. PMD analysis was performed on all 255 endogenous compounds (with 223 unique chemical formula) and 705
exogenous carcinogens (with 394 unique chemical formula) from the T3DB database. PMD network analysis was then performed using the top 10 frequency PMDs from the 617 unique formulas. 235 compounds with unique chemical formulas did not have linkages, and were removed. The remaining 157 endogenous (red squares and circles) and 95 exogenous compounds (blue squares and triangles) were used to generate the network. 8. Figure 3: How the compounds have been selected? By pathway? Why limited to 10 top frequent PMDs? Shouldn't it be better to consider most meaningful PMDs related to such pathways?

Reply:
The compounds were selected based on their sources (endogenous or exogenous) as examples. However, any compounds in the KEGG database could be selected, and we have supplied the code to explore these other compounds in the supplementary materials (see line 291-319 in RSI.r). We limited this analysis to the top 10 PMDs to make it consistent with figure 2 and related discussion, but a different cut-off could be larger or smaller depending on the user preferences. While in some specific cases it may be more meaningful to consider known PMDs, the value of PMD-based reactomics is in untargeted analysis, for which pre-defining the active reactions in a certain system is very challenging. Therefore, we allowed the data (either samples or databases) to determine the enriched PMDs and used a frequency cut-off for selection.
The new caption will be:  5-cholestene (B), caffeine (C), and bromophenol (D), networks were generated using the 10 top frequency PMDs identified in KEGG. For each network, the selected compound is indicated with a red node. The associated metabolites are indicated by a black node, with edges colored based on the PMD of the relationship between nodes. Double edges from the same pairs or self-looped edges indicate isomers-related relationships.
9. In line 267 the authors refer to "quantitative PMD analysis" and refer to Figure 4. In analytical chemistry quantitative means absolute amounts, but it seems the meaning here is the quantities related to the distance among m/z. Would the authors please clarify that.
Reply: Thank you for this comment. Indeed, here the quantitative PMD analysis actually uses the total intensity of all the paired ions with the same PMD. However, due to the change of response factor between substitute and product compounds, we only selected the pairs with consistent intensity ratios (RSD%<30%) across all of the samples for quantitative PMD. We agree that 'relative quantitation' more accurately reflects this analysis, and have changed the language accordingly throughout the main text.
Furthermore, the new caption for Figure 4 will be: Figure 4. PMD differential analysis (t-test, p-value < 0.05) identifies PMD 2.02 Da as a potential biomarker reaction for lung cancer. Data from the public MetaboLights repository (MTBLS28 dataset) were analyzed with relative quantitative PMD analysis. PMD pairs of 2.02 Da with consistent intensity ratios (RSD% < 30%) across all samples were selected. Then, the total intensity of ions with a PMD 2.02 Da was calculated in each sample and compared between cases and controls.
10. Figure 4: The authors state that there is a significant difference between case and control, but this is not clear when observing the amount of variation and distribution of the data points. So more explanation on how the data was analyzed and the data itself (as supplementary file) is needed. The caption is not informative at all. What about the other reactions in this dataset, no changes were observed?

Reply:
We identified 64 high frequency PMDs in the lung cancer data using PMD analysis, and have added these raw peaks list to the supplementary method (section: Code and data for the whole study). Further, the code to identify and screen the high frequency PMDs is also included.
In addition to PMD 2.02 Da, we saw several other PMDs with statistical differences between cases and controls including: PMD 0.04 Da,21.98 Da,26.02 Da,1.00 Da,0 Da,and 2.01 Da. A limitation of using data from a public database is that PMDs could not be validated. Since 21.98 Da could be a sodium adduct, and the other PMDs might come from low accuracy of MS (e.g.