Ion mobility collision cross-section atlas for known and unknown metabolite annotation in untargeted metabolomics

The metabolome includes not just known but also unknown metabolites; however, metabolite annotation remains the bottleneck in untargeted metabolomics. Ion mobility – mass spectrometry (IM-MS) has emerged as a promising technology by providing multi-dimensional characterizations of metabolites. Here, we curate an ion mobility CCS atlas, namely AllCCS, and develop an integrated strategy for metabolite annotation using known or unknown chemical structures. The AllCCS atlas covers vast chemical structures with >5000 experimental CCS records and ~12 million calculated CCS values for >1.6 million small molecules. We demonstrate the high accuracy and wide applicability of AllCCS with medium relative errors of 0.5–2% for a broad spectrum of small molecules. AllCCS combined with in silico MS/MS spectra facilitates multi-dimensional match and substantially improves the accuracy and coverage of both known and unknown metabolite annotation from biological samples. Together, AllCCS is a versatile resource that enables confident metabolite annotation, revealing comprehensive chemical and metabolic insights towards biological processes.
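As an illustration of the multi-dimensional match mentioned in the abstract, the sketch below filters candidate structures by m/z and CCS tolerances and then ranks the survivors by an MS/MS similarity score. It is a minimal sketch and not the AllCCS implementation: the `Candidate` class, the function names, the tolerance defaults (10 ppm, 2%) and the example values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical candidate structure with library or predicted values."""
    name: str
    mz: float          # reference m/z of the adduct
    ccs: float         # reference CCS in square angstroms
    msms_score: float  # 0..1 spectral similarity, assumed precomputed


def ppm_error(measured: float, reference: float) -> float:
    """Signed m/z error in parts per million."""
    return (measured - reference) / reference * 1e6


def ccs_error_percent(measured: float, reference: float) -> float:
    """Signed CCS error as a percentage of the reference value."""
    return (measured - reference) / reference * 100.0


def match_candidates(measured_mz, measured_ccs, candidates,
                     mz_tol_ppm=10.0, ccs_tol_percent=2.0):
    """Keep candidates within both tolerances, best MS/MS score first.

    The tolerance defaults are assumptions, not values taken from AllCCS.
    """
    kept = [
        c for c in candidates
        if abs(ppm_error(measured_mz, c.mz)) <= mz_tol_ppm
        and abs(ccs_error_percent(measured_ccs, c.ccs)) <= ccs_tol_percent
    ]
    return sorted(kept, key=lambda c: c.msms_score, reverse=True)


# Hypothetical feature and candidates (made-up numbers).
example_candidates = [
    Candidate("A", 200.0000, 150.0, 0.60),
    Candidate("B", 200.0010, 151.0, 0.90),
    Candidate("C", 200.0005, 160.0, 0.95),  # CCS too far from the feature
    Candidate("D", 200.0100, 150.0, 0.99),  # m/z too far from the feature
]
ranked = match_candidates(200.0005, 150.5, example_candidates)
print([c.name for c in ranked])
```

In this sketch CCS acts as a hard filter before spectral ranking; a weighted score combining all dimensions would be an equally plausible design.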

The manuscript by Zhou and colleagues describes a software tool for the large scale deployment of ion mobility collision cross-section data for metabolite identification of both known and unknown metabolites. While a number of tools have been created previously, including by the senior author, to search databases of known standards and also predict CCS values for metabolites without CCS measures, AllCCS dramatically extends the use of CCS values and ion mobility for metabolite identification and annotation by unifying a number of experimental CCS values to produce consensus values and predicting CCS values for an impressive database of putative metabolites. The authors demonstrate the use of AllCCS for a number of biological applications and I was impressed with the identification of novel metabolites.
My major comment is that, if I have understood correctly, AllCCS is available through a webserver maintained by the group, and people need to register for access. While the authors make some data available, a lot of what is generated is behind the webserver interface. What are the plans for making this available to the wider community? The database of putative metabolites looks particularly useful. Also, if the tools are made freely available to the community, they can be greatly improved in terms of versatility; as an example, I think of the development of xcms by the community. Have the authors considered this? Also, how does the server cope with large datasets as can be generated by ion mobility data?
As a follow-up comment, while it's impressive how well CCS values can be estimated (in many cases the error is less than 2%), can the authors discuss examples where this is not enough? Some discussion of what can be separated in terms of structural isomers would help. I am thinking of metabolites such as glucose-6-P, glucose-1-P and fructose-6-P, which are poorly separated by both chromatography and ion mobility. Some examples of where an annotation can be made but not an identification would be useful.
In addition, I do have some specific comments.
Abstract, line 17. The first sentence should be "the metabolome" rather than just metabolome. I think the paper would benefit from a careful read to fix some of the grammar errors that creep in.
Introduction, line 32. This first sentence is also odd and should be re-written. Similarly for the sentences on lines 54, 58 and 66.
Results. Line 94 -should be laboratories.
Line 98: "As a result, a total of 3,539 unified CCS values were calculated for 2,193 compounds…" I think, from what the authors say elsewhere, the discrepancy in numbers is down to ionisation mode rather than different conformers. It is probably worth stating this for the reader, and that the authors are largely annotating one conformer.
Line 283. Is KEGG the best choice for metabolic reconstructions? Is RECON a better alternative for human metabolism?
A minor point: both "Student's t-test" and "student's t-test" are written. I think the correct form is with a capital S, but it needs to be consistent one way or the other, depending on whether you believe Student was a genuine author or not!
"All other data supporting the findings of this study are available from the corresponding author on reasonable request." Should the data be made openly available?
Something is wrong with reference 9.
Reviewer #3 (Remarks to the Author):
The manuscript by Zhu et al. describes a curated atlas of ion mobility collision cross sections (CCS). These collision cross sections are being increasingly used for the purpose of metabolite annotation as a complementary parameter by which unknown molecules, particularly metabolites in untargeted metabolomics studies, can be better annotated. Although the number of databases with collision cross sections is growing steadily both in number and in size, the dearth of comprehensive CCS databases has prompted many scientists to also look into predicting such CCS values using machine learning approaches. The authors of this manuscript have, in the past, presented several aspects of the work covered in this new manuscript in a number of articles. These include the following: Anal. Chem. 2016, 88, 11084−11091 (MetCCS); Anal. Chem. 2017, 89, 9559−9566 (LipidCCS); Bioinformatics, 33(14), 2017, 2235-2237 (MetCCS) etc. Similar work has also been published by other teams, including: Anal. Chem. 2020, 92, 1720−1729; Anal. Chem. 2019, 91, 5191−5199; Anal. Chem. 2017, 89, 6583−6589; Chem. Commun., 2017, 53, 7624-7627; Journal of Chromatography A, 1542 (2018) 82-88; Analytica Chimica Acta 924 (2016) 68-76. In that sense, and despite the interesting nature of what is being described in the manuscript, the work is seen as incremental, but not transformative, so publication in Nature Communications is not seen as a good option, and a more specialized Journal such as Analytical Chemistry or The Analyst is recommended.

Response to the reviewers:
The authors would like to thank the reviewers for the helpful comments. We feel these comments have strengthened the manuscript considerably.

Reviewer #1:
Remark to the Author: "This paper described the creation and validation of an atlas of CCS values.
The atlas is created using a combination of experimental CCS (merged from multiple databases) and in-silico predicted CCS. The authors show that they are able to predict CCS better than current systems, and that combining predicted CCS with other data (m/z, predicted MS/MS) improved candidate ranking. The atlas looks to be a resource that has practical utility and the authors do a good job showing that it performs well." Ans: Thanks a lot for the reviewer's positive comments towards publication.
Comment #1: "Overall I enjoyed reading this paper. I think it does a good job in setting out the system, evaluating it, and describing the benefits it brings. There are quite a few places in the text though in which more thorough proof-reading is required (too numerous to list here)." Ans: Thanks a lot for the reviewer's comment. We have carefully checked and re-edited the manuscript and supplementary information.
Table 12). The SVR prediction performed better than the MLR method on validation sets 1 and 2. The results indicated that the optimized parameters in SVR prediction push the model towards a linear regression, while still giving better performance. The term "radial basis" has been corrected. The results and discussion have been added in the revised manuscript.
FoodDB will also be deployed through the collaboration with the HMDB bioinformatics team;

(2) we are collaborating with Dr. Hiroshi Tsugawa to integrate AllCCS into the data processing tool MS-DIAL 4. This accelerates the workflow from raw data processing to compound identification. The related work has been posted on bioRxiv (https://doi.org/10.1101/2020.02.11.944900) and accepted by Nature Biotechnology (published online by the journal on June 15, 2020).
(3) we are also collaborating with instrument vendors (currently Agilent and Bruker) to deploy AllCCS into the vendors' software. This will give general users easy access to AllCCS.
Finally, as requested by the reviewer, the CCS values for putative metabolites generated from the in-silico reactions have been deposited in AllCCS (named as ExtDB).
The related discussion has been added in the revised manuscript.
Ans: We agree with the reviewer's comment. As we mentioned in the response to Comment #1, we are collaborating to integrate AllCCS into data processing tools such as MS-DIAL 4 and vendors' software tools. This accelerates the workflow from raw data processing to compound identification. Developers could certainly also integrate AllCCS with other data processing software tools such as XCMS and MZmine. The related discussion has been added in the revised manuscript.
Comment #3: "Also, how does the server cope with large datasets as can be generated by ion mobility data?" Ans: Thanks a lot for the reviewer's comment. Currently, AllCCS does not directly process raw IM-MS data files. Instead, we recommend that users process raw IM-MS data files using MS-DIAL 4 or vendors' software tools (e.g., Agilent Mass Profiler and Waters Progenesis QI). Users can then download AllCCS and perform the annotation within these tools. In addition, users can upload the feature table to the AllCCS server for metabolite annotation. The related discussion has been added in the revised manuscript.
Ans: We agree with the reviewer's comment. Although CCS prediction has improved to ~2% prediction errors, the identification of metabolite isomers is still a challenge due to the limited resolution of ion mobility (IM) separation. In the revised manuscript, we discussed the IM separation of 4 metabolite isomers, including glucose-6-phosphate (G6P), glucose-1-phosphate (G1P), fructose-6-phosphate (F6P) and fructose-1-phosphate (F1P) (see Supplementary Figure 18).
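To make the link between IM resolution and isomer discrimination concrete, the sketch below converts a resolving power into an approximate peak width and compares it with the CCS difference of an isomer pair. It is only a rough illustration under two assumptions that are ours and not part of the revised manuscript: the quoted IM resolution is read as R = CCS / ΔCCS (FWHM), and a pair is treated as distinguishable only when its CCS difference is at least one peak width. The function names and the example CCS values are hypothetical.

```python
def fwhm_percent(resolving_power: float) -> float:
    """Peak width (FWHM) as a percentage of CCS, assuming R = CCS / dCCS."""
    return 100.0 / resolving_power


def percent_difference(ccs_a: float, ccs_b: float) -> float:
    """Absolute CCS difference relative to the mean of the two values."""
    mean = (ccs_a + ccs_b) / 2.0
    return abs(ccs_a - ccs_b) / mean * 100.0


def is_distinguishable(ccs_a: float, ccs_b: float,
                       resolving_power: float) -> bool:
    """Rule of thumb (an assumption): difference must reach one FWHM."""
    return percent_difference(ccs_a, ccs_b) >= fwhm_percent(resolving_power)


# Peak widths implied by resolving powers of 40 and 60.
for r in (40, 60):
    print(f"R = {r}: FWHM is about {fwhm_percent(r):.2f}% of CCS")

# Hypothetical isomer pair differing by roughly 0.6% in CCS.
print(is_distinguishable(150.0, 150.9, resolving_power=50))
```

Under these assumptions, a resolving power of 40-60 corresponds to peak widths of roughly 1.7-2.5% of the CCS value, so pairs that differ by well under 1% overlap almost completely.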
Four metabolite isomers were analyzed using Agilent DTIM-MS with an IM resolution of 40-60. The results demonstrated that the metabolite isomers were poorly separated, with CCS differences ranging from 0.6% to 3.4%. Therefore, it is impossible to differentiate the isomer pairs of G6P/G1P and
Ans: Thanks a lot for the reviewer's comment. We agree with the reviewer that the dearth of comprehensive CCS databases has prompted the development of machine-learning-based CCS predictions by our group and others. Compared to other tools, AllCCS offers value and advantages in the following aspects:
(1) AllCCS is the most comprehensive atlas to embrace both experimental and predicted CCS values, and provides the largest platform to store, standardize, and search available CCS resources;
(2) The improved CCS prediction algorithm in AllCCS outperformed other existing tools in terms of accuracy, coverage and applicability;
(3) The representative structure similarity (RSS) developed in AllCCS can indicate the errors of predicted CCS values, which has not been achieved in other tools;
(4) We systematically demonstrated that AllCCS enabled the multi-dimensional match and substantially improved the accuracy of metabolite annotation;
(5) We demonstrated the use of AllCCS to discover unknown metabolites and reveal additional chemical and metabolic insights into biological processes.
Overall, AllCCS has provided a high-quality and unified CCS database for the IM-MS community, and further enabled both known and unknown metabolite annotation in untargeted metabolomics.
We believe our work is significant and suitable for the broad interests of scientists in the field.