Collected mass spectrometry data on monoterpene indole alkaloids from natural product chemistry research

This Data Descriptor announces the submission to public repositories of the monoterpene indole alkaloid database (MIADB), a cumulative collection of 172 tandem mass spectrometry (MS/MS) spectra from multiple research projects conducted in eight natural product chemistry laboratories since the 1960s. All data have been annotated and organized to promote reuse by the community. Being a unique collection of these complex natural products, these data can be used to guide the dereplication and targeting of new related monoterpene indole alkaloids within complex mixtures when applying computer-based approaches, such as molecular networking. Each spectrum has its own accession number from CCMSLIB00004679916 to CCMSLIB00004680087 on the GNPS. The MIADB is available for download from MetaboLights under the identifier: MTBLS142 (https://www.ebi.ac.uk/metabolights/MTBLS142).

until the middle of the last century, quinine; the antihypertensive reserpine; and vincristine and vinblastine, which are used directly or as derivatives for the treatment of several cancer types. Recently, much effort has been directed toward understanding and manipulating the underlying biosynthetic pathways of MIAs in order to engineer them in microorganisms and allow industrial production of medicinally relevant compounds [3][4][5]. Although a large amount of knowledge has been accumulated concerning the early steps [6][7][8] and the assembly of key intermediates, many questions remain unanswered, and the discovery of new members of this family may reveal unexpected enzymes involved in the biosynthesis of this intriguing group of natural products.
As part of our continuing interest in MIA chemistry [9][10][11][12], we developed a streamlined molecular networking [13] dereplication pipeline built on an in-house MS/MS database comprising a cumulative collection of MIAs [14]. To enrich this database, seven prominent practitioners from the global natural products research community shared their historical collections, leading to the construction of the largest MS/MS dataset of MIAs to date, which we named the Monoterpene Indole Alkaloids DataBase (MIADB) (Fig. 1). The MIADB contains MS/MS data of 172 standard compounds, comprising 128 monoindoles and 44 bisindoles (listed in Supplementary Table 1), and covers more than 70% (30/42) of the known MIA skeletons. The information that can be drawn from this dataset is valuable for researchers who aim to isolate new MIAs.
The purpose of this Data Descriptor is to announce the deposition of the MIADB on the Global Natural Product Social Molecular Networking (GNPS [15]) and MetaboLights [16]. Each spectrum of the MIADB has its own accession number, from CCMSLIB00004679916 to CCMSLIB00004680087, on GNPS (accessed via: https://gnps.ucsd.edu/ProteoSAFe/static/gnps-splash.jsp). The spectral collection is also available for download from MetaboLights under the identifier: MTBLS142 [17].
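The accession range quoted above can be expanded into individual identifiers. The following is a minimal sketch, assuming that the accession numbers are contiguous between the two bounds (the size of the range, 172, agrees with the number of spectra in the MIADB); the function name `miadb_accessions` is ours and is not part of any GNPS tool.

```python
def miadb_accessions(first="CCMSLIB00004679916", last="CCMSLIB00004680087"):
    """Return every accession between first and last, inclusive.

    Assumption: the numeric parts form one contiguous block.
    """
    prefix = "CCMSLIB"
    width = len(first) - len(prefix)          # number of digits to zero-pad to
    start = int(first[len(prefix):])
    stop = int(last[len(prefix):])
    return [f"{prefix}{n:0{width}d}" for n in range(start, stop + 1)]


accessions = miadb_accessions()
print(len(accessions))  # 172
```

Such a list may be convenient when the library entries need to be looked up one by one on GNPS.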

Methods
Sample preparation. Each of the collected MIAs was diluted to a concentration of 1 mg/mL using HPLC-
Database constitution. The analysis of each of these substances resulted in 172 files in the standard .d format (Agilent standard data format). A list of individual compounds for each sample was generated by an Auto MS/MS data mining process implemented in MassHunter® software and applied to every single file. Averaged as well as single-collision-energy MS/MS spectra were generated from the three retained collision energies (30, 50, and 70 eV). Within this list, the molecular formula (as well as the exact mass) of the expected compound (in its charged state) was identified, and the other features were then removed. Finally, each spectrum was converted into the .mgf format (Mascot Generic Format) using the export tool of the MassHunter® software.
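For readers unfamiliar with the .mgf format, the sketch below reads spectra from MGF text. It is a minimal illustration and not the MassHunter® export routine: it assumes only the common `BEGIN IONS` / `END IONS` layout with `KEY=value` header lines followed by "m/z intensity" pairs. The example record uses placeholder values and does not correspond to any MIADB entry.

```python
def parse_mgf(text):
    """Parse MGF text into a list of {'params': dict, 'peaks': [(mz, intensity)]}."""
    spectra, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                                  # skip blanks and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                key, value = line.split("=", 1)       # header line such as PEPMASS=...
                current["params"][key.strip().upper()] = value.strip()
            else:
                fields = line.split()                 # peak line: m/z then intensity
                intensity = float(fields[1]) if len(fields) > 1 else 0.0
                current["peaks"].append((float(fields[0]), intensity))
    return spectra


# Placeholder record (hypothetical values, for illustration only).
EXAMPLE = """BEGIN IONS
TITLE=placeholder_spectrum
PEPMASS=300.0
CHARGE=1+
100.0 10.0
200.0 50.0
END IONS
"""

spectra = parse_mgf(EXAMPLE)
print(spectra[0]["params"]["PEPMASS"], len(spectra[0]["peaks"]))
```

Keeping the header values as strings avoids guessing at their types; callers can convert `PEPMASS` to a float when needed.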

Data Records
All data described in this article have been uploaded to GNPS and MetaboLights. Each spectrum of the 172 compounds of the MIADB has its own accession number, from CCMSLIB00004679916 to CCMSLIB00004680087, on the Global Natural Product Social Molecular Networking (GNPS) platform (accessed via: https://gnps.ucsd.edu/ProteoSAFe/static/gnps-splash.jsp). The spectral collection in its two versions (i.e., averaged and separate collision energy MS/MS spectra at 30, 50, and 70 eV) is available for download from MetaboLights under the identifier: MTBLS142 [17].
Fig. 2 Full molecular network of the profiled compounds from a methanol extract of C. roseus leaves annotated by the MIADB. The cosine similarity score cutoff for the molecular network was set at 0.6, the parent ion mass tolerance at 0.02, the fragment ion mass tolerance at 0.02, the library score threshold at 0.6 and the minimum matched peaks at 6. The cosine similarity scores are depicted on the edges.
www.nature.com/scientificdata
analyses were carried out by the various collaborators who contributed to the establishment of the database. The obtained mass spectra were individually inspected to verify the occurrence of either the protonated molecule or the molecular ion as the precursor mass.
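To make the networking parameters quoted in the figure caption concrete (score cutoff 0.6, parent and fragment ion tolerances 0.02, minimum of 6 matched peaks), the sketch below scores two spectra with a plain greedy cosine. This is an assumption-laden illustration: GNPS itself uses its own scoring implementation, intensities are left unscaled here, and the function names and input layout are ours.

```python
import math


def cosine_score(peaks_a, peaks_b, fragment_tol=0.02):
    """Greedy cosine between two peak lists [(mz, intensity), ...].

    Returns (score, number_of_matched_peaks).
    """
    norm_a = math.sqrt(sum(i * i for _, i in peaks_a))
    norm_b = math.sqrt(sum(i * i for _, i in peaks_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0, 0
    # All peak pairs whose m/z values agree within the fragment tolerance.
    candidates = [
        (ia * ib, x, y)
        for x, (ma, ia) in enumerate(peaks_a)
        for y, (mb, ib) in enumerate(peaks_b)
        if abs(ma - mb) <= fragment_tol
    ]
    candidates.sort(reverse=True)            # strongest contributions first
    used_a, used_b, dot, matched = set(), set(), 0.0, 0
    for product, x, y in candidates:
        if x in used_a or y in used_b:
            continue                          # each peak may be matched once
        used_a.add(x)
        used_b.add(y)
        dot += product
        matched += 1
    return dot / (norm_a * norm_b), matched


def is_library_hit(query, reference, parent_tol=0.02, fragment_tol=0.02,
                   min_score=0.6, min_matched=6):
    """query/reference: dicts with 'precursor_mz' and 'peaks' (hypothetical layout)."""
    if abs(query["precursor_mz"] - reference["precursor_mz"]) > parent_tol:
        return False
    score, matched = cosine_score(query["peaks"], reference["peaks"], fragment_tol)
    return score >= min_score and matched >= min_matched
```

The minimum-matched-peaks condition matters in practice: two spectra sharing a single intense fragment can reach a high cosine without being structurally related.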

Molecular networking-based dereplication of Catharanthus roseus methanol extract. Molecular networking-based dereplication using MIADB-uploaded GNPS libraries was attempted on the methanol extract of Catharanthus roseus, the MIA content of which has been thoroughly studied: more than 130 different compounds have been reported from the different tissues of the plant [18]. In the displayed network, the experimental data of the C. roseus methanol extract are depicted as green rectangles, and nodes representing a consensus of experimental data and database records (i.e., MIADB-uploaded in the GNPS libraries) are displayed as red rectangles (Fig. 2). As expected, molecular networking of the C. roseus leaf methanol extract allowed dereplication of previously known metabolites within this plant, including tabersonine, catharanthine, vindolinine, perivine, geissoschizine, pericyclivine, serpentine, raubasine, and akuammigine (Table 1). All the dereplicated compounds were assigned a level of confidence of 1 according to Schymanski et al. [19] based on HRMS, MS/MS and retention time matching, except for geissoschizine, serpentine, and alloyohimbine. The latter were attributed a level of confidence of 2, owing to a retention time (RT) difference greater than 1.5 min. The molecular networking-based dereplication provided a comprehensive coverage of C. roseus alkaloids with regard to the available standards, despite the noticeable lack of a vinblastine hit. This absence is likely due to the vinblastine concentration, which is known to be very low in the plant (ranging from 0.0003% to 0.001% w/w dry weight) [20]. Conversely, some unexpected matches were also found in the dereplication: burnamine and vobasine. Although neither of these was previously described in C. roseus, both structural assignments can be deemed reasonable based on biosynthetic considerations.
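The retention time criterion used above to separate confidence levels 1 and 2 can be written as a one-line rule. This is a sketch under the assumption that the HRMS and MS/MS matches have already been established, so that only the RT difference decides the level; the function name and the example retention times are hypothetical.

```python
def confidence_level(rt_sample_min, rt_standard_min, max_rt_delta_min=1.5):
    """Return 1 if the RT difference is at most the limit, otherwise 2.

    Assumes accurate mass and MS/MS already match the reference standard.
    """
    delta = abs(rt_sample_min - rt_standard_min)
    return 1 if delta <= max_rt_delta_min else 2


# Hypothetical retention times in minutes, for illustration only.
print(confidence_level(10.0, 10.4))  # 1
print(confidence_level(10.0, 12.0))  # 2
```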
Because burnamine is an akuammiline-derived MIA, like akuammine [21] and the monomer precursors of the bisindoles vingramine and methylvingramine [22], which have been reported to occur in C. roseus, its detection is not unexpected. Likewise, the co-dereplication in the depicted molecular network of the previously described vobasane-type perivine supports the identification of vobasine within this plant. Such examples emphasize the dereplicative interest of the MIADB, especially for such a thoroughly investigated plant model. Prior to its GNPS upload, i.e., as an in-house database, the ability of the MIADB to pinpoint tentatively new MIAs was demonstrated through the streamlined isolation of geissolaevine along with its O-methylether derivative and 3′,4′,5′,6′-tetrahydrogeissospermine from the previously extensively studied Geissospermum laeve (Vell.) Miers (Apocynaceae) [14]. Altogether, these results support the valuable contribution of the MIADB, either for the straightforward identification of monoterpene indole alkaloids or to highlight putative structural novelty within this privileged structural class.
The topology of the obtained network also reveals that further information could yet be accessed from C. roseus extracts. Indeed, most dereplicated MIAs are tightly associated within cluster A. Since clustering depends on structural similarity, a single match to the MIADB-implemented GNPS allows for the propagation of the structure throughout an entire molecular family, indicating that most if not all the nodes of this cluster refer to MIAs. The seminal contribution of the MIADB to the tandem mass spectrometric databanks of MIAs is expected to pave the way for the upload of such data by the numerous teams involved in MIA research all over the world, thereby making this tool increasingly efficient at providing quick and sharp insight into the MIA content of any producing organism.
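The propagation of an annotation throughout a molecular family, as described above, amounts to finding the connected component of the network that contains a library hit. The sketch below illustrates this under simple assumptions: edges are kept when their cosine score reaches 0.6 (the cutoff used for the network), node identifiers are hypothetical, and the mapping of a node to a compound name is illustrative only.

```python
from collections import defaultdict


def annotate_clusters(edges, library_hits, min_cosine=0.6):
    """Group nodes into clusters and report the library hits found in each.

    edges: iterable of (node_a, node_b, cosine)
    library_hits: dict mapping node -> compound name
    Returns a list of (sorted_nodes, sorted_hit_names).
    """
    neighbours = defaultdict(set)
    for a, b, cosine in edges:
        if cosine >= min_cosine:
            neighbours[a].add(b)
            neighbours[b].add(a)
    for node in library_hits:
        neighbours.setdefault(node, set())    # keep unconnected hits as singletons

    seen, clusters = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:                           # depth-first traversal of one cluster
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(neighbours[node] - members)
        seen |= members
        hits = sorted(library_hits[n] for n in members if n in library_hits)
        clusters.append((sorted(members), hits))
    return clusters


# Hypothetical network: n2 matches a library spectrum, n1 and n3 are unannotated.
edges = [("n1", "n2", 0.8), ("n2", "n3", 0.7), ("n4", "n5", 0.9), ("n1", "n4", 0.3)]
clusters = annotate_clusters(edges, {"n2": "tabersonine"})
for nodes, hits in clusters:
    print(nodes, hits)
```

In this toy network, n1 and n3 inherit a tentative MIA assignment from the hit on n2, whereas the n4/n5 cluster remains unannotated because the only edge linking it to the first cluster falls below the cutoff.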
Dereplication of the MIADB against the MIAs previously available in the GNPS library. As a second validation assay, the MIADB was dereplicated against the GNPS library. For this purpose, the 172 .mgf files were submitted to the GNPS online platform and all the hits between the MIADB and the GNPS were annotated. Nineteen of the 172 MIAs were identified as hits by the GNPS platform (Table 2).
These results indicate that the compounds from the 19 matches were correctly identified within the GNPS library, except in the case of epimers or isomers. Indeed, it should be noted that the matching process does not take the stereochemistry of the compounds into account (Table 2).

Code availability
The LC-MS feature detection software (MassHunter®) used in this work is commercially available from Agilent®.