Chemical Similarity Enrichment Analysis (ChemRICH) as alternative to biochemical pathway mapping for metabolomic datasets

Metabolomics answers a fundamental question in biology: How does metabolism respond to genetic, environmental or phenotypic perturbations? Combining several metabolomics assays can yield datasets for more than 800 structurally identified metabolites. However, biological interpretations of metabolic regulation in these datasets are hindered by inherent limits of pathway enrichment statistics. We have developed ChemRICH, a statistical enrichment approach that is based on chemical similarity rather than sparse biochemical knowledge annotations. ChemRICH utilizes structure similarity and chemical ontologies to map all known metabolites and name metabolic modules. Unlike pathway mapping, this strategy yields study-specific, non-overlapping sets of all identified metabolites. Subsequent enrichment statistics is superior to pathway enrichments because ChemRICH sets have a self-contained size where p-values do not rely on the size of a background database. We demonstrate ChemRICH’s efficiency on a public metabolomics data set discerning the development of type 1 diabetes in a non-obese diabetic mouse model. ChemRICH is available at www.chemrich.fiehnlab.ucdavis.edu

SCIeNtIFIC REPORtS | 7: 14567 | DOI: 10.1038/s41598-017-15231-w Calculation of PubChem substructure fingerprint. SMILES codes were converted into an rCDK molecule container which was then used to calculate the PubChem 881 bit fingerprint (ftp://ftp.ncbi.nlm.nih.gov/ pubchem/specifications/pubchem_fingerprints.txt). Fingerprints for representative compounds for each lipid class and selected MeSH classes were calculated and stored in an R object for re-use.
Storing ChemRICH sets in a database. A NoSQL database Apache CouchDB version 1.6.0 was used for storing the MeSH annotations and the substructure fingerprints for metabolites to enable fast queries.

Calculation of pair-wise chemical similarity matrices. Using the PubChem 881 fingerprints in the
Tanimoto chemical similarity coefficient a pair-wise similarity matrix was obtained. The formula for Tanimoto calculations was T (A, B) = AB/(A + B) − AB, where AB are the substructure bits found in both A and B compounds.
Calculating a chemical similarity tree and its decomposition. The pair-wise similarity matrix was then clustered using hierarchical clustering (hcl) method in R with average linkage parameters. The output tree of the hcl was divided into clusters using the dynamicTreecut 44 package in R. The tree was further converted into a phylogenetic tree object using the ape package in R and visualized as circular tree plot. Enrichment statistics. The compound-term mapping was used as set definitions for set enrichment analysis. We have used the 'one sample KS test' to test a null hypothesis that p-values for metabolites in a set are obtained from a reference uniform probability distribution of p-values as defined by the "punif " parameters of R. FDR were calculated using p.adjust function in R for set level p-values.

Web-application and deployment.
A web application was developed in the opencpu R package 45 to make ChemRICH available to non-R experts. The tool is also available as an R-package. The tool has been deployed at www.chemrich.fiehnlab.ucdavis.edu.
Source-code availability. The source for the ChemRICH and documentation is available at www.github. com/barupal/chemrich under the cc-by license.
Statistical analysis. Student t-tests were performed in R. Fold changes were calculated by taking the ratio of medians of the two experimental groups in the NOD mouse study. P-values were adjusted using the false discovery rate method in R. Statistical plotting was done using ggplot2 package in R.

Results
Metabolome data. For demonstrating the chemical enrichment clustering and statistics approach in ChemRICH, we used a publicly available plasma metabolomics dataset from a non-obese type 1 diabetes mouse model 23 . Data were downloaded from the NIH MetabolomicsWorkbench.org database 29 as study number ST000075. Non-obese diabetic (NOD) mice are a polygenic animal model and exhibit a susceptibility to spontaneously develop of type 1 diabetes as an autoimmune result of insulitis. We compared metabolic phenotypes of 36-week old animals that developed diabetes (n = 31) versus NOD mice that remained normoglycemic (n = 40). In this study, a total of 385 unique plasma metabolites were identified and quantified by normalized signal intensity by three metabolomic assays, untargeted primary metabolism screening by gas chromatography-time of flight mass spectrometry 46 , untargeted profiling of complex lipids by charged surface hybrid liquid chromatography-quadrupole time of flight mass spectrometry 47 , and targeted analysis of oxylipins 48 by reversed phase liquid chromatography-QTRAP tandem mass spectrometry. Details are provided in Supplement Table S1. More than half of all identified compounds were found to be potentially associated with hyperglycemia using a Student's t-test (raw p < 0.05) with an effect size ranging from 1.05 to 4.5-fold. 22% of all compounds showed a significant 2-fold-change (increased or decreased), highlighting large metabolic differences in diabetic mice beyond glucose metabolism. Figure 1 (left panel) gives an overview on fold-changes and univariate statistical significance in a volcano plot. We found an almost equal number of significantly increased and decreased compounds.
First, we queried all identified metabolites against the NCBI BioSystems pathway repository using their PubChem compound identifiers. This repository covered all major metabolic pathway databases including KEGG 31 , BioCyc 39 , WikiPathways 49 , Reactome 38 and Gene ontology 50 therefore it provided most comprehensive pathway annotations for metabolites. We found that 57% of all the known metabolites failed to be annotated to known biochemical pathways in this repository. Specifically, complex lipids and oxylipins lacked pathway annotations ( Fig. 1 and Supplement Table S1). We concluded pathway maps failed to provide biochemical overview sets for most identified metabolites in this typical metabolomic study. In addition, many metabolites are mapped to multiple overlapping pathways.

Development of the ChemRICH approach.
To obtain set definitions for all identified compounds in a metabolome study, we propose to utilize chemical ontology terms and chemical similarity mapping. We named this new approach, ChemRICH. ChemRICH relies on a starting database of metabolites along with their MeSH terms. This starting metabolite list includes all compounds that have been annotated with MeSH terms by the PubChem database. First PubChem CIDs, names and SMILES codes of compounds from the test study were searched in the ChemRICH database and their MeSH terms were retrieved. Next, MeSH terms were estimated using Tanimoto chemical similarity coefficients for the compounds that were not present in the ChemRICH database. MeSH terms for compounds found in newly entered metabolomic studies are now automatically added to the ChemRICH database. The overall development and organization of ChemRICH is shown in the Fig. 2. Precise steps in the use of ChemRICH for the test study are shown in the Fig. 3, including how to define non-overlapping sets of metabolites and how to use the KS-test for obtaining statistical significance for set enrichments. In the following sections we explain the rationale behind using the MeSH ontology, chemical similarity and the KS test for the ChemRICH enrichment analysis.
Medical subject headings provide metabolite sets defined by chemical classes. We started deriving metabolomic clusters for set enrichments using chemical ontology classes from the MeSH database. Chemical ontologies are useful because they represent a comprehensive and logical approach to classify all detected compounds in a metabolome study. These classes further provide meaningful naming of groups of metabolites, and in some cases, also lend biological relevance. Using the chemical ontology database MeSH, 49% of all identified metabolites of our test NOD mouse metabolome dataset were covered ( Fig. 1 and Supplement Table S1). CID, compounds names and SMILES codes were used to query compounds from the MeSH database.
While chemical ontology classification provided better metabolite coverage than biochemical pathway mapping, ontology classes could not directly be used for set enrichment statistical analysis. First and foremost, 51% of the identified compounds were not covered by MeSH. Second, many metabolites were annotated by more than one ontology class. Moreover, many novel metabolites might be detected that have not been annotated in existing databases.
Chemical similarity improves mapping of metabolite and sets. Novel compounds are frequently found that are not yet present in existing chemical ontologies, for example, triglycerides with unusual odd-chain fatty acids that may enter the blood stream as dietary components. Similarly, compounds with in-silico predicted  mass spectra 51 are often missing from ontology databases. We first calculated Tanimoto similarity coefficients for all unmapped metabolites (51%) using the rCDK package that provides the PubChem 881 bit substructure fingerprint in R. These substructure fingerprints were used to calculate Tanimoto similarity coefficients for all query compounds against 91,444 chemicals that are currently included in the MeSH database. All unmapped metabolites were added to the ontology class in which the most similar compounds were found at Tanimoto similarity scores > 0.90. If no ontology class with compounds at Tanimoto score > 0.9 were found, a slightly relaxed cutoff >0.75 Tanimoto score was used. This approach yielded MeSH annotations for up to 98% of total compounds (Fig. 3). Classic fatty acids and their variants, oxylipins, were found to have high Tanimoto similarities. We used a rule based approach to directly establish cluster memberships for each compound. SMILES code string patterns were searched against the SMILES for all the known compounds in the test data set using regular expression analysis in R. With this method, all 14 fatty acids, 51 oxylipins and 3 prostaglandins were correctly assigned to different clusters.
We also tested the idea of using chemical similarity alone to generate sets of chemicals that could be used for calculating enrichment statistics. The cluster detection algorithm DynamicTreeCut 44 had been developed for cluster detections in genomic data. We found that the algorithm also efficiently distinguishes groups of chemicals if the average Tanimoto score within a set is significantly higher than the scores between a cluster and all other clusters. If only chemical similarity is used this process yielded a list of 89 clusters that generally could be used to highlight sets of metabolites that are significantly different to a control group in a metabolomic study. Such a chemical similarity clustering tree is shown in Fig. 4, with cluster labels given in Supplement Table S1. However, pure chemical similarity clusters are difficult to name by intelligible, biologically or chemically meaningful class labels.
Instead, we used Tanimoto chemical similarity mapping in addition to the backbone of MeSH ontology. Tanimoto chemical similarity calculations enabled us to generate new clusters that were absent in the MeSH database. New clusters of chemicals might be produced whenever a completely new class of metabolites is discovered by metabolomics, for example, the fatty acyl esters of hydroxyl fatty acids (FAHFAs) 52,53 . In our given NOD mice data set, no FAHFA lipids were detected. Instead, we tested the concept by excluding the 'saturated sphingomyelins' chemical ontology definition from the list of known distinct sets. Indeed, we found that the actual saturated sphingomyelins in the NOD mice metabolomics dataset were identified as a distinct and novel set by the chemical similarity tree cutting approach 44 . This test showed that not only can Tanimoto chemical similarity add novel metabolites to the tree that lack MeSH chemical ontology information, but the dynamic tree cut approach automatically detects entire new groups of compounds and adds these compounds into the ChemRICH sets as a new cluster for set enrichment statistics. Final MeSH annotations for each metabolite are provided in Supplemental  Table S1.
Defining non-overlapping ontology classes for metabolites. By combining chemical similarity and MeSH ontology mappings we obtained non-overlapping classes for a maximum number of structurally identified metabolites. These classes formed the metabolite sets that were subsequently used for calculating set enrichment statistics. Non-overlapping classes overcome two major bottlenecks of set enrichment analyses. In non-overlapping sets each metabolite is used only once, avoiding bias for hub metabolites such as glutamate that is used in many different biochemical reactions. Second, non-overlapping sets avoid the need for correction for multiple hypothesis testing for sets with high overlaps. Naming these sets is performed in the following way in ChemRICH: first, all superclass and subclass annotations are retrieved for each compound and sorted by specificity as defined by the MeSH ontology tree structure. For instance, the metabolite 'betaine' maps to the class entry D02.675 as 'onium compounds' but also to the more specific child term D02.092.877 'quaternary ammonium compounds' . Compounds were associated to their most specific ontology classes until a minimum of three compounds per class was reached. This approach found 49 non-overlapping MeSH ontology classes for 85% of the identified metabolites in our NOD dataset. We named these classes ChemRICH sets. The residual number of 15% of the identified compounds remained annotated with their individual MeSH ontology.
ChemRICH sets enabled self-contained enrichment analysis of altered metabolites. The combined number of metabolites in ChemRICH sets were then defined as study-specific, background-independent database to be used for the calculation of metabolite-set enrichment p-values.
There are two statistical tests that can be used for statistical set enrichment that do not rely on background databases, the binomial test and the Kolmogorov-Smirnov (KS) test. Binomial tests require a p-value cutoff to define the significantly affected variables within a set, while the KS-test does not require such a threshold and is therefore more comprehensively including all metabolites in a metabolomic data set. The KS test obtains set-level significance p-values by comparing the cluster p-values against a theoretical p-value distribution ("puinf " in R). Hence, the KS test yields set level p-values that are not affected by any (biochemical) background database or by the total number of altered metabolites, unlike the commonly use hypergeometric pathway enrichment test. Moreover, the KS-test also included compounds with marginal p-values, unlike the bionomial enrichment statistics test.
We compared all identified NOD mouse plasma metabolites for diabetic versus non-diabetic mice using the ChemRICH sets, using the Kolmogorov-Smirnov-test for set enrichment statistical analysis. Results are visualized in a 2-dimensional scatter plot (Fig. 5) in which sets were sorted by their order on the chemical similarity  Supplement Table S1. Increased metabolites levels in diabetic mice are labeled as red nodes, decreased levels are marked in blue.
SCIeNtIFIC REPORtS | 7: 14567 | DOI:10.1038/s41598-017-15231-w tree (Fig. 4), a graph that is intentionally similar to pathway mapping graphs obtained from MetaboAnalyst 27 . This graph directly visualizes the most significant and largest metabolite sets organized according by overall chemical diversity across all identified metabolites. Enrichment results for individual cluster sets are given in Table 1. We here show for the first time a chemical ontology-based metabolite set definition in combination with chemical similarity calculations for calculation of metabolic set enrichment significance.

ChemRICH enrichment analysis improves biological and biochemical interpretations.
This dataset on non-obese diabetic versus normoglycemic mice has been published and carefully interpreted previously 23 . The authors noted a range of overt changes such as 'elevations in circulating triacylglyercides' and 'reductions in major structural lipids, most notably lysophosphatidylcholines and phosphatidylcholines' 23 . However, the authors noted only a total 18 manually annotated clusters of identified metabolites 23 whereas ChemRICH refined the annotation of systematically different metabolite clusters into 55 distinct sets. The authors visualized the data in simple Tanimoto chemical similarity maps 40,54 that obscured a very important difference in the diabetic and non-diabetic mice: all complex lipids were quite distinctly regulated between lipids with unsaturated fatty acyl groups, and complex lipids that consisted of only saturated fatty acyl chains 54 . Both cluster sizes and significance values for unsaturated triglycerides, unsaturated phosphatidylcholines, unsaturated sphingomyelins and a range of other lipid classes were markedly different from their saturated counterparts ( Fig. 5 and Table 1). Similarly, ChemRICH enrichment analysis very clearly points out the high significance of increased branch-chain amino acids in comparison to the aromatic-or sulfur-containing amino acid sets. Manual classifications as well as univariate statistic interpretations by the original authors did not point out these large differences. Similarly, our ChemRICH statistics showed clear significance differences in carbohydrate metabolism detailed with pentoses, hexoses, disaccharides and sugar acids on the one hand, and much lower significance for hexose phosphates and sugar alcohols on the other hand (Fig. 5). These overt phenotypes were not clearly delineated in the original publication 23 . Detailed carbohydrate interpretations are necessary as it has been reported that endogenous sugar acids may modulate the feeding behaviour of rats 55 and can also disturb leptin responsive signalling pathway 56 . Without ChemRICH, those refined clusters were not identified and interpretations focus mostly on broad categories such as carbohydrates or lipids as reported in the original paper 23 . Hence, ChemRICH calculations and visualizations improve the ability to perform biological interpretations. Results of ChemRICH analysis for the example NOD mice study are provided in Supplementary Table S1 and S2. ChemRICH has also been tested on a smaller dataset of 128 known compounds 57 for finding interpretable metabolic clusters (Supplementary Tables S3 and S4).
ChemRICH is available both online and as stand-alone R package. To make ChemRICH broadly available for the metabolomic community, we have used the OpenCPU web-framework to call our R-functions via a representational state transfer-application programming interface (REST-API). The input file must include SMILES codes and PubChem compound identifiers for all identified metabolites. This information can be easily obtained from the Chemical Translation Service 58 and the PubChem identifiers exchange service 59 . ChemRICH is hosted at http://chemrich.fiehnlab.ucdavis.edu. For the bioinformatics community we have coded the tool as R-package.

Discussion
Because of the limitations of pathways maps for the metabolomics data interpretation, we have developed ChemRICH as a next generation of set enrichment analysis tools for metabolomics. ChemRICH is an alternative approach to the pathway analysis. It uses structurally defined metabolite sets and self-contained statistical analysis for finding enriched metabolite sets. It fills a major gap in the interpretation of the metabolomics datasets. Combining chemical similarity and classification ontologies enables annotating and naming metabolites as non-overlapping chemical classes. A self-contained KS test allows discovering metabolite sets with the most statistical significance. The tool has been developed into an R-package and a web-app for straightforward use by the metabolomics community. We demonstrated that ChemRICH can outperform classic tools for visualizing and interpreting metabolomics datasets using a publicly available study and highlighting the chemical sets which were not discussed in the original paper.
Using classic pathway maps covered only 40% of the plasma metabolites identified in our NOD mouse test case. Such low coverage is particularly critical whenever multiple analytical assays are combined to detect metabolites from different chemical classes, including complex lipids or oxylipins. Limiting enrichment analysis to pathway maps underutilizes those datasets. In contrast, combining chemical similarity with established chemical ontologies can classify over 90% of the identified metabolites for enrichment analysis. We have used MeSH ontology classes because these classes are already mapped to the PubMed literature database. This will allow straightforward use of metabolite set labels for interpreting metabolic dysregulation in the context of biochemical enzymes, physiology, chemical exposure or anatomical changes by manual or automated text mining.
ChemRICH is particularly useful for clinical and epidemiological studies using blood or urine specimens. Non-targeted blood metabolomics detects compounds that originate from cellular metabolism as well as exposure compounds of xenobiotic origins including food polyphenols 60 or drug compounds 13 . ChemRICH efficiently places such exposome chemicals into metabolites sets to perform exposure enrichment analysis that exceeds classic pathway enrichments.
Statistical significance of metabolomics enrichment should not depend on the size of background databases used. Using non-overlapping sets of variables have been used for partitioning and trimming of gene ontology databases 61,62 as basis for enrichment statistics. Moreover, the KS-test based gene set enrichment analysis (GSEA) 42 has been widely accepted as the standard approach to perform set enrichment analysis for genomics studies [63][64][65] . It does not require splitting gene lists, it can take entire genomics dataset as input and it does not depend on any specific background database. We propose metabolomic enrichment tools should learn from these experiences and also use the KS-test as standard for enrichment analysis. Using chemical similarity instead of fixed definitions of biochemical pathway has several advantages. Using encoded chemical structures allows precise mapping of metabolite to set definitions, avoiding manual, often arbitrary annotations of metabolites to 'sets' . Importantly, most complex lipids are not mapped to biochemical pathways. Chemical similarity mapping detects classes of compounds that are not yet represented in chemical ontologies, for example, for in-silico predicted compounds by tools such as LipidBlast 51 or the Metabolome-in-silico-network expansion database, MINE 66 . Adding unmapped metabolites to ontology class by Tanimoto chemical similarity is analogous to predicting 'anatomical therapeutic chemical code' for drug-like compounds 67 .
Chemically similar compounds are in biochemical proximity 40 . This provides a strong rationale for chemical clustering of known metabolites and perform set enrichment using the detected clusters. ChemRICH complements MetaMapp which is used for visualizing the chemical diversity of metabolomics datasets using network graphs. Use of MetaMapp and ChemRICH will provide the complementary visualization and summarization outputs that can streamline the interpretation of a metabolomics study. However, the use of ChemRICH is limited by our ability to correctly identify compounds in non-targeted metabolite screens. Future expansions might map unknown metabolites as well using mass spectral similarity or sub-structure prediction from mass spectra. In MetaMapp, we have implemented using mass spectral similarity networks for visualizing large volume of spectral data 68 , including unknown metabolites. Clusters can be detected using a community detection algorithm, but annotating such clusters with clear chemical names, or using these for biological interpretations, is still limited.

Conclusion
ChemRICH outperforms classical pathway overrepresentation analysis approach for the interpretation of the metabolomics datasets. The approach can be used in studies to uncover biological mechanisms in organisms under a genetic or environmental stress in a system biology manner or finding risk factors for chronic diseases in exposome-wise association studies using blood specimens. ChemRICH is available via http://chemrich.fiehnlab. ucdavis.edu. The source code is available at www.github.com/barupal/chemrich.

Data availability
Metabolomics dataset is available at the Metabolomics Workbench repository (www.metabolomicsworkbenchlorg) with the accession number ST000075. ChemRICH R package and the code is available at www.github. com/barupal/chemrich