Significance estimation for large scale metabolomics annotations by spectral matching

The annotation of small molecules in untargeted mass spectrometry relies on the matching of fragment spectra to reference library spectra. While various spectrum-spectrum match scores exist, the field lacks statistical methods for estimating the false discovery rates (FDR) of these annotations. We present empirical Bayes and target-decoy based methods to estimate the false discovery rate (FDR) for 70 public metabolomics data sets. We show that the spectral matching settings need to be adjusted for each project. By adjusting the scoring parameters and thresholds, the number of annotations rose, on average, by +139% (ranging from −92 up to +5705%) when compared with a default parameter set available at GNPS. The FDR estimation methods presented will enable a user to assess the scoring criteria for large scale analysis of mass spectrometry based metabolomics data that has been essential in the advancement of proteomics, transcriptomics, and genomics science.

datasets that were absent from the target library GNPS. In all cases, we nd an isomer in the target library that has high structural similarity to the query; in all cases, the spectral similarity between query and this hit in the target library is 0.75 or higher using the MassBank score, and would be accepted using a xed threshold of 0.7. The Agilent dataset also contains Carbamazepine-10, and querying this spectrum results in the same hit in the GNPS library, with the same MassBank  version of the GNPS library, using the MassBank scoring function and ten independently created decoy databases. The line shows the mean values for the ten decoy databases, whereas the range of estimated q-values is shaded. Results for the empirical Bayes approach and the three target-decoy approaches. For the fragmentation tree-based method, we searched against the noise-ltered GNPS only, since this approach applies noise-ltering by design.   Figure 6: Annotation rates. Top: Bar graph of annotation rates for each data set.
Bottom: Annotation rate for 1% and 5% FDR cut-o in relation to the default cosine score of 0.7. The small annotation rates are due to a) less studied systems having few matches to molecules in public reference libraries, as there is a strong bias in them; and b) the limited reference spectra in the public domain that could be used for FDR estimations based on open access that allows redistribution and our inclusion criteria (see methods; we used the same spectra to determine the annotation rate at default GNPS settings) resulting in low resolution spectra not being included. This annotation value increases as the number of reference spectra that can be included and provided back to the community increase. Currently commercial licenses prevent the inclusion of larger libraries in the GNPS infrastructure and therefore this approach enables assessment of parameters to use for FDR calculations to use scoring parameters guided by statistics when performing a search with all the reference libraries accessible to the user. With MassBank, GNPS and other resources making this information freely available, the rate of annotations will increase when performing FDR estimations. .
To compute the FDR for a certain score threshold x s , we estimate the expected value of the fraction of incorrect hits among all hits (correct or incorrect) with score at least x s (Expected Bayes approach). Let S = {x 1 , x 2 , ..., x m } be the set of match scores ≥ x s . The probability P (X = k) is the probability that exactly k of these m scores are from incorrect hits. Then the expected FDR is The probability of having k successful trials out of a total of m trials is described by a Poisson binomial distribution, a discrete probability distribution of a sum of independent Bernoulli trials that are not necessarily identically distributed. The expected value of a random variable X that is Poisson binomial distributed with success probabilities PEP ( PEP (x i ) and, thus, the expected FDR reduces to the average PEP of all hits with score at least x s : Estimating the FDR using the integral of TP and FP distributions resulted in slightly worse estimates.
For the distribution of TP and FP scores, we tested six parametric distributions and their mirrored counterparts, namely Gamma, Gumbel, Weibull, Exponential, Normal and Logistic distribution. We mirror distributions around the y-axis and shift them by x max . To avoid innity probability density values for zero entries, we shift functions by −10 −5 . To nd the best model, each distribution is tted individually using Maximum Likelihood estimates, and evaluated based on the resulting likelihood. We nd that scores of FPs are modeled best using a Gamma distribution, and scores of TPs using either a mirrored Gamma distribution, a mirrored Gumbel distribution, or a mirrored Weibull distribution.
In application, we do not know which scores belong to correct and which to bogus hits. To this end, we use Expectation Maximization (EM) to nd the best-tting mixture of distributions according to likelihood. We consider three combinations of parametric distributions for FP/TP, namely Gamma/Gamma (mirrored), Gamma/Gumbel (mirrored) and Gamma/Weibull (mirrored).
Fitted distributions for FPs and TPs closely match the histograms for the Agilent and MassBank datasets, although it is not known to the EM algorithm which hits are true and which ones are false.
In most cases, the best-tting distributions of the two-component mixture correspond to the two individually determined distributions for TP and FP scores. We note that shapes of the available distributions for TPs (mirrored Gamma, Gumbel and Weibull distribution) are very similar.
In our evaluations using the Agilent and MassBank datasets, distributions were tted using Expectation Maximization. For the unltered Agilent dataset, EM determined that a mirrored Weibull distribution for TP and a Gamma distribution for FP best tted the joint distribution.
For the unltered MassBank dataset, EM determined a mirrored Gumbel distribution for TP and a Gamma distribution for FP. For both noise-ltered datasets, EM determined a mirrored Gamma distribution for TP and a Gamma distribution for FP.