Abstract
Mass spectrometry (MS) instruments and experimental protocols are rapidly advancing, but the software tools to analyse tandem mass spectra are lagging behind. We present a database search tool MSGF+ that is sensitive (it identifies more peptides than most other database search tools) and universal (it works well for diverse types of spectra, different configurations of MS instruments and different experimental protocols). We benchmark MSGF+ using diverse spectral data sets: (i) spectra of varying fragmentation methods; (ii) spectra of multiple enzyme digests; (iii) spectra of phosphorylated peptides; and (iv) spectra of peptides with unusual fragmentation propensities produced by a novel alphalytic protease. For all these data sets, MSGF+ significantly increases the number of identified peptides compared with commonly used methods for peptide identifications. We emphasize that although MSGF+ is not specifically designed for any particular experimental setup, it improves on the performance of tools specifically designed for these applications (for example, specialized tools for phosphoproteomics).
Introduction
Mass spectrometry (MS) instruments and experimental protocols have greatly advanced over the last decade. New fragmentation technologies have emerged and highprecision mass spectrometers like Orbitrap have become widely available. Although trypsin remains a dominant protease in proteomics studies, digesting proteins with diverse proteases is becoming popular^{1}. Empowered by these changes, MS researchers now have diverse choices with respect to the questions: ‘what fragmentation method to use?’, ‘how accurate should be the measurements of the masstocharge (m/z) ratios?’, ‘what proteases to use?’ and ‘what posttranslational modification (PTM) to focus on (for example, phosphorylation)?’. Depending on these choices, the resulting tandem mass (MS/MS) spectra vary in fragmentation propensities and precision. Therefore, unlike in the past when lowprecision collisioninduced dissociation (CID) spectra of tryptic peptides dominated the field, spectral data sets generated today are very diverse. Unfortunately, the popular MS/MS database search tools such as SEQUEST^{2} and Mascot^{3} have not kept pace with the increased diversity of the data. Although several new MS/MS database search engines were recently developed, including Andromeda^{4}, Morpheus^{5}, and MS Amanda^{6}, they have resulted in only minor improvements as compared with SEQUEST and Mascot.
Many efforts have been invested into making existing MS/MS search tools compatible with new types of data. For example, several pre or postprocessing strategies have been proposed^{7,8}, resulting in small improvement in the performance of database search tools. To further boost the performance, MS/MS database search tools are combined with statistical modelling tools such as PeptideProphet^{9}, Percolator^{10} and IDPicker^{11}. These tools do not find new peptide–spectrum matches (PSMs), but rather rescore PSMs reported by a database search tool using more complex scoring and output highscoring PSMs. Although they often improve the performance of a database search tool, their performance is negatively affected when the database search tool fails to find correct PSMs^{12}. Another downside of the pre or postprocessing strategies and statistical modelling tools is that, as they are often not integrated into database search tools, using them complicates the analysis of MS/MS spectra. Moreover, as different laboratories employ different combinations of tools (see Fig. 1), even for the same data, capabilities of analysing the data vary widely and results obtained in one laboratory are often difficult to reproduce in another laboratory^{13}.
In a recent review, Noble and MacCoss^{14} pointed out that ‘the field (of MS) is still missing a generic analysis platform that can be adapted automatically and in a principled manner, to handle spectra produced by any given fragmentation protocol’. Our MSGF+ is a step towards achieving this goal, representing a universal database search tool that performs well for diverse types of spectral data sets. MSGF+ works well (that is, identifies more peptides than other MS/MS tools that we tested) for spectra generated using diverse configurations of MS instruments and experimental protocols. However, the main contribution of this study is not the increase in the number of identifications for dozens of various fragmentation methods and experimental protocols but rather the fact that it represents the first truly universal MS/MS database search tool. We emphasize that MSGF+ is not customized for specific spectral data sets but rather uses a robust probabilistic model that works well across all data sets.
MSGF+ is universal because it automatically derives scoring parameters from thousands of PSMs without prior knowledge of the type of the spectra^{12}. We represent various types of spectra as a graph where paths represent spectral types (Fig. 1). For each spectral type, MSGF+ learns scoring parameters separately and scores a PSM using a different set of scoring parameters depending on the spectral type. MSGF+ can train scoring parameters for any spectral type (including spectral types not specified in Fig. 1) or use pretrained scoring parameters. It takes over the authority to train scoring parameters to the users and makes the training easy.
The key advantage of MSGF+ over existing approaches is its ability to compute rigorous Evalues (using the generating function approach^{15}) and thus to boost the number of peptide identifications. Although the generating function approach from ref. 15 worked well in a variety of studies^{16,17,18,19}, the question of applying it to modified peptides and to highprecision MS/MS spectra remains open. In this study we address these issues, thus making the generating function approach applicable to all types of spectra.
We demonstrate the performance of MSGF+ using various previously studied data sets^{1,20,21,22,23}: spectra of tryptic peptides generated using CID, higherenergy collisional dissociation (HCD) and electron transfer dissociation (ETD) in combination with either linear ion trap or Orbitrap readout, spectra of multiple enzyme digests, spectra of phosphopeptides and spectra or a novel protease alphalytic protease (αLP). For all these data sets, we show that MSGF+ outperforms popular tools for peptide identification such as Mascot+Percolator.
Results
MSGF+ scoring
Database search tools use a scoring function Score(P, S) to evaluate a PSM of a peptide P and a spectrum S, and further compute statistical significance of the resulting PSMs. In this study, we use Evalues to evaluate statistical significance of individual PSMs (referred as spectral Evalues) and the targetdecoy approach to estimate false discovery rates (FDRs). See Gupta et al.^{24} for the details of our probabilistic framework.
Let P_{S} be a peptide that generated S. A scoring function is adequate for S (with respect to a protein database ProteinDB) if the correct peptide attains the maximal score in the database, that is, max_{PεProteinDB} Score(P, S)=Score(P_{S}, S). A ‘good’ scoring function should satisfy the following three conditions. First, it should be adequate for the great majority of spectra. Second, the algorithm for PSM scoring should be fast. Third, the algorithm for computing statistical significance (for example, Evalues) of PSMs should be fast and accurate.
MSGF+ uses a very simple dotproduct scoring Score(P, S)=P*·S* after converting peptide P and spectrum S into peptide vector P* and spectral vector S* (the spectral vector was called the prefixresiduemass spectrum in the previous publications^{12,25}). Conversion of a spectrum S into a spectral vector S* uses a probabilistic model that ensures that the resulting dotproduct scoring is adequate^{26} (first condition). At the same time, it makes scoring and computing accurate Evalues fast^{15} (second and third condition). This simple ‘dotproduct’ scoring model contrasts with many other database search^{2,4,27,28} and rescoring^{9,10} tools, using sophisticated scoring functions that often make it difficult to satisfy the third condition.
MSGF+ workflow
MSGF+ takes a spectral data set Spectra and a protein database ProteinDB as an input and outputs a set of scored PSMs along with Evalue estimates. It uses open source application programming interfaces jmzML^{29}, jmzReader^{30} and jmzIdentML^{31}, and supports the HUPO Proteomics Standard Initiative standard file formats—mzML^{32} and mzIdentML^{33}. Owing to these developments, MSGF+ has been already adopted in many proteomics pipelines and postprocessing tools.
The workflow of MSGF+ comprises the following four steps: generating spectral vectors, searching a protein database, computing Evalues of PSMs and estimating FDRs. Below, we describe each step as well as how MSGF+ takes advantage of high precision spectra.
Generating spectral vectors
A (nonmodified) peptide is defined as a string over the alphabet A of 20 standard amino acids. Let A^{+} be an extended amino acid set containing both unmodified and modified amino acids. For an (unmodified) amino acid aεA, let Mod(a)⊂A^{+} be the set that contains a and all its modified amino acids. For example, if T (Thr) and T* (phosphorylated Thr) are in A^{+}, Mod(T)={T, T*}. Given a peptide P=a_{1}…a_{k}, define PV=pυ_{1}…pυ_{k} as a variant of P if pυ_{i} ε Mod(a_{i}) for all i (1≤i≤k).
MSGF+ converts spectra into spectral vectors^{12,25}. A spectral vector of a spectrum S is an Mdimensional vector with integer values, where M=PrecursorMass(S) is the nominal precursor mass of S. Here we consider nominal precursor masses, representing that the sum of nominal masses of amino acids of the peptide generated the spectrum. As in many cases, the precise nominal precursor mass is unknown (for example, MS instruments often choose second or third isotope peak instead of monoisotope peak from MS1 spectrum), multiple spectral vectors are generated separately for each possible nominal precursor mass and the score of a peptide of mass M is computed from the spectral vector of precursor mass M.
The conversion from an experimental spectrum to a spectral vector proceeds as follows. A spectrum S={(mz_{1}, rank_{1}),…,(mz_{l}, rank_{l})} is represented as a set of ranked peaks where the ith highest intensity peak gets rank i (mz_{j} and rank_{j} represent m/z and rank of jth peak, respectively). An ion type is represented as a triplet of integers charge, offset and sign, where sign represents whether the ion type is a prefix ion (sign=1) or a suffix ion (sign=−1). For example, singly charged bions and yions correspond to ion types (1, 1, 1) and (1, 19, −1), respectively. Neutral losses and hydrogen transfers are also considered as ion types, for example, singly charged z·ions corresponds to (1, 3, −1). Given an ion type ion=(charge, offset, sign), one can turn a spectrum S into S_{ion}={(mass_{1}, rs_{1}),…, (mass_{l}, rs_{l})} using the following transformation:
where [x] represents the closest integer to x and RankScore(ion, rank) is a precomputed function that takes an ion type ion and an integer rank, and returns a probabilistic loglikelihood score defined in refs 12, 26. It is noteworthy that 0.9995 is a rescaling constant for minimizing rounding errors (see Supplementary Table 1). In practice, RankScore(ion, rank) also accounts for the location of the observed peak and the precursor charge and mass of the spectrum, which are omitted here for simplification. Ion types contributing to scoring are selected from the training set as described in Kim et al.^{12} Assume that is a set of ion types that are selected. The spectral vector of S (denoted by S= (s_{1},…,s_{M})) is computed as follows:
where RankScore(ion, ∞) represents the score given when ion is missing.
We also define a peptide vector of a variant as follows. Let Mass(a) be the nominal mass of a (possibly modified) amino acid a. For example, Mass(T)=101 and the mass of phosphorylated Thr is Mass(T*)=181. Given a variant PV=pυ_{1}…pυ_{k}, define the mass of PV as . Given a variant PV=pυ_{1}…pυ_{k} of mass M, we define its peptide vector (denoted by PV) as a 0–1 vector (m_{1},…,m_{M}) with (n−1) 1 s, such that m_{i}=1 if i equals to Mass(pυ_{1})+…+Mass(pυ_{j})(1≤j≤k).
The MSGF+ score of a PSM (PV, S) is defined as if Mass(PV)=PrecursorMass(S) and −∞ otherwise. The MSGF+ score represents the log likelihood ratio described in ref. 26.
Searching a protein database
We define ProteinDB^{+} as the set of all variants (with respect to an extended amino acid set A^{+}) derived from ProteinDB. The goal of MSGF+ database search is to solve the following problem: given a spectral data set Spectra and a protein database ProteinDB, for each spectrum S ε Spectra find a variant PV_{S,ProteinDB} such that
In contrast to a traditional spectrumbased MS/MS database search approach that compares each spectrum against all peptides, MSGF+ uses an alternative peptidebased approach that computes the suffix array to compare each peptide against all spectra with the same precursor mass. See Supplementary Note 1 for the details of MSGF+ approach to the database search.
Computing Evalues of PSMs
The scores of PSMs reported by existing MS/MS database search tools are often poorly correlated with their Evalues^{34}. It is important to rank PSMs based on their Evalues, because such ranking (rather than ranking based on ‘raw scores’) often dramatically increases the number of identified spectra under a given FDR^{15,35}. Many database search tools estimate an Evalue of a PSM based on an approximation of a tail of the score distribution specific to the spectrum using peptides in the database^{27,28}. As this approach is known to result in biased estimates of Evalues^{15}, MSGF+ adopted the generating function approach to rigorously compute Evalues of PSMs using the score distribution of all peptides^{15}. Our scoring model is essential here, because the generating function approach is easily applicable to the scoring functions that can be represented as a dotproduct of vectors^{24}. Adopting the generating function approach improves the accuracy of Evalue estimates and increases the number of identified peptides as was recently confirmed by an independent work on applying it to the XCorr score in SEQUEST^{35}.
Given a spectrum S, a score threshold t, an extended set of amino acids A^{+} and a database size N, we define Evalue (S, A^{+}, t,N) as the expected number of variants PV (as defined by A^{+}) with MSGFScore(PV,S)≥t in a random protein database of size N. To compute Evalue(S, A^{+}, t, N), we first compute spectral Evalue Evalue(S, A^{+}, t), the expected number of variants PV with MSGFScore(PV, S)≥t given a single random peptide. A single random peptide models a random peptide starting at a fixed position in a random protein database.
We consider a set of all possible (unmodified) peptides of length k (where k is a large number) and select a random peptide uniformly from this set (that is, the probability of selecting a peptide is ). In practice, to reflect different frequencies of amino acids in a database (for example, Leu is usually more common than Trp), we define the probability of a peptide P=a_{1}…a_{k} as , where Prob(a) is the frequency of amino acid a in a protein database. Note that this does not change the algorithm to compute the spectral Evalues. We say that a peptide P produces a variant PV if PV is a variant of a prefix of P. For example, PEPT* and PEPTI are produced by PEPTIDE. Given a spectrum S, let V(t) be the set of all variants PV with MSGFScore(PV, S)≥t. For every variant PV, there are 20^{k−PV} peptides of length k producing a variant PV (PV stands for the number of amino acids in PV). Therefore, expected number of variants per random peptide with a score equal or better than t is
As a variant is a string over the alphabet A^{+}, this expression can be computed using the generating function approach^{15}. Given a spectrum S with S=s_{1}…s_{M}, consider a directed acyclic graph called an amino acid graph G(V, E, A^{+}) with V={0, …, M} and E={(i, j)j−i ε Mass(a) for a ε A^{+}}, where the score of a vertex i is defined as s_{i}, the probability of an edge is defined as , the score of a path is defined as the sum of scores of its vertices and the probability of a path is defined as the product of probabilities of its edges. A path in an amino acid graph represents a variant. Therefore, Evalue(S, A^{+}, t) equals to the sum of probabilities of all paths from 0 to M with scores equal or better than t, and can be computed using parametric dynamic programming^{15,26,36}.
Although spectral Evalues are useful for evaluating statistical significance of individual PSMs (independently of the database), they need to be transformed into Evalue(S, A^{+}, t, N) to take into account the fact that the database search represents ‘multiple testing’ where multiple variants (arising from different database peptides) are scored against a spectrum^{37}. Evalues can be approximated as follows:
where N is the size of the database. It is noteworthy that as protein databases contain many repeated peptides, it is important to reflect the effective size of the database that is estimated as the number of unique peptides of certain length.
Estimating FDRs
MSGF+ estimates FDRs using the target decoy approach^{38,39}. See Supplementary Note 2 for details.
From lowprecision to highprecision MS/MS spectra
Mass spectrometers are usually divided into highprecision (denoted by H) and lowprecision (denoted by L) instruments. Depending on whether the precursor and product ions are measured with low or high precision, the spectra are divided into LL, LH, HL and HH spectra (LH spectra are hardly ever used in proteomics studies). Although it may appear that extending the generating function approach from LL (as defined in ref. 15) to HL and HH spectra is a simple matter of tuning parameters that control the error tolerance, the situation is more complex. Here we explain how MSGF+ takes advantage of highprecision product ion peaks.
Let RMass(a) be the real mass of an amino acid a. For a variant PV=pυ_{1}…pυ_{k}, let and RPrecursorMass(S) be the real precursor mass of a spectrum S. We previously assumed that Mass(PV) and PrecursorMass(S) are integers and defined MSGFScore(PV,S)=PV·S if Mass(PV)=PrecursorMass(S) and −∞ otherwise. It is noteworthy that this condition, although appropriate for LL spectra, is weak for HL and HH spectra, because it may be satisfied even when the real mass RMass(PV) significantly deviates (for example, up to 0.5 Da) from RPrecursorMass(S). Let represent the condition a−b<Δ. To take advantage of accurate precursor masses in HL and HH spectra, the condition Mass(PV)=PrecursorMass(S) has to be redefined to , where Δ is the precursor mass tolerance. The database search problem with this modified definition of MSGFScore is now described by the following equation:
where Spectra_{RMass(PV)} represents the set of spectra S ε Spectra satisfying .
The key part of the generating function approach is the assumption that amino acids have integer masses (otherwise the parametric dynamic programming is difficult to implement). However, rounding amino acid masses to integers introduces errors. These rounding errors reduce after rescaling by 0.9995, making them appropriate for LL and HL spectra. However, for HH spectra the rounding errors remain too large even after rescaling, prohibiting MSGF+ from benefiting from precise product ion peaks. A larger rescaling constant could better accommodate the mass accuracy, for example, the rescaling constant 274.335215 allows one to model spectra with 2.5 p.p.m. accuracy^{40}. However, as the time complexity of the generating function algorithm is proportional to the rescaling constant, this rescaling makes computing Evalues prohibitively slow.
Here we present a new scoring algorithm taking advantage of the accurate product ion masses while not substantially increasing the running time of MSGF+. In ref. 26, we introduced an abstract model (seemingly unrelated to MS) that described a probabilistic process of transforming a Boolean string (peptide vector) into another Boolean string (spectral vector). This model, although adequate for lowprecision spectra, needs to be modified for highprecision spectra. Here we model a peptide as a Boolean string (as before) but model a spectrum as a directed acyclic graph (DAG) and further apply a transformation of a Boolean string into a DAG for scoring real PSMs (see Methods for details).
Our new idea behind the DAG modelling is as follows. Consider peaks at masses 100.01 and 157.4 that will be transformed into integer bins 100 and 157 in the Boolean string representation of the spectrum. After this transformation, we lose information about the exact difference between these two masses. However, in our new spectral DAG model this information is retained in edges of the spectral DAG and used in the scoring.
Data sets
Overall, we used 19 data sets (≈2.83 million spectra from human, yeast, mouse and Schizosaccharomyces pombe) reflecting the diversity of MS data, corresponding to 17 distinct spectral types shown in Fig. 1 (see Methods for details on the data sets). For all these data sets, we benchmarked MSGF+ against popular tools for peptide identification such as Mascot+Percolator.
Comparison of MSGF+ with Mascot+Percolator
We compared the numbers of identified PSMs at 1% FDR for MSGF+ and Mascot+Percolator (that is, PSMs reported by Mascot and rescored by Percolator). Mascot+Percolator (Mascot version 2.3.02 integrating Percolator) was used for the comparison, because it represents a popular choice for peptide identification. We also tested several other tools such as SEQUEST, InsPecT^{25} and OMSSA but do not report their results because they identified significantly fewer PSMs as compared with Mascot+Percolator. See Supplementary Table S2 for database search parameters.
For all the 19 data sets, MSGF+ identified significantly more PSMs compared with Mascot+Percolator (Fig. 2). Figure 3a shows the benchmarking results for the five human data sets generated with varying fragmentations and instruments^{20}. Percolator greatly increased the number of identifications as compared with Mascot, but for all these data sets MSGF+ identified significantly more PSMs (17–38%) than Mascot+Percolator (see Supplementary Fig. S1 for Venn diagrams of MSGF+ and Mascot+Percolator identifications). We also compared the number of identifications reported by the original study^{20}, which also used Mascot+Percolator along with inhouse pre and postprocessing tools. In this comparison, MSGF+ also showed an improved performance (identifying 16–55% more PSMs).
To figure out how each tool benefits from highprecision product ion peaks, for the three out of five human data sets representing HH spectra, we ran MSGF+, Mascot+Percolator and Mascot using the parameters for HL spectra, that is, using 0.6 Da fragment mass tolerance for Mascot and Mascot+Percolator, and using the scoring model for lowprecision spectra for MSGF+. For every tool, the number of identifications was higher when the parameters for HH spectra were used, but the difference varied depending on the data set (Fig. 3b), and was negligible for ETD spectra.
Figure 3c shows the comparison for the ten yeast data sets generated with varying fragmentations (CID or ETD) and enzymes (Trypsin, LysC, ArgC, GluC or AspN)^{1}. Again, for all these data sets, MSGF+ identified significantly more PSMs (34–168%) than Mascot+Percolator (Fig. 3c). In ref. 1, using OMSSA (and inhouse tools for pre and postprocessing), the authors reported the number of identified peptides at 1% peptidelevel FDR that are matched to proteins identified at 1% proteinlevel FDR. We compared these numbers with the numbers of identified peptides at 1% peptidelevel FDR using MSGF+ (Fig. 3d). Note that this comparison is unfair because peptide identifications by MSGF+ were not filtered out according to the protein that they are matched to. However, even after considering that, the results show that for most of the data sets MSGF+ identified many more peptides than the original report.
To see whether our scoring model can capture the fragmentation propensities specific to phosphopeptides, we generated a scoring parameter set for (CID, Low, Phosphorylation, Trypsin). For the mouse data set corresponding to (CID, Low, Phosphorylation, Trypsin), we compared the numbers of identified PSMs for MSGF+ with and without using the phosphorylationspecific parameter set, Mascot+Percolator and InsPecT equipped with a dedicated scoring model for (CID, Low, Phosphorylation, Trypsin)^{41} (Supplementary Fig. S2a). Interestingly, without phosphorylationspecific scoring parameters, MSGF+ outperformed both tools, identifying 37% and 44% more PSMs than Mascot+Percolator and InsPecT, respectively. With phosphorylationspecific parameters, MSGF+ identified 9% more PSMs (and 12% more PSMs of phosphopeptides), confirming that our scoring model successfully captures phosphorylationspecific fragmentation propensities.
A similar result was obtained for a (CID, Low, Ubiquitination, Trypsin) data set (Supplementary Fig. S3). We emphasize that MSGF+ does not ‘know’ anything about the peculiarities of the phosphorylation or ubiquitination, and simply trains the scoring parameters in exactly the same way it does for other spectral types. This ability to easily train modificationspecific scoring parameters for any modification will greatly benefit MS researchers studying PTMs.
MSGF+ for identifying peptides produced by a new protease
αLP is a new protease with cleavage specificities somewhat ‘orthogonal’ to trypsin^{23}. MSGF+ was applied to the study of αLP using two S. pombe data sets corresponding to (CID, Low, Standard, αLP) and (ETD, Low, Standard, αLP). We ran Mascot+Percolator, OMSSA and MSGF+ by specifying ‘None’ as an enzyme. As αLP produces peptides with different fragmentation propensities than tryptic peptides, Mascot+Percolator and OMSSA performed very poorly for this novel spectral type. In contrast, MSGF+ identified 3,535 and 2,829 PSMs from the (CID, Low, Standard, αLP) and (ETD, Low, Standard, αLP) dataset using the scoring parameters for (CID, Low, Standard, Trypsin) and (ETD, Low, Standard, Trypsin), respectively (Supplementary Fig. S2b). The superior performance of MSGF+ over Mascot+Percolator and OMSSA is because its scoring function is adequate for αLP peptides (correct peptide attains the maximal score) for a large portion of the spectra even when the search space is large (that is, no enzyme is specified). In fact, for the human data set corresponding to (ETD, Low, Standard, Trypsin), when no enzyme was specified and precursor mass tolerance 2.5 Da was used, MSGF+ identified 10,937 PSMs, only 34% less as compared with the fully tryptic search with 7 p.p.m. precursor mass tolerance.
Using the identified PSMs by MSGF+, we trained scoring parameters for (CID, Low, Standard, αLP) and (ETD, Low, Standard, αLP). When these αLPspecific scoring parameters were used, the number of identified PSMs further increased to 4,788 (+35%) and 3,313 (+17%) for (CID, Low, Standard, αLP) and (ETD, Low, Standard, αLP), respectively, showing the usefulness of MSGF+ for studies of new proteases.
Thus, αLP represents a new alternative to trypsin, greatly increasing the PTM and protein sequence coverages, but generating spectra with unusual fragmentation propensities. We emphasize that the capabilities of αLP are not obvious when Mascot+Percolator or another tool is used, because it fails to identify αLP peptides. The details on αLP protease have been discussed in a separate paper^{23}.
Running time of MSGF+
We measured the running time of MSGF+ and Mascot+Percolator for LL, HL and HH spectra for various spectral types. For all the searches, MSGF+ and Mascot+Percolator showed similar running times (Supplementary Fig. S2c,d).
Discussion
Our analysis and recent independent studies^{35,42,43,44} showed that for diverse types of spectral data sets, MSGF+ identifies more PSMs as compared with existing database search tools such as Mascot, X!Tandem, OMSSA, Crux, Comet and InsPecT, and statistical modelling tools such as Percolator. We emphasize that the generating function approach for accurately computing Evalues significantly contributes to the improved performance of MSGF+. For example, when Evalues instead of MSGF scores were used to cut off the results, the number of identified PSMs increased approximately by 70%, 50% and 20% for LL, HL and HH spectra, respectively.
Although we focused on demonstrating MSGF+ as a standalone tool, we emphasize that MSGF+ can be combined with various other proteomics analysis tools. As we have decided to release MSGF+ in 2012 well before this paper was prepared for a journal submission, MSGF+ has already been integrated into the following pipelines and statistical modelling tools: TransProteomics Pipeline^{45}, GalaxyP^{46}, ProteoSuite^{47}, IDPicker^{11}, SearchGUI^{48}, Scaffold^{48}, ProteoSAFe, Skyline^{49} and Percolator^{10,50}. Peptide identification tools that combine the results of multiple database search tools such as MSblender^{51}, PeptideShaker^{52} and PepArML^{53} also currently support MSGF+. MSGF+ is freely available at http://proteomics.ucsd.edu.
Methods
Spectral DAG model
Given an extended alphabet A^{+}, we first explain how to convert a spectrum S into a labelled DAG G. G=(V,E) has a vertex set V={0, …, M=PrecursorMass(S)} and an edge set E={(i, j)j−i ε Mass(a) for a ε A^{+}}. For simplicity, suppose that the set of ion types ={(1, 0, 1)} (that is, only singly charged prefix ions with an offset zero contribute to the scoring). Given a constant δ called a fragment mass tolerance, two peaks of S with m/z x and y form a duo if y−x is approximately equal to a mass of an amino acid, that is, for a ε A^{+}. The vertex label s_{i} and the edge label s_{i,j} of G are defined as follows: s_{i}=1 if there exists a peak of mass x satisfying [0.9995·x]=i and s_{i}=0 otherwise; s_{i,j}=1, if there exists a duo of peaks with masses x and y such that [0.9995·x]=i and [0.9995·y]=j, and s_{i,j}=0 otherwise (see Fig. 4 for an example).
Let P=p_{1}…p_{M} be a Boolean string representing a peptide. Similar to the studies by Kim et al.^{26} where a peptide string generates a spectrum string, we now assume that a peptide string generates a DAG. The probability of a peptide P generating a DAG G is defined as follows:
where Prob(xy) is a 2 × 2 matrix representing the probability of a peptide character y (0 or 1) generating a vertex label x and Prob(xy, z) is a 2 × 4 matrix representing the probability of a pair of peptide characters y and z generating an edge label x (Table 1). In practice, β_{1}≈β_{2}≈β_{3} (see Table 1b).
When applying this model for scoring a peptide P and a DAG G, we consider a test comparing two hypotheses: one assuming G is generated by P and the other assuming G is generated by an ‘empty’ string consisting of all zeros (denoted by O). The loglikelihood score of (P,G) (denoted Score(P,G)) is defined as follows (see Fig. 5 for an example):
Note that the last equation assumes that only the edges (i, j) with p_{i}=p_{j}=1 contribute to the edge scoring because β_{1}≈β_{2}≈β_{3}.
In practice, we generate multiple DAGs for a single spectrum, one for each ion ε . To generate an ion DAG for ion=(charge, offset, sign) with a real offset r offset, (for example, real offset of the singly charged bion is 1.008), we first convert S={(mz_{1}, rank_{1}), …, (mz_{l}, rank_{l})} into S′={(mass_{1}, rank_{1}), …, (mass_{l}, rank_{l})} using the following transformation:
Each peak of S representing ion corresponds to a peak of this converted spectrum S′ representing an ion type (1, 0, 1). Therefore, the vertex and edge labels of the ion DAG for ion are defined as outlined before, but using S′ instead of S (Fig. 4).
In reality, vertex and edge labels in the ion DAGs are integers rather than Boolean values. Given a converted spectrum S′, we first remove all peaks (x, rank) if there exists another peak (x′, rank′) where [0.9995·x]=[0.9995·x′] and rank > rank′. The vertex label s_{i} is defined as follows: s_{i}=rank if there exists a peak (x, rank) satisfying [0.9995·x]=i and s_{i}=0 otherwise. For an integer m, let AminoAcid(m) be the set of amino acids a ε A^{+} satisfying Mass(a)=m (for example, AminoAcid(128)={Gln, Lys}). The edge label s_{i,j} is defined as follows: s_{i,j}=[100·min_{aεAminoAcid(j−i)}(y−x−RMass(a))] if there exists a duo of peaks with masses x and y such that [0.9995·x]=i and [0.9995·y]=j, and s_{i,j}=∞ otherwise. The constant 100 is multiplied to discretize the realvalued errors into bins of size 0.01 Da.
In this ion DAG representation, vertex labels encode the information on the intensities of individual peaks and the edge labels encode the information on the mass errors of pairs of peaks assuming they represent consecutive peaks of the same ion type. Note that edge labels take into account the spacing between peaks but do not take into account the peak intensities.
Supplementary Note 3 describes how to integrate information from various ion DAGs into a single spectral DAG.
Human data sets with varying fragmentations and instruments
Five human data sets corresponding to the spectral types (CID, Low, Standard, Trypsin), (CID, High, Standard, Trypsin), (ETD, Low, Standard, Trypsin), (ETD, High,Standard, Trypsin) and (HCD, High, Standard, Trypsin) contain 38,401, 33,586, 30,451, 25,734 and 37,810 spectra, respectively. These data sets are generated in the Heck Laboratory (Utrecht University). HEK293 wholecell lysates were digested by trypsin and analysed by LTQOrbitrap Velos (Thermo Fisher Scientific, Bremen), using combinations of one of the three fragmentation modes CID, ETD and HCD, and either ion trap or Orbitrap readout for product ion m/z. The detailed experimental procedures are described in ref. 20.
Yeast data sets with varying enzymes
Ten yeast data sets corresponding to the spectral types (CID, Low, Standard, Trypsin), (CID, Low, Standard, LysC), (CID, Low, Standard, ArgC), (CID, Low, Standard, GluC), (CID, Low, Standard, AspN), (ETD, Low, Standard, Trypsin), (ETD, Low, Standard, LysC), (ETD, Low, Standard, ArgC), (ETD, Low, Standard, GluC) and (ETD, Low, Standard, AspN) contain 333,203, 278,336, 114,351, 81,669, 251,974, 72,463, 246,428, 204,860, 88,403 and 262,635 spectra, respectively. These data sets were generated in the Coon Laboratory (University of Wisconsin Madison). Yeast wholecell lysates were digested separately, with either trypsin, LysC, ArgC, GluC or AspN, separated into 12 fractions via strong cation exchange chromatography and analysed in triplicate with an ETDenabled LTQOrbitrap mass spectrometer, where peptide fragmentation was accomplished either with CID or ETD using the decisiontree acquisition mode^{54}. We downloaded 180 (5 enzymes × 12 fractions × 3 replicates) spectrum files (Thermo RAW format) and converted each raw file into two mgf files, one containing CID and the other containing ETD spectra using ‘msconvert’ in ProteoWizard^{55} with ‘no filtering’ option. The conversion was unsuccessful for 6 out of 180 files (5 from ArgC and 1 from GluC digests). These six files were removed in the further analyses. The detailed experimental procedures are described in ref. 1.
Mouse data set of phosphopeptides
A mouse data set corresponding to the spectral type (CID, Low, Phosphorylation, Trypsin) contains 181,093 spectra. This data set was generated from the Gygi Laboratory (Harvard Medical School). Nine mouse organ proteins were digested with trypsin and the resulting peptides were fractionated via strong cation exchange. Phosphopeptides were enriched via immobilized metal affinity chromatography and analysed in duplicates via LCMS/MS on an LTQOrbitrap mass spectrometer. Out of nine organ tissues analysed, we used the spectra generated from the brain tissue. The detailed experimental procedures are described in ref. 21.
S. Pombe data sets with αLP digest
Two data sets corresponding to the spectral type (CID, Low, Standard, αLP) and (ETD, Low, Standard, αLP) contain 49,167 spectra each. These data sets were generated in the Komives Laboratory (University of California, San Diego). The detailed experimental procedures to generate these data sets are as follows. Wildtype S. pombe cells were lysed in 50 mM Tris–HCl pH 8.0, 150 mM NaCl, 5 mM EDTA, 10% glycerol, 50 mM NaF, 0.1 mM Na3VO4, 0.2% NP40 and stored at −80 °C. The debris was pelleted and then the supernatant was collected. The pellet was extracted according to ref. 56. Briefly, the pellet was resuspended in 200 μl of 0.1 M NaOH, 0.05 M ETDA, 2% SDS and 2% βmercaptoethanol and incubated at 90 °C for 10 min. Acetic acid was added to 0.1 M and vortexed, followed by an additional incubation at 90 °C for 10 min before clarification by centrifugation and methanol/chloroform extraction. The pellet was resuspended in 100 mM Tris containing 0.1% sodium deoxycholate with TCEP at 5 mM. Free thiols were capped with Nethylmaleimide. Excess reagent was removed by ultrafiltration with amicon4 10 kDa centrifugal devices. The protein was then quantified and exchanged into 6 M guanidine for digestion overnight by αLP. The digests were quenched by the addition of formic acid to 1%, followed by desalting by seppak (Waters, Milford, MA). Peptides were then fractionated with electrostatic repulsionhydrophilic interaction chromatography^{57}. Fractions were assayed for protein concentration using a BCA assay and pooled into 18 fractions of equal protein concentration, evaporated to dryness and resuspended in 100 μl of 0.2% formic acid. Nano LCMS/MS was performed with a LTQ XL mass spectrometer equipped with ETD. Ten microlitres of each fraction (≈1 μg) was injected onto a 12 cm × 75 μm I.D.C18 column prepared in house and eluted in 0.2% FA with a gradient of 5 to 40% Acetonitrile over 60 min followed by wash and reequilibration totalling 90 min of MS data per run. The flow was split about 1:500 to a flow rate of about 250 nl min^{−1}. A survey scan was followed by datadependent fragmentation of the four most abundant ions with both CID and ETD, with supplemental activation. The maximum MS/MS ion accumulation time was set to 100 ms. Fragmented precursors were dynamically excluded for 45 s with one repeat allowed.
Training scoring parameters
At the beginning of this study, we had five scoring parameter sets used in ref. 12 for the following five spectral types: (CID, Low, Standard, Trypsin), (CID, Low, Standard, LysN), (ETD, Low, Standard, Trypsin), (ETD, Low, Standard, LysN) and (ETD, Low, Standard, LysC). For this study, we constructed 20 new parameter sets using these 5 parameter sets as a starting point, using a newly developed programme called ScoringParamGen within the MSGF+ package. To train scoring parameters for a new spectral type, MSGF+ was run with an existing parameter set to identify PSMs at 1% FDR threshold, and using the identified PSMs as a training set a new parameter set was constructed. Supplementary Fig. S4 shows the scoring parameter sets contained in MSGF+ and how they were constructed. We also tried to construct another generation of parameter sets using the existing parameter sets for the same spectral types, but this ‘iterative training’ hardly changed the number of identified PSMs.
For some data sets, the same data set was used for both training and testing of the performance, raising concerns about overfitting. However, as shown in ref. 12, MSGF+ scoring parameter set characterizes a particular spectral type and is rather stable with respect to specific data sets. For example, for the human (CID, Low, Standard, Trypsin) data set, when the scoring parameter set trained from the same data set was used instead of the data set used in ref. 12, the number of identified PSMs hardly changed.
Additional information
How to cite this article: Kim, S. and Pevzner, P. A. MSGF+ makes progress towards a universal database search tool for proteomics. Nat. Commun. 5:5277 doi: 10.1038/ncomms6277 (2014).
References
 1
Swaney, D. L., Wenger, C. D. & Coon, J. J. Value of using multiple proteases for largescale mass spectrometrybased proteomics. J. Proteome Res. 9, 1323–1329 (2010).
 2
Eng, J., McCormack, A. & Yates, J. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976–989 (1994).
 3
Perkins, D., Pappin, D., Creasy, D. & Cottrell, J. Probabilitybased protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567 (1999).
 4
Cox, J. et al. Andromeda: A peptide search engine integrated into the Maxquant environment. J. Proteome Res. 10, 1794–1805 (2011).
 5
Wenger, C. D. & Coon, J. J. A proteomics search algorithm specifically designed for highresolution tandem mass spectra. J. Proteome Res. 12, 1377–1386 (2013).
 6
Dorfer, V. et al. MS Amanda, a universal identification algorithm optimized for high accuracy tandem mass spectra. J. Proteome Res. 13, 3679–3684 (2014).
 7
Sweet, S. M. M. et al. Database search strategies for proteomic data sets generated by electron capture dissociation mass spectrometry. J. Proteome Res. 8, 5475–5484 (2009).
 8
Hsieh, E. J., Hoopmann, M. R., Maclean, B. & Maccoss, M. J. Comparison of database search strategies for high precursor mass accuracy MS/MS data. J. Proteome Res. 9, 1138–1143 (2009).
 9
Keller, A., Nesvizhskii, A., Kolker, E. & Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 74, 5383–5392 (2002).
 10
Käll, L., Canterbury, J. D., Weston, J., Noble, W. S. & Maccoss, M. J. Semisupervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 4, 923–925 (2007).
 11
Ma, Z.Q. et al. Idpicker 2.0: Improved protein assembly with high discrimination peptide identification filtering. J. Proteome Res. 8, 3872–3881 (2009).
 12
Kim, S. et al. The generating function of CID, ETD, and CID/ETD pairs of tandem mass spectra: applications to database search. Mol. Cell. Proteomics 9, 2840–2852 (2010).
 13
Yates, J. R. et al. Toward objective evaluation of proteomic algorithms. Nat. Methods 9, 455–456 (2012).
 14
Noble, W. S. & Maccoss, M. J. Computational and statistical analysis of protein mass spectrometry data. PLoS Comput. Biol. 8, e1002296 (2012).
 15
Kim, S., Gupta, N. & Pevzner, P. Spectral probabilities and generating functions of tandem mass spectra: A strike against decoy databases. J. Proteome Res. 7, 3354–3363 (2008).
 16
Zhou, J.Y. et al. Simple sodium dodecyl sulfateassisted sample preparation method for LCMSbased proteomics applications. Anal. Chem. 84, 2862–2867 (2012).
 17
Dresang, L. R. et al. Coupled transcriptome and proteome analysis of human lymphotropic tumor viruses: insights on the detection and discovery of viral genes. BMC Genom 12, 625 (2011).
 18
Wang, L. et al. Mapping Nlinked glycosylation sites in the secretome and whole cells of aspergillus niger using hydrazide chemistry and mass spectrometry. J. Proteome Res. 11, 143–156 (2012).
 19
Wrighton, K. C. et al. Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla. Science 337, 1661–1665 (2012).
 20
Frese, C. K. et al. Improved peptide identification by targeted fragmentation using CID, HCD and ETD on an ltqorbitrap velos. J. Proteome Res. 10, 2377–2388 (2011).
 21
Huttlin, E. L. et al. A tissuespecific atlas of mouse protein phosphorylation and expression. Cell 143, 1174–1189 (2010).
 22
Starita, L. M., Lo, R. S., Eng, J. K., von Haller, P. D. & Fields, S. Sites of ubiquitin attachment in Saccharomyces cerevisiae. Proteomics 12, 236–240 (2012).
 23
Meyer, J. G. et al. Expanding proteome coverage with orthogonalspecificity αlytic proteases. Mol. Cell. Proteomics 13, 823–835 (2014).
 24
Gupta, N., Bandeira, N., Keich, U. & Pevzner, P. A. Targetdecoy approach and false discovery rate: when things may go wrong. J. Am. Soc. Mass Spectrom. 22, 1111–1120 (2011).
 25
Tanner, S. et al. Inspect: identification of posttranslationally modified peptides from tandem mass spectra. Anal. Chem. 77, 4626–4639 (2005).
 26
Kim, S., Gupta, N., Bandeira, N. & Pevzner, P. Spectral dictionaries: Integrating de novo peptide sequencing with database search of tandem mass spectra. Mol. Cell. Proteomics 8, 53–69 (2009).
 27
Geer, L. Y. et al. Open mass spectrometry search algorithm. J. Proteome Res. 3, 958–964 (2004).
 28
Craig, R. & Beavis, R. C. Tandem: matching proteins with tandem mass spectra. Bioinformatics 20, 1466–1467 (2004).
 29
Côté, R. G., Reisinger, F. & Martens, L. jmzMl, an opensource Java API for mzMl, the PSI standard for MS data. Proteomics 10, 1332–1335 (2010).
 30
Griss, J., Reisinger, F., Hermjakob, H. & Vizcano, J. A. jmzReader: A Java parser library to process and visualize multiple text and XMLbased mass spectrometry data formats. Proteomics 12, 795–798 (2012).
 31
Reisinger, F. et al. jmzIdentML API: A Java interface to the mzIdentML standard for peptide and protein identification data. Proteomics 12, 790–794 (2012).
 32
Martens, L. et al. mzMla community standard for mass spectrometry data. Mol. Cell. Proteomics 10, R110.000133 (2010).
 33
Jones, A. R. et al. The mzIdentML data standard for mass spectrometrybased proteomics results. Mol. Cell. Proteomics 11, M111.014381 (2012).
 34
Granholm, V., Noble, W. S. & Käll, L. On using samples of known protein content to assess the statistical calibration of scores assigned to peptidespectrum matches in shotgun proteomics. J. Proteome Res. 10, 2671–2678 (2011).
 35
Howbert, J. J. & Noble, W. S. Computing exact pvalues for a crosscorrelation shotgun proteomics score function. Mol. Cell. Proteomics 13, 2467–2479 (2014).
 36
Jeong, K., Kim, S., Bandeira, N. & Pevzner, P. A. Gapped spectral dictionaries and their applications for database searches of tandem mass spectra. Mol. Cell. Proteomics 10, 002220 (2011).
 37
Noble, W. S. How does multiple testing correction work? Nat. Biotechnol. 27, 1135–1137 (2009).
 38
Elias, J. E. & Gygi, S. P. Targetdecoy search strategy for increased confidence in largescale protein identifications by mass spectrometry. Nat. Methods 4, 207–214 (2007).
 39
Jeong, K., Kim, S. & Bandeira, N. False discovery rates in spectral identification. BMC Bioinformatics 13, (Suppl 16): S2 (2012).
 40
Liu, X., Segar, M. W., Li, S. C. & Kim, S. Spectral probabilities of topdown tandem mass spectra. BMC Genomics 15, (Suppl 1): S9 (2014).
 41
Payne, S. H. et al. Phosphorylationspecific MS/MS scoring for rapid and accurate phosphoproteome analysis. J. Proteome Res. 7, 3373–3381 (2008).
 42
Stekhoven, D. J., Omasits, U., Quebatte, M., Dehio, C. & Ahrens, C. H. Proteomewide identification of predominant subcellular protein localizations in a bacterial model organism. J. Proteom 99, 123–137 (2014).
 43
Risk, B. A., Edwards, N. J. & Giddings, M. C. A peptidespectrum scoring system based on ion alignment, intensity, and pair probabilities. J. Proteome Res. 12, 4240–4247 (2013).
 44
Lange, P. F., Huesgen, P. F., Nguyen, K. & Overall, C. M. Annotating N termini for the human proteome project: N termini and Nαacetylation status differentiate stable cleaved protein species from degradation remnants in the human erythrocyte proteome. J. Proteome Res. 13, 2028–2044 (2014).
 45
Deutsch, E. W. et al. A guided tour of the transproteomic pipeline. Proteomics 10, 1150–1159 (2010).
 46
Goecks, J., Nekrutenko, A., Taylor, J. & Team, G. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11, R86 (2010).
 47
GonzalezGalarza, F. F. et al. A critical appraisal of techniques, software packages, and standards for quantitative proteomic analysis. OMICS 16, 431–442 (2012).
 48
Vaudel, M., Barsnes, H., Berven, F. S., Sickmann, A. & Martens, L. SearchGUI: An opensource graphical user interface for simultaneous OMSSA and X!Tandem searches. Proteomics 11, 996–999 (2011).
 49
MacLean, B. et al. Skyline: an open source document editor for creating and analyzing targeted proteomics experiments. Bioinformatics 26, 966–968 (2010).
 50
Granholm, V. et al. Fast and accurate database searches with MSGF+Percolator. J. Proteome Res. 13, 890–897 (2014).
 51
Kwon, T., Choi, H., Vogel, C., Nesvizhskii, A. I. & Marcotte, E. M. Msblender: A probabilistic approach for integrating peptide identifications from multiple database search engines. J. Proteome Res. 10, 2949–2958 (2011).
 52
Kroksveen, A. C. et al. Cerebrospinal fluid proteome comparison between multiple sclerosis patients and controls. Acta Neurol. Scand. Suppl. 90–96 (2012).
 53
Edwards, N. PepArML: A metasearch peptide identification platform for tandem mass spectra. Curr. Protoc. Bioinformatics 44, 13.23.1–13.23.23 (2013).
 54
Swaney, D. L., McAlister, G. C. & Coon, J. J. Decision treedriven tandem mass spectrometry for shotgun proteomics. Nat. Methods 5, 959–964 (2008).
 55
Kessner, D., Chambers, M., Burke, R., Agus, D. & Mallick, P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 24, 2534–2536 (2008).
 56
von der Haar, T. Optimized protein extraction for quantitative proteomics of yeasts. PLoS ONE 2, e1078 (2007).
 57
Hao, P. et al. Novel application of electrostatic repulsionhydrophilic interaction chromatography (ERLIC) in shotgun proteomics: comprehensive profiling of rat kidney proteome. J. Proteome Res. 9, 3520–3526 (2010).
 58
van den Toorn, H. W. P. et al. RockerBox: analysis and filtering of massive proteomics search results. J. Proteome Res. 10, 1420–1424 (2011).
Acknowledgements
We thank Jesse Meyer and Elizabeth Komives for providing the αLP data sets, and Nuno Bandeira, Matthew Monroe and Taejoon Kwon for providing feedback to improve MSGF+. This work was supported by the US National Institutes of Health grant 1P41RR024851 from the National Center for Research Resources.
Author information
Affiliations
Contributions
S.K. and P.P. designed the algorithms and the experiments, and wrote the manuscript. S.K. implemented the algorithms and performed the data analysis.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Information
Supplementary Figures 16, Supplementary Tables 12, Supplementary Notes 13 and Supplementary References (PDF 888 kb)
Rights and permissions
About this article
Cite this article
Kim, S., Pevzner, P. MSGF+ makes progress towards a universal database search tool for proteomics. Nat Commun 5, 5277 (2014). https://doi.org/10.1038/ncomms6277
Received:
Accepted:
Published:
Further reading

The Campylobacter jejuni CiaD effector coopts the host cell protein IQGAP1 to promote cell entry
Nature Communications (2021)

Proteoform characterization based on topdown mass spectrometry
Briefings in Bioinformatics (2021)

MultiOmics Driven Metabolic Network Reconstruction and Analysis of Lignocellulosic Carbon Utilization in Rhodosporidium toruloides
Frontiers in Bioengineering and Biotechnology (2021)

Mass SpectrometryBased Structural Analysis of CysteineRich MetalBinding Sites in Proteins with MetaOdysseus R Software
Journal of Proteome Research (2021)

A Review of Imputation Strategies for Isobaric LabelingBased Shotgun Proteomics
Journal of Proteome Research (2021)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.