Letter
Nature Biotechnology - 24, 1285 - 1292 (2006)
Published online: 10 September 2006; | doi:10.1038/nbt1240

A probability-based approach for high-throughput protein phosphorylation analysis and site localization

Sean A Beausoleil1, Judit Villén1, Scott A Gerber2, John Rush3 & Steven P Gygi1

1 Department of Cell Biology, Harvard Medical School, 240 Longwood Ave., Boston, Massachusetts 02115, USA.

2 Department of Genetics and Norris Cotton Cancer Center, Lebanon, New Hampshire 03755, USA.

3 Cell Signaling Technology, Inc., Beverly, Massachusetts 01915, USA.

Correspondence should be addressed to Steven P Gygi steven_gygi@hms.harvard.edu

Data analysis and interpretation remain major logistical challenges when attempting to identify large numbers of protein phosphorylation sites by nanoscale reverse-phase liquid chromatography/tandem mass spectrometry (LC-MS/MS) (Supplementary Figure 1 online). In this report we address challenges that are often only addressable by laborious manual validation, including data set error, data set sensitivity and phosphorylation site localization. We provide a large-scale phosphorylation data set with a measured error rate as determined by the target-decoy approach, we demonstrate an approach to maximize data set sensitivity by efficiently distracting incorrect peptide spectral matches (PSMs), and we present a probability-based score, the Ascore, that measures the probability of correct phosphorylation site localization based on the presence and intensity of site-determining ions in MS/MS spectra. We applied our methods in a fully automated fashion to nocodazole-arrested HeLa cell lysate where we identified 1,761 nonredundant phosphorylation sites from 491 proteins with a peptide false-positive rate of 1.3%.

Large-scale experiments focused on the identification of protein phosphorylation sites by LC-MS/MS1, 2, 3, 4, 5, 6, 7, 8, 9, 10 (Supplementary Fig. 1 online) rely heavily on manual validation to control for error within a data set. However, this has become more and more impractical as data sets have grown progressively larger. As an alternative, searching against a composite target/decoy database containing all protein sequences in both forward and reverse orientations (Fig. 1a)11, 12, 13 provides a simple and effective way to estimate the error rate of peptide spectral matches (PSMs). The target/decoy strategy is based on the principle that incorrect matches have an equal probability of being derived from either the target or the decoy database. To test the suitability of this approach as it applies to more complex human phosphorylation data sets, we evaluated the likelihood of choosing an incorrect PSM from both the target and decoy orientations of the composite database. This was achieved by examining the Sequest output for positions 1–10 for all spectra from a single LC-MS/MS run searched with a variable modification for phosphorylation (for clarity only one phosphopeptide species was allowed to exist per spectrum). Position 1 was highly enriched for target hits, whereas positions 2–10 exhibited an equal percentage of target and decoy hits (Fig. 1b). This was to be expected because correct PSMs, which are exclusively derived from the target database, should be found with the greatest frequency in the top-ranking position (Fig. 1b). In contrast, positions 2–10 should have an equal number of target and decoy hits because incorrect PSMs have an equal chance of being derived from either the target or decoy database (Fig. 1b). To further test this principle, we created two data sets of entirely falsified spectra based on real data. The first data set was created by modifying each MS/MS spectrum by adding 10 m/z units to every peak in every spectrum. 
The second data set was created by increasing the precursor ion mass of each MS/MS spectrum by 100 p.p.m. As expected, the Sequest output for both data sets resulted in an equal number of target and decoy hits at all (1–10) positions (Fig. 1c,d), further supporting the principle that incorrect spectra have an equal chance of being derived from either the target or decoy database. If one uses this search strategy, it should be possible to establish a data set with a low false-positive rate (for example, 1%) through post-search filtering with easily accessible criteria (for example, tryptic state, mass deviation, Sequest XCorr, etc.).
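The target/decoy bookkeeping described above can be sketched in a few lines. This is a minimal illustration, not the authors' software: the `REV_` name prefix, the class name and its fields are hypothetical, and the two rates shown are one reading of how decoy counts translate into error estimates (each decoy hit is assumed to imply one hidden incorrect target hit). The counts used in the example are the band A values quoted in the Figure 2 legend.

```python
from dataclasses import dataclass


def reverse_decoy(name: str, sequence: str) -> tuple[str, str]:
    """Return a decoy entry: the same protein read in the reverse orientation.

    The "REV_" prefix is a hypothetical naming convention for this sketch.
    """
    return ("REV_" + name, sequence[::-1])


@dataclass
class FalsePositiveEstimate:
    total_hits: int   # target + decoy hits that survived filtering
    decoy_hits: int   # hits to reversed sequences in that list

    @property
    def estimated_false_positives(self) -> int:
        # Incorrect matches are assumed to land on target and decoy sequences
        # equally often, so each decoy hit implies one incorrect target hit.
        return 2 * self.decoy_hits

    @property
    def rate_in_full_list(self) -> float:
        return self.estimated_false_positives / self.total_hits

    @property
    def rate_after_removing_decoys(self) -> float:
        # Decoy hits are known errors and can be dropped; the remaining
        # target list is expected to hold about as many errors again.
        target_hits = self.total_hits - self.decoy_hits
        return self.decoy_hits / target_hits


# Band A counts from the Figure 2 legend: 766 peptides, 8 decoy hits.
band_a = FalsePositiveEstimate(total_hits=766, decoy_hits=8)
print(band_a.estimated_false_positives)
print(f"{100 * band_a.rate_after_removing_decoys:.2f}% after removing decoys")
```

Reporting the rate on the target-only list reflects the practice, described later in the text, of removing decoy identifications before reporting the final data set.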

Figure 1. Composite target/decoy database searching strategy provides an accurate estimate of false-positive rates for large data sets by knowingly distracting fifty percent of the error.

(a) A composite database composed of normal (target) and reversed (decoy) protein sequences is created for searching MS/MS spectra. By definition, 100% of correctly assigned spectra should be derived from the forward database, whereas incorrectly assigned (random) spectra should have an equal chance of being derived from either the forward or reversed database. Filtering of the entire data set to enrich for correct matches provides a final list in which the false-positive rate (and many other parameters16) can be estimated based on the number of reversed hits. (b) Example of Sequest assignments for a single LC-MS/MS phosphorylation analysis (4,909 MS/MS). The number of forward and reversed assignments from the top 10 Sequest hits for each spectrum is shown (for clarity only one phosphopeptide species was allowed per spectrum). To demonstrate the principle of decoy distraction, all matches are shown and no filtering of any kind was performed. First place hits were greatly enriched in favor of the forward database whereas positions 2–10 (incorrect assignments) represented an equal number of forward and reverse hits. (c) Same analysis as in panel b, but each peak in every MS/MS spectrum was shifted by 10 m/z units to create a falsified data set. This resulted in an even splitting of forward and reversed matches at all hit numbers. (d) Same analysis as in b, but every precursor ion mass for every MS/MS spectrum was shifted by 100 p.p.m. to create a falsified data set. This also resulted in an even splitting of forward and reversed matches at all hit numbers. These experiments validate the use of a reversed-sequence database as a suitable decoy database.



A high degree of data set precision is paramount in proteomics experiments, but it should be achieved with a minimal loss of sensitivity. Reaching the desired precision for an experiment is accomplished through stepwise filtering aimed at reducing the frequency of incorrect PSMs. However, maximizing sensitivity is particularly problematic in large-scale phosphorylation experiments because certain filtering strategies aimed at reducing false positives are less effective relative to standard proteomics experiments. For example, phosphopeptide Sequest XCorr scores are often suppressed and score similarities are often observed when sequencing phosphopeptides by tandem mass spectrometry14. To address these limitations, one must define an effective search space to help distinguish correct PSMs from incorrect PSMs using alternative filtering criteria including mass accuracy and tryptic state.

In this experiment, nocodazole-arrested HeLa cells were lysed, and protein was separated by SDS-PAGE. A total of six gel regions were excised and digested with trypsin. Phosphopeptides were enriched by strong cation exchange (SCX) chromatography with fraction collection where a difference in solution charge states causes most phosphopeptides to elute early in the gradient2 (Fig. 2a). Four fractions representing the initial one-third of the gradient were analyzed by LC-MS/MS for each gel region (24 samples in total for the six bands) (Fig. 2a). The four LC-MS/MS analyses from each gel region were pooled (Fig. 2b) and searched against the target/decoy database. To evaluate a useful search space, we performed several different searches to determine the effect mass tolerance and enzyme specificity have on the distribution of incorrect PSMs. We searched the data set from band A (~20,000 spectra) with 10, 50 and 100 p.p.m. precursor ion mass tolerances and no enzyme specificity, partially tryptic enzyme specificity and fully tryptic enzyme specificity (Supplementary Fig. 2). Searches against the data set from band A with smaller mass tolerances (for example, 10 p.p.m.) made it slightly more difficult to distinguish correct from incorrect PSMs when compared to larger mass tolerances filtered by mass deviation. Increasing the search tolerance to 50 p.p.m., however, allowed incorrect PSMs to distribute over a mass deviation of ±50 p.p.m. whereas correct PSMs distributed in accordance with the mass accuracy obtained in this experiment (8 p.p.m. window). Likewise, relaxing enzyme constraints to partial-tryptic specificity had a similar effect, distributing incorrect PSMs between different partially and fully tryptic states. For the band A data set, the search parameters that resulted in the greatest number of correctly assigned PSMs were ±50 p.p.m. with partial enzyme specificity.
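Mass deviation in p.p.m., used above both as a search tolerance and as a post-search filter, follows the standard relative-error formula. The sketch below assumes that definition; the function names and the example masses are illustrative only.

```python
def mass_deviation_ppm(observed_mass: float, theoretical_mass: float) -> float:
    """Signed deviation of the observed mass from the theoretical mass, in p.p.m."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def within_tolerance(observed_mass: float, theoretical_mass: float,
                     tolerance_ppm: float = 50.0) -> bool:
    """True if the candidate peptide falls inside a +/- tolerance_ppm search window."""
    return abs(mass_deviation_ppm(observed_mass, theoretical_mass)) <= tolerance_ppm


# A 0.01 Da error on a 2,000 Da peptide is a deviation of about 5 p.p.m.
print(round(mass_deviation_ppm(2000.01, 2000.0), 3))
# The same match is accepted by a 50 p.p.m. search; a 0.2 Da error is not.
print(within_tolerance(2000.01, 2000.0), within_tolerance(2000.2, 2000.0))
```

Searching with a wide window and filtering on the measured deviation afterwards lets incorrect matches spread across the whole ±50 p.p.m. range while correct matches stay clustered.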

Figure 2. Establishing a low false-positive rate for large-scale phosphorylation data sets.

(a) Four milligrams of nocodazole-arrested HeLa cell lysate were separated by SDS-PAGE, excised into six regions (A–F), digested with trypsin and enriched for phosphopeptides by strong cation exchange (SCX) chromatography. Early-eluting fractions were analyzed by LC-MS/MS and searched for phosphorylation against a composite target/decoy database using Sequest with a 50 p.p.m. precursor-ion tolerance and partially tryptic enzyme specificity as shown. (b) Examples of base-peak chromatograms for all four early-eluting SCX fractions from gel region A. This represented one-sixth of the entire experiment as six regions were analyzed. Approximately 20,000 MS/MS spectra were collected in these four 1-h analyses. (c) For the four analyses of Band A, the composite target/decoy searching strategy allowed the determination of a false-positive rate during filtering of the entire data set with specific criteria. Applying powerful constraints such as mass deviation (p.p.m.) and enzyme specificity (tryptic termini) individually removed >90% of false-positive identifications and combined provided a low false-positive rate. Only nominal further filtering with an XCorr of >1.4 and a solution charge state ≤1 was required to achieve a 1% false-positive rate. The final list (766 peptides) contained an estimated 16 false positives (8 decoy and 8 target hits). (d) Effect of mass deviation as a filter for removing false-positive identifications. All matched tryptic phosphopeptides are shown from Band A (see Fig. 2c, enzyme only column). Correct identifications distribute within an 8 p.p.m. window and an XCorr > 1.4 (boxed). False-positive identifications distribute evenly throughout the entire 50 p.p.m. window.



The process of filtering data from band A searched with ±50 p.p.m. and partial enzyme specificity is illustrated in Figure 2c. Simply requiring a phosphorylation site on a PSM reduced the initial data set by 6,000, but still resulted in an error rate >90%. Powerful individual constraints such as mass deviation (requiring an 8 p.p.m. window) or requiring fully tryptic peptides each resulted in massive reductions in the data set size of nearly 90%, but error rates remained high at ~50%. Combining mass deviation and fully tryptic filters, however, resulted in a substantial increase in precision without compromising sensitivity. To achieve a final data set of 750 phosphopeptides from band A required only nominal additional filtering using an XCorr >1.4 and requiring a solution charge state of ≤1. As an illustration of the use of mass deviation to remove false-positive identifications, Figure 2d shows the distribution of all target and decoy hits for all spectra matching phosphopeptides from Band A that contained fully tryptic termini. There were 626 predicted false-positive and 836 predicted true positive (TP) identifications (1,462 total matches) at this point in filtering the data set (see Fig. 2c, enzyme only column). Even though the searches were performed at a precursor tolerance of 50 p.p.m., nearly all correct PSMs were found within an 8 p.p.m. window and an XCorr > 1.4. A similar filtering process was followed for each gel region in this experiment to maximize precision and sensitivity (Supplementary Table 1).
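The stepwise filtering described above can be expressed as a single predicate applied to each PSM. The sketch below is illustrative only: the `PSM` record and its field names are hypothetical, the example peptides are invented, and the "8 p.p.m. window" is interpreted here as a window of total width 8 p.p.m. around an adjustable center, which is an assumption rather than something the text specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    peptide: str
    is_decoy: bool             # matched a reversed (decoy) sequence
    has_phosphosite: bool
    tryptic_termini: int       # 0, 1 or 2 termini consistent with trypsin
    mass_deviation_ppm: float
    xcorr: float
    solution_charge: int


def passes_filters(psm: PSM, ppm_center: float = 0.0, ppm_window: float = 8.0,
                   min_xcorr: float = 1.4, max_solution_charge: int = 1) -> bool:
    """Apply the phosphosite, tryptic, mass deviation, XCorr and charge filters."""
    half_window = ppm_window / 2  # assumption: 8 p.p.m. is the full width
    return (
        psm.has_phosphosite
        and psm.tryptic_termini == 2
        and abs(psm.mass_deviation_ppm - ppm_center) <= half_window
        and psm.xcorr > min_xcorr
        and psm.solution_charge <= max_solution_charge
    )


# Invented PSMs, for illustration only.
psms = [
    PSM("AAsPK", False, True, 2, 1.5, 2.1, 1),   # passes every filter
    PSM("KPsAA", True, True, 2, -35.0, 1.6, 1),  # decoy, outside the mass window
    PSM("GGtPR", False, True, 1, 0.5, 2.5, 1),   # only partially tryptic
    PSM("RPtGG", True, True, 2, 2.0, 1.5, 1),    # decoy that survives filtering
]
kept = [p for p in psms if passes_filters(p)]
decoys_kept = sum(p.is_decoy for p in kept)
print(len(kept), decoys_kept)
```

Counting the decoys that survive each filter is what allows the error rate to be tracked at every stage, as in Figure 2c.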

Precise phosphorylation site localization can be difficult when multiple serine, threonine and tyrosine residues exist within a single peptide (Fig. 3a). For ambiguity between potential phosphorylation sites to be resolved, fragment ions exclusive to a specific site location must be identified to uniquely assign a site to a specific residue. We refer to these specific fragment ions as 'site-determining ions.' We have developed an automated approach to identify phosphorylation site location by (i) determining the most likely phosphorylation site candidates (Fig. 3b), and (ii) calculating the probability of correct phosphorylation site location based only on the likelihood of identifying site-determining ions compared to random chance (Fig. 3c).
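For two candidate sites on the same peptide, the site-determining ions are the fragments that contain exactly one of the two candidate residues, because only those fragments differ in mass between the two assignments. The sketch below enumerates them for singly charged b- and y-type ions; the function name and the 1-based position convention are assumptions made for illustration.

```python
def site_determining_ions(peptide_length: int, site_a: int,
                          site_b: int) -> list[tuple[str, int]]:
    """List the b- and y-type ions that distinguish two candidate sites.

    Positions are 1-based residue indices counted from the N terminus.
    b_i spans residues 1..i; y_j spans the last j residues.
    """
    if site_a == site_b:
        raise ValueError("the two candidate sites must differ")
    if not (1 <= site_a <= peptide_length and 1 <= site_b <= peptide_length):
        raise ValueError("sites must lie within the peptide")
    first, second = sorted((site_a, site_b))
    # b ions that include the first site but stop before the second.
    b_ions = [("b", i) for i in range(first, second)]
    # y ions that include the second site but stop before the first.
    y_ions = [("y", j) for j in range(peptide_length - second + 1,
                                      peptide_length - first + 1)]
    return b_ions + y_ions


# Candidate sites three residues apart yield six site-determining ions.
print(site_determining_ions(10, 3, 6))
```

Candidate sites that are adjacent in sequence leave only one b ion and one y ion to separate them, which is why such cases are the hardest to localize.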

Figure 3. Resolving ambiguity in phosphorylation site localization.

(a) Peptides containing multiple serine, threonine and/or tyrosine residues should be evaluated for precise site assignment. This phosphopeptide is from Zinc finger protein 638. (b) General scheme for calculating a probability-based ion matching score (Peptide Score) for each potential phosphorylation site. The tandem mass (MS/MS) spectrum for the phosphopeptide from panel a is shown. The spectrum was separated into 100 m/z windows where the top N most-intense peaks per window were matched to predicted b- and y-type ions for each possibility. This was repeated using from 1 to 10 ions in each window (6 is shown). The cumulative binomial probability P was calculated using the number of trials (all b- and y-type ions) and the number of successes (matched ions) for each possibility and plotted as -10 times log (P) vs peak depth (peaks per 100 m/z). The peptide corresponding to the red line matched more ions at every peak depth than any other possibility. The actual ambiguity score (Ascore) for this peptide is calculated using only site-determining ions as shown in Figure 3c using information from this plot. (c) The Ascore is a probability-based metric that measures the likelihood that a difference in site-determining ions between two site positions was matched by random chance. In this example, only six b- or y-type ions could potentially differentiate the two phosphorylation sites. A peak depth of six was determined from Figure 3b as the earliest maximal difference in the number of matched ions. The cumulative binomial probability was applied as in Figure 3b but using only the site-determining subset of ions. An Ascore of 53.57 would represent a probability of less than 1 in 200,000 of matching a difference of at least 5 ions in 6 trials by random chance. Any of these 5 ions, if not due to chance, can differentiate between the two potential sites.



Figure 3a shows a candidate phosphopeptide containing multiple possibilities for phosphorylation site location. The MS/MS spectrum for this peptide was first separated into windows of 100 m/z units. Within each window, only the top i peaks were retained by intensity, where i represented the peak depth. Predicted b- and y-type ions for each possibility were then overlaid with the processed spectrum. The cumulative binomial probability P was calculated using the number of trials N, the number of successes n, and the probability of success p as follows:

P = \sum_{k=n}^{N} \binom{N}{k} p^{k} (1 - p)^{N - k}

where P represents the probability of randomly matching at least the given number of fragment ions to the MS/MS spectrum. The total number of trials (N) equaled the total number of fragment ions for the given peptide. The total number of successes (n) equaled the number of ions matched to the spectrum. Within a given window, the probability of matching a peak (p) was equal to i/100. For example, where i = 1, p = 0.01 with an ion tolerance of ±0.5 m/z. A human-readable score was calculated by multiplying log(P) by -10. This entire process was repeated for i + 1, while i ≤ 10. Scores were then plotted as shown in Figure 3b for each possible phosphorylation site. A weighted average of all ten scores is called the peptide score.
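The per-depth score described above can be sketched directly from the definitions of N, n and p. This is a minimal illustration and not the in-house software: the function names are hypothetical, the logarithm is taken as base 10 (consistent with a score of 20 corresponding to P = 0.01), and the weights of the final weighted average are not given in the text, so the sketch stops at the individual scores.

```python
from math import comb, log10


def cumulative_binomial(trials: int, successes: int, p: float) -> float:
    """Probability of at least `successes` random matches in `trials` attempts."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def depth_score(trials: int, successes: int, peak_depth: int,
                window_width: float = 100.0) -> float:
    """Score at one peak depth: -10 * log10(P), with p = peak_depth / window_width."""
    p = peak_depth / window_width
    return -10 * log10(cumulative_binomial(trials, successes, p))


def scores_by_depth(trials: int, matches_per_depth: dict[int, int]) -> dict[int, float]:
    """Score every peak depth given the number of matched ions at that depth."""
    return {depth: depth_score(trials, matched, depth)
            for depth, matched in matches_per_depth.items()}


# Matching 5 of 6 ions at peak depth 6 (p = 0.06) is rarer than 1 in 200,000.
probability = cumulative_binomial(6, 5, 0.06)
print(probability < 1 / 200_000, round(depth_score(6, 5, 6), 1))
```

The score grows with the number of matched ions and shrinks as the peak depth increases, because retaining more peaks per window makes a random match more likely.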

For precise phosphorylation site assignment, the cumulative binomial probability of identifying site-determining b- and y-type ions was calculated for the two highest-scoring site locations (Fig. 3c). The process was applied at the earliest peak depth that represented the maximum difference between the two highest-scoring site locations determined by using the Peptide Score as described above. In the example shown in Figure 3b, the earliest maximal peak depth was 6. The cumulative binomial probability for matching only the site-determining ions was calculated using the same method outlined above with one exception: the total number of trials N was equal to the total number of site-determining ions. The probabilities for the top two candidates were converted into human readable scores and subtracted from each other. The resultant score is a metric that measures the likelihood of matching a difference of at least the number of matched site-determining ions by chance from the top two candidate sites and has been termed the ambiguity score (Ascore). An Ascore of 20 (P = 0.01) should result in the site being localized with 99% certainty.
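The Ascore calculation described above can be sketched as the difference of two site-specific scores at the chosen peak depth. The sketch is illustrative only: the function names are hypothetical, the per-depth differences in the example are invented, and the second-ranked candidate is assumed to match none of the site-determining ions.

```python
from math import comb, log10


def cumulative_binomial(trials: int, successes: int, p: float) -> float:
    """Probability of at least `successes` random matches in `trials` attempts."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def earliest_max_depth(difference_by_depth: dict[int, float]) -> int:
    """Smallest peak depth at which the difference between candidates is maximal."""
    best = max(difference_by_depth.values())
    return min(d for d, v in difference_by_depth.items() if v == best)


def site_score(n_site_ions: int, n_matched: int, peak_depth: int) -> float:
    """-10 * log10(P) using only site-determining ions as trials."""
    p = peak_depth / 100
    return -10 * log10(cumulative_binomial(n_site_ions, n_matched, p))


def ascore(n_site_ions: int, matched_best: int, matched_second: int,
           peak_depth: int) -> float:
    """Difference between the scores of the two highest-ranking candidate sites."""
    return (site_score(n_site_ions, matched_best, peak_depth)
            - site_score(n_site_ions, matched_second, peak_depth))


# Invented per-depth differences; the maximum is first reached at depth 6.
depth = earliest_max_depth({4: 3, 5: 4, 6: 5, 7: 5})
# Six site-determining ions, five matched by the best site, none by the other.
print(depth, round(ascore(6, 5, 0, depth), 1))
```

When both candidates match the same number of site-determining ions the difference is zero, which corresponds to a peptide with no usable localization evidence.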

To validate the Ascore, we analyzed six data sets (>3,500 MS/MS spectra) from phosphopeptides with known phosphorylation site locations. Each data set was required to contain more than one possible site of phosphorylation, with an average of 3.6 phosphorylatable residues per peptide (Supplementary Fig. 3). These data were generated from three separate phosphopeptide libraries, three antibody immunoprecipitation experiments, and a previously published data set4. For these data sets, neither Sequest nor Mascot was able to localize 100% of the sites correctly, and neither algorithm provided evidence to suggest which peptides were not correctly localized, demonstrating the need for additional scoring criteria. Sequest success rates ranged from 75% to 98% (Fig. 4a), and Mascot success rates ranged from 75% to 99% (Fig. 4a). We then evaluated the Ascore method and compared the results to Sequest and Mascot. We found that at every degree of precision for correct phosphorylation site localization, the Ascore provided an increase in sensitivity over Sequest (Fig. 4b) and Mascot (Fig. 4c).

Figure 4. Sequest and Mascot can fail to provide proper phosphorylation site placement.

(a) Six data sets containing more than 3,500 known localized sites were examined for correct site localization by Sequest, Mascot and the Ascore. Sites were known based on using antibody phosphopeptide immunoprecipitations with known motifs, by analyzing synthetic phosphopeptide libraries, or by using previously published data. The number of phosphopeptides with the correct site localization varied between 75% and 98% for Sequest no. 1 hits and 75–99% for Mascot no. 1 hits. Greater than 99.8% of peptides with Ascore ≥19 were localized correctly irrespective of the data set. pS, phospho-serine; pY, phospho-tyrosine; and pT, phospho-threonine. (b) Precision and sensitivity curve comparison of Sequest dCn (red), XCorr (green) and Ascore (blue) for all combined data sets of known phosphorylation sites. At 99% precision, the Ascore confidently assigned phosphorylation sites at a twofold higher sensitivity than Sequest dCn, and was more sensitive than Sequest scoring at every precision interval. (c) Precision and sensitivity curve comparison of Mascot delta-ions score (red), ions score (green) and Ascore (blue) for all combined data sets of known phosphorylation sites. At 99% precision, the Ascore confidently assigned phosphorylation sites at a fourfold higher sensitivity than Mascot delta-ions score, and was more sensitive than Mascot scoring at every precision interval. (d) Distribution of Ascore values for all nonredundant phosphopeptides identified in nocodazole-arrested HeLa cell lysate.



We next attempted to reach >99% certainty with Sequest or Mascot by taking the first place selection and applying additional filtering criteria to the combined data sets of known phosphorylation sites. To achieve >99% certainty with Sequest required the use of a strong delta correlation filter (≥0.15). At the same confidence level, the Ascore method showed a substantial improvement, localizing twofold more phosphopeptides (Fig. 4b and Supplementary Fig. 4). Furthermore, the Sequest delta correlation filter failed to clearly define those phosphopeptides that were improperly localized or lacked proper localization information. As expected, we also found that XCorr did not have a substantial effect on phosphorylation site localization (Supplementary Fig. 4). For Mascot, we found that there was no defined metric that could reach >99% certainty for phosphorylation site placement. As an alternative approach, we created a normalized delta-ions score13 by taking the difference in the ions score for the top two ranking peptides and dividing that difference by the first ranking peptide's ions score. To reach >99% certainty required a delta-ions score of ≥0.4. At the same confidence level, the Ascore method showed a substantial improvement, localizing 4.1-fold more phosphopeptides (Fig. 4c and Supplementary Fig. 5).
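The normalized delta-ions score defined above is a simple ratio, sketched here with hypothetical function names and invented ions scores. The 0.4 cutoff is the threshold reported in the text for >99% certainty.

```python
def normalized_delta_ions(top_ions_score: float, second_ions_score: float) -> float:
    """Difference between the top two ions scores, divided by the top score."""
    if top_ions_score <= 0:
        raise ValueError("the top-ranking ions score must be positive")
    return (top_ions_score - second_ions_score) / top_ions_score


def passes_delta_ions(top_ions_score: float, second_ions_score: float,
                      threshold: float = 0.4) -> bool:
    """True if the normalized delta-ions score meets the reported cutoff."""
    return normalized_delta_ions(top_ions_score, second_ions_score) >= threshold


# Invented scores: a clear winner passes, a near tie does not.
print(normalized_delta_ions(60.0, 30.0), passes_delta_ions(60.0, 30.0))
print(passes_delta_ions(60.0, 55.0))
```

Normalizing by the top score makes the cutoff comparable between weak and strong spectra, but the measure still reflects all fragment ions rather than only those that discriminate between sites.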

To illustrate the composition of the amino acids in each of the known data sets, we used sequence logos (http://weblogo.berkeley.edu/), with the phosphorylated residue shown in orange, and the frequency of the different amino acids flanking the site of phosphorylation directly proportional to its height (Supplementary Fig. 3a). In addition, the percentage of phosphorylatable residues flanking the site of phosphorylation is also shown (Supplementary Fig. 3b).

Unlike any other sequencing algorithm, calculating an Ascore for each phosphorylation site and evaluating several different threshold values made it possible to accurately characterize every phosphorylation site in any data set in terms of phosphorylation site localization and certainty of site assignment. Importantly and as predicted, phosphopeptides with an Ascore ≥19 always produced >99% certainty for correct phosphorylation site localization regardless of the data set (Fig. 4) and represented a substantial improvement over both Sequest and Mascot at the same confidence level (Fig. 4b,c). Furthermore, phosphopeptides with Ascores of 15–19 achieved >90% success rate for correct phosphorylation site localization (Fig. 4a). Those phosphopeptides with Ascores between 3 and 15 had a success rate near 80%, but lacked sufficient site-determining ions to unequivocally assign the phosphorylation site. Lastly, those phosphopeptides with Ascores <3 contained little or no site-determining information necessary for proper phosphorylation site placement.

We next applied the methods described here to the analysis of protein phosphorylation from nocodazole-arrested HeLa cell lysate. A total of 126,162 MS/MS spectra were collected over a 24-h period from 24 samples. In addition, MS/MS/MS (MS3) spectra were collected (1,645) when a neutral loss of 49 m/z was observed within the two most intense peaks4. The 24 samples were combined and filtered (XCorr, solution charge, p.p.m., tryptic state) by gel region (Supplementary Table 1), resulting in the identification of 3,748 phosphorylation sites from 2,836 phosphopeptides. One-fifth (21%) of all phosphorylation sites (800) were identified with the aid of MS3 spectra. Ascore values were generated for each phosphopeptide in an automated fashion using in-house software. A total of 1,079 phosphorylation sites had Ascore values ≥19, 148 phosphorylation sites had values of 15–19, 407 phosphorylation sites had values of 3–15 and 127 phosphorylation sites had values <3 (Fig. 4d). After accounting for redundancy and removing known false positives (decoy identifications), a total of 1,761 nonredundant phosphorylation sites were identified from 1,289 peptides, which translated into a total of 491 proteins with an estimated false-positive rate of 1.3% at the peptide level (Table 1). Because 82% of the nonredundant phosphorylation sites found from MS3 spectra were also found from MS/MS spectra, 8.7% of all nonredundant sites were contributed solely by MS3 spectra (154 sites). Most phosphopeptides (87%) were derived from the first three (highest molecular weight) gel regions.
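The Ascore categories used in the text can be written as a small classifier, and the reported counts can be checked against the nonredundant total. The bin labels and the treatment of values falling exactly on 15 or 3 are assumptions of this sketch; the counts are those reported above.

```python
def ascore_bin(ascore: float) -> str:
    """Assign an Ascore to one of the four categories used in the text."""
    if ascore >= 19:
        return ">=19"
    if ascore >= 15:   # assumption: a value of exactly 15 joins the 15-19 bin
        return "15-19"
    if ascore >= 3:    # assumption: a value of exactly 3 joins the 3-15 bin
        return "3-15"
    return "<3"


# Site counts per category as reported for the HeLa data set.
reported_counts = {">=19": 1079, "15-19": 148, "3-15": 407, "<3": 127}
total_sites = sum(reported_counts.values())

# Sites found only with the aid of MS3 spectra, as a share of all sites.
ms3_only_sites = 154
ms3_only_percent = round(100 * ms3_only_sites / total_sites, 1)

print(total_sites, ms3_only_percent)
print(ascore_bin(53.57), ascore_bin(16.2), ascore_bin(7.0), ascore_bin(1.1))
```

The four categories sum to the 1,761 nonredundant sites, so every identified site is assigned a localization category.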

Table 1. Summary of all identified phosphorylation events in nocodazole-arrested HeLa cell lysate

To characterize the phosphorylation sites identified in this experiment, we used a recently described bioinformatics approach to define possible phosphorylation motifs in our data set15 (Supplementary Fig. 6). Three-fourths of the confidently localized sites contained a proline-directed phosphorylation motif. Of the 491 proteins identified, 362 were found to function in known biological processes as determined by the gene ontology program PANTHER (http://www.pantherdb.org/) (Supplementary Fig. 7). PhosphoSite (http://www.phosphosite.org/) was used to distinguish known from novel phosphorylation sites. Some examples of known phosphorylation sites included T14 and Y15 from CDK2, S377 from MAP-kinase family member JNK1, S373 and S377 from APC1, S260 from APCCDC16, S582 from APCCDC23 and S446 from APCCDC27. In addition, we found numerous novel phosphorylation sites, for example, T187, T190 and S192 or S195 from WEE1; S220 from PITSLRE (human), a protein kinase of the CDK family; S507 from MLK3 (MAPKKK11); T321 from Casein Kinase I alpha; S248 from the C terminus of 14-3-3 sigma; and T1041 or S1042 in the kinase domain of BUB1 beta.

Balancing precision and sensitivity is often a major consideration in large-scale proteomics experiments16. Moreover, large-scale post-translational studies on phosphorylation present a unique set of challenges. Identified phosphoproteins are often derived from single-peptide hits, and common filtering criteria such as Sequest XCorr and dCn scores become less effective because of score suppression and score similarities associated with sequencing phosphopeptides by tandem mass spectrometry. These shortcomings make laborious manual interpretation necessary to improve data set integrity and to maximize sensitivity. In this report, we have addressed both issues with computational tools. The target/decoy database searching approach provided an estimate of the false-positive rate even with complex human phosphorylation data. Furthermore, the issue of sensitivity was addressed with the use of a carefully determined search space to take advantage of available filter criteria. Proper search space selection in combination with the availability of high mass-accuracy data made it possible to distinguish correct from incorrect PSMs with only modest XCorr filters and no dCn filters, helping to rescue many low scoring but correct PSMs while maintaining an acceptable error rate. Although it is possible that some correct PSMs may be discarded using this strategy, the increased sensitivity compensated for any loss (Supplementary Fig. 2).

In addition to precision and sensitivity within large-scale phosphorylation data sets, proper phosphorylation site location is also critical because many biological processes are regulated through the phosphorylation of specific residues17, 18, 19. To improve and automate proper phosphorylation site placement, we have developed a probability-based approach to predict the likelihood of matching site-determining ions to specific phosphorylation site locations. In contrast to sequencing algorithms, the Ascore method has the distinct advantage of calculating a localization-specific probability for every phosphorylation site within a data set. Upon analysis, the user has information on precisely which peptides were localized, and at what degree of certainty. Users are also provided with information on which peptides were not localized because of insufficient site-determining ions. In addition, the Ascore method provides a sensitive approach to resolve proper phosphorylation site placement because it measures the differences in site placement at the level of the site-determining ions and not at the peptide level. As a result, the Ascore method provides a more thorough and direct analysis of phosphorylation site placement than more general sequencing algorithms. For example, the Ascore method was able to confidently localize (99% certainty) two- to fourfold more phosphorylation sites in a data set of known phosphorylation sites than Sequest or Mascot scoring (see Fig. 4 and Supplementary Figs. 4,5). Finally, the Ascore method provides a strategy for phosphorylation site assignment that is entirely statistically based, free from potential inconsistencies that can result from manual validation.

To evaluate the methods described here we analyzed protein from nocodazole-arrested HeLa cells which had been enriched for phosphorylation by SCX chromatography before LC-MS/MS analysis. Using the target/decoy approach and an effective search space, we were able to generate a data set with 1,761 nonredundant phosphorylation sites, of which 62% were localized with >99% certainty without the need to manually interpret a single spectrum. We have provided a list of all of the identified phosphopeptides complete with Ascore and MS/MS and/or MS3 spectra available in Supplementary Table 2. Finally, users can access the Ascore for their own data by uploading MS/MS spectra (http://Ascore.med.harvard.edu).

Methods
HeLa cell lysate preparation, SDS-PAGE and in-gel proteolysis.
Protein (4 mg) from nocodazole-arrested HeLa cell lysate19 was separated by a preparative 10% SDS-PAGE gel. The gel was stopped when the buffer front reached 5 cm and lightly stained with Coomassie blue. The entire gel was cut into six regions (Supplementary Fig. 8), diced into small pieces and placed in 15-ml Falcon tubes. In-gel digestion with trypsin proceeded as described4. Extracts were completely dried in a speed-vac and stored at -20 °C.

Strong cation exchange (SCX) chromatography.
Extracted peptides were separated by SCX as previously described3, 4 using a Polysulfoethyl Aspartamide (5-μm beads, 200 Å) column (3 × 200 mm) from PolyLC. Early-eluting SCX fractions from minutes 2–18 for each gel region were collected, combined into four samples, desalted offline using C18 SPE columns (Vydac) and completely dried.

Mass spectrometry.
LC-MS/MS was performed using the LTQ FT hybrid linear ion trap–7-Tesla Fourier transform ion cyclotron resonance (FT-ICR) mass spectrometer20, 21 (ThermoElectron). The 24 samples were loaded for 15 min using a Famos autosampler (LC Packings) onto a hand-poured fused silica capillary column (125 μm internal diameter × 18 cm) packed with Magic C18aQ resin (5 μm, 200 Å) using an Agilent 1100 series binary pump with an in-line flow splitter. Chromatography was developed using a binary gradient at 400 nl/min of 5–32% solvent B for 35 min (Solvent A, 0.25% formic acid (FA); Solvent B, 0.1% FA, 97% acetonitrile). Ten MS/MS spectra were acquired in a data-dependent fashion21 from a preceding Fourier transform mass spectrometry (FTMS) master spectrum (400–1,800 m/z at a resolution setting of 10^5) with an automatic gain control (AGC) target of 3 × 10^6. Charge-state screening was used to reject singly charged species, and a threshold of 1,500 counts was required to trigger an MS/MS spectrum. If a precursor loss of 49 m/z was observed within the two most intense peaks with a threshold of 700 counts within the MS/MS spectrum, an MS3 scan was triggered4. When possible, the LTQ and FT-ICR were operated in parallel processing mode.
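
The MS3 trigger rule described above can be sketched as follows. This is an illustration only, not the instrument's acquisition logic: the function name, the peak representation as (m/z, counts) pairs and the matching tolerance `tol` are assumptions, because the text does not state how closely the loss had to match. A 49 m/z loss is consistent with loss of phosphoric acid (97.9763 Da) from a doubly charged precursor.

```python
def triggers_ms3(precursor_mz, peaks, loss_mz=49.0, tol=0.5,
                 min_counts=700.0, top_n=2):
    """Return True if a neutral-loss peak would trigger an MS3 scan.

    peaks: list of (mz, counts) tuples from one MS/MS spectrum.
    tol:   hypothetical m/z matching tolerance (not given in the text).
    """
    # Only the two most intense peaks of the MS/MS spectrum are inspected.
    most_intense = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]
    target_mz = precursor_mz - loss_mz
    return any(
        abs(mz - target_mz) <= tol and counts >= min_counts
        for mz, counts in most_intense
    )
```

For example, a precursor at 700.0 m/z with a dominant fragment at 651.0 m/z above 700 counts would satisfy the rule, whereas the same fragment ranked third by intensity would not.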

Database searching and data processing.
To fully utilize high mass accuracy FTMS master spectra, we created in-house software to extract precursor charge-state and monoisotopic mass from isotopic envelope information. To enhance isotopic envelope accuracy, we determined the presence of the precursor ion from five FTMS spectra upstream and downstream of the master spectrum used for MS/MS. A weighted average of the consecutive spectra containing the precursor ion was then used to extract the charge-state and monoisotopic mass information for database searching.

MS/MS spectra were searched using the Sequest-Sorcerer algorithm (Sage-N-Research) against either a composite database containing the human IPI protein database (ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/; downloaded 12/23/2004, 60,245 proteins) and its reversed complement, or the human IPI protein database alone with a software option enabled that reversed peptides on-the-fly within Sequest-Sorcerer. We received early access to this latter option, which is included in the current Sequest-Sorcerer version. Unless otherwise stated, search parameters included partially tryptic specificity, a mass tolerance of ±50 p.p.m., a static modification of 57.0214 on cysteine and dynamic modifications of 79.9663 on serine, threonine and tyrosine, and 15.9949 on methionine. Mascot searches (Unix 2.1) were performed using the same databases, tolerances and modifications but were searched fully tryptic because partially tryptic searches were less sensitive16. High mass accuracy precursor ions for MS3 spectra were created by subtracting 97.9763 Da from the corresponding MS2 MH+ value. MS3 spectra were searched the same way on a 19-node Linux cluster running Sequest (version 27 rev 12) with an additional dynamic modification of -18.0106 on serine and threonine residues.
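
A minimal sketch of the bookkeeping behind such a search is given below. It is not the Sequest-Sorcerer implementation: the function names and the `REV_` decoy prefix are hypothetical, and the false-positive estimate (twice the decoy matches divided by all matches) is an assumed target/decoy convention that is not spelled out in this section. The mass values are those quoted above.

```python
PRECURSOR_TOL_PPM = 50.0        # +/- 50 p.p.m. precursor tolerance
PHOSPHORIC_ACID_LOSS = 97.9763  # Da, subtracted from MS2 MH+ for MS3


def build_composite_database(targets):
    """Append a reversed (decoy) copy of every protein sequence.

    targets: dict mapping protein name -> amino acid sequence.
    """
    composite = dict(targets)
    for name, sequence in targets.items():
        composite["REV_" + name] = sequence[::-1]  # hypothetical prefix
    return composite


def within_tolerance(observed_mass, theoretical_mass,
                     tol_ppm=PRECURSOR_TOL_PPM):
    """True if the precursor mass error is within the p.p.m. tolerance."""
    error_ppm = (observed_mass - theoretical_mass) / theoretical_mass * 1e6
    return abs(error_ppm) <= tol_ppm


def ms3_precursor_mh(ms2_mh):
    """High mass accuracy MS3 precursor derived from the MS2 MH+ value."""
    return ms2_mh - PHOSPHORIC_ACID_LOSS


def estimated_false_positive_rate(matched_protein_names):
    """Assumed convention: 2 x decoy matches / all matches."""
    if not matched_protein_names:
        return 0.0
    decoys = sum(1 for n in matched_protein_names if n.startswith("REV_"))
    return 2.0 * decoys / len(matched_protein_names)
```

Reversing whole protein sequences keeps amino acid composition and database size identical between target and decoy halves, which is what makes the decoy count a usable estimate of random matches.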

To take advantage of data generated by SCX chromatography, we used peptide solution charge as a filter. Solution charge can be defined as the sum of all charges on a peptide at pH 2.7 (ref. 4). Specific XCorr, p.p.m. and solution charge cutoffs were empirically determined for each gel region (a combination of four LC-MS/MS runs per gel region) to maximize the number of accepted PSMs, while maintaining a combined error rate of ≈1% for the entire data set. For low molecular weight gel regions (E and F), slightly higher false-positive rates were tolerated without dramatically affecting the overall false-positive rate. Ascores were calculated for each PSM in the data set in batch format using in-house software (Fig. 3).
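
As an illustration of the solution charge filter, the sketch below counts charges under a simplifying assumption that is not stated in this section: at pH 2.7 the N terminus and each Lys, Arg and His side chain carry +1, each phosphate group carries -1, and carboxyl groups are treated as neutral. The function name is hypothetical.

```python
def solution_charge(sequence, n_phospho=0):
    """Approximate net charge of a peptide at pH 2.7.

    Assumed model: +1 for the N terminus, +1 per K/R/H residue,
    -1 per phosphate group; acidic side chains and the C terminus
    are treated as protonated (neutral).
    """
    basic_residues = sum(sequence.count(residue) for residue in "KRH")
    return 1 + basic_residues - n_phospho
```

Under this model an unmodified tryptic peptide such as GSPAPAAAFEAK has a solution charge of 2, and its singly phosphorylated form has a solution charge of 1, which is why phosphopeptides are expected in the early-eluting SCX fractions.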

The 1,761 phosphorylation sites detected in this paper came from both localized (Ascore ≥ 19; 1,079 sites) and ambiguous (Ascore < 19; 682 sites) phosphopeptides. Ambiguous sites still came from confidently identified phosphopeptides, but site localization was at lower certainty levels. In these cases, we were careful never to allow an ambiguous site to count for more than one site regardless of the number of MS/MS spectra or potential site localizations for this peptide.

MS3 spectra were also collected and used when applicable. However, we found that the vast majority of phosphopeptides were successfully identified from the corresponding MS/MS spectra. As a result, in this experiment, MS3 PSMs mainly functioned to provide an added degree of confidence in sequence assignment. Surprisingly, MS3 spectra also provided very little additional information on phosphorylation site assignment when their Ascores were compared with those from the respective MS/MS spectra, probably because of reduced ion statistics in many MS3 spectra (data not shown).

Data sets of known phosphorylation sites.
A total of six data sets were created to evaluate the Ascore algorithm. Three synthetic tryptic phosphopeptide libraries were generated containing ≈2,000 phosphopeptides in each with the following sequences: GpSPXPXAXFEA(K/R) (lib1), GAPXPXpSXFEA(K/R) (lib2) and ADZZSpSTZZFEAK (lib3), where X was any of ADEFGLSTVY, Z was any of SDLFGHP and pS was phosphoserine. Each library was analyzed by the same mass spectrometry methods described above. Because of differences in coupling efficiency, the 2,000 peptides in each library were not equimolar. Approximately 50 pmol of each library was analyzed. Mascot and Sequest searches were done as described above by appending the sequences for all 2,000 phosphopeptides to the Escherichia coli database (4,334 proteins) and then reversing this database to create a composite target/decoy database. Database matches were filtered with criteria (for example, XCorr, Ion score, mass accuracy, tryptic state) to contain no decoy identifications. In total, more than 800 sites were identified and are included in Supplementary Table 3.
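
The library designs can be enumerated directly from the sequence templates, which is one way to generate the entries appended to the search database. The sketch below is an assumption about how such a list might be built, not the authors' procedure; the function name and the use of a lowercase `s` to mark phosphoserine are hypothetical conventions.

```python
from itertools import product

X_RESIDUES = "ADEFGLSTVY"  # residues allowed at X positions
Z_RESIDUES = "SDLFGHP"     # residues allowed at Z positions


def expand_library(template, wildcard, residues, c_termini=("",)):
    """Yield every sequence encoded by a degenerate template.

    template: sequence with `wildcard` at each degenerate position;
              lowercase 's' marks phosphoserine (hypothetical notation).
    c_termini: alternative C-terminal residues, e.g. ("K", "R").
    """
    n_slots = template.count(wildcard)
    for combo in product(residues, repeat=n_slots):
        choices = iter(combo)
        body = "".join(
            next(choices) if ch == wildcard else ch for ch in template
        )
        for c_term in c_termini:
            yield body + c_term


lib1 = list(expand_library("GsPXPXAXFEA", "X", X_RESIDUES, ("K", "R")))
lib2 = list(expand_library("GAPXPXsXFEA", "X", X_RESIDUES, ("K", "R")))
lib3 = list(expand_library("ADZZSsTZZFEAK", "Z", Z_RESIDUES))
```

Under this reading, lib1 and lib2 each contain 10 × 10 × 10 × 2 = 2,000 sequences, and lib3 contains 7^4 = 2,401 sequences.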

Three immunoprecipitation data sets were also generated using motif antibodies specific for either phosphotyrosine (pY), PXpSP or PXpTP, where X is any amino acid and pS and pT are phosphoserine and phosphothreonine. All antibodies and immunoprecipitations were provided by Cell Signaling Technology. For each analysis, 2 × 10^8 Jurkat cells were grown, treated with pervanadate and harvested as described8. The three immunoprecipitation samples were analyzed using the mass spectrometry methods described. High mass accuracy was not used for the PXpTP and PXpSP samples. For simplicity, the PXpTP and PXpSP data sets were combined. Database searches (Mascot and Sequest) were done against the same human target/decoy database and filtered with criteria (XCorr, Ion score, tryptic state) to contain no decoy identifications. In addition, every identification contained the minimum consensus sequence (XYX or PX[ST]P) for the antibody, although the site was not necessarily localized correctly. In total more than 600 pY sites and 200 PX[pSpT]P sites were identified and are included in Supplementary Table 4.

The final (6th) data set of known sites was taken from a previously published data set where manual validation was used4. Only PSMs from MS2 spectra were used where the site of phosphorylation was deemed correctly localized.

Importantly, all Sequest or Mascot results from all six data sets of known sites were (i) filtered with criteria to contain no reversed sequences, and (ii) filtered to assure that at least two phosphorylatable residues were present in every sequence to challenge the Ascore algorithm. For Sequest, the total number of identified phosphopeptides was 3,805 (pY 658, lib1 442, lib2 501, lib3 716, PX[pSpT]P 227, Beausoleil04 1,098 singly phosphorylated and 163 multiply phosphorylated). For Mascot, the total number of identified phosphopeptides was 3,927 (pY 778, lib1 470, lib2 544, lib3 858, PX[pSpT]P 175, Beausoleil04 997 singly phosphorylated and 105 multiply phosphorylated).

Ascore calculation.
The Ascore uses only site-determining fragment ions present in an MS/MS spectrum to create a probability-based score that a site is correctly localized. The score is based on random sampling and the cumulative binomial distribution. Because the Ascore is also a difference score, the two most likely phosphorylation sites must first be identified. Although this could be done by simply taking the first and second place matches by Sequest or Mascot, we implemented a Peptide Score to also determine the peak depth that should be used for the analyses (Fig. 3). In effect, two scores were created for every phosphopeptide (Peptide Score and Ascore). MS/MS spectra were preprocessed only to contain i peaks per 100 m/z units, where i was the peak depth and was incremented from 1 to 10. For example, a peak depth of 5 meant that only the five most intense peaks were retained within each 100 m/z window. In processing spectra, precursor ion-specific losses were removed (water and phosphoric acid) because they added no site information. In addition, only one peak per isotopic cluster was kept, and only one peak per one m/z unit was allowed. For matching fragment ions in both the Peptide Score and the Ascore, singly charged fragment ion m/z ratios were predicted (m/z < 2000). Additional charge states were also predicted if the precursor ion was triply charged (2+ predicted) or quadruply charged (2+ and 3+ predicted).
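
The peak-depth filtering step can be sketched as below. This is a simplified illustration rather than the in-house software: it assumes that the 100 m/z windows are aligned at multiples of 100 (the text does not say how windows were placed), it represents peaks as (m/z, intensity) pairs, and it omits the removal of precursor losses and isotopic clusters. The function name is hypothetical.

```python
from collections import defaultdict


def filter_peak_depth(peaks, depth, window=100.0):
    """Keep the `depth` most intense peaks in each m/z window.

    peaks: list of (mz, intensity) tuples.
    depth: peak depth i, incremented from 1 to 10 in the text.
    Windows are assumed to start at multiples of `window`.
    """
    by_window = defaultdict(list)
    for mz, intensity in peaks:
        by_window[int(mz // window)].append((mz, intensity))

    kept = []
    for members in by_window.values():
        # Rank peaks within the window by intensity, highest first.
        members.sort(key=lambda peak: peak[1], reverse=True)
        kept.extend(members[:depth])
    return sorted(kept)  # return peaks in ascending m/z order
```

Because each window contributes at most `depth` peaks, the chance that a predicted fragment matches a retained peak at random scales with the peak depth, which is what the binomial scoring relies on.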

The Peptide Score was created as described in the Results section for Figure 3b. All predicted b- and y-type ions for every possible site localization were overlaid with the processed MS/MS spectrum. The cumulative binomial probability P was calculated given the numbers of trials (b- and y-type ions applied) and successes (matching b- or y-type ions) and the probability of a random match, given by the peak depth (number of peaks) per 100 m/z units. A peak depth from 1 to 10 was considered for each analysis. The fragment ion tolerance for a match was set to ±0.6 m/z units. The ten scores for every site permutation were transformed to -10 × log(P) and then plotted (Fig. 3b). The final Peptide Score for a peptide was the weighted average for each score at each peak depth (peak 1 = 0.5; 2 = 0.75; 3 = 1; 4 = 1; 5 = 1; 6 = 1; 7 = 0.75; 8 = 0.5; 9 = 0.25; 10 = 0.25). The two highest-scoring permutations for each phosphorylation site were then used to create the Ascore.
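
The Peptide Score arithmetic can be sketched as follows. Several details are assumptions rather than statements from the text: the logarithm is taken as base 10, the probability of a random match at peak depth i is taken as i/100, and the weighted average is normalized by the sum of the weights. Function names are hypothetical.

```python
from math import comb, log10

# Weights for peak depths 1..10, as listed in the text.
DEPTH_WEIGHTS = [0.5, 0.75, 1, 1, 1, 1, 0.75, 0.5, 0.25, 0.25]
FRAGMENT_TOL = 0.6  # +/- m/z units


def count_matches(predicted_mz, observed_mz, tol=FRAGMENT_TOL):
    """Number of predicted ions with an observed peak within tolerance."""
    return sum(
        any(abs(obs - pred) <= tol for obs in observed_mz)
        for pred in predicted_mz
    )


def binomial_tail(trials, successes, p):
    """Cumulative binomial probability P(X >= successes)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def depth_score(trials, successes, depth):
    """-10 x log10(P) at one peak depth (random-match p = depth/100)."""
    tail = binomial_tail(trials, successes, depth / 100.0)
    return -10.0 * log10(max(tail, 1e-300))  # guard against log10(0)


def peptide_score(trials, matches_by_depth):
    """Weighted average of the ten per-depth scores.

    matches_by_depth: matched-ion counts at peak depths 1..10.
    """
    scores = [
        depth_score(trials, matches, depth)
        for depth, matches in enumerate(matches_by_depth, start=1)
    ]
    weighted = sum(w * s for w, s in zip(DEPTH_WEIGHTS, scores))
    return weighted / sum(DEPTH_WEIGHTS)
```

The weights favor intermediate peak depths, so a permutation that matches many ions only when nearly every peak is retained contributes less than one that matches well at depths 3 to 6.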

The Ascore calculation (Fig. 3c) is essentially the same as for the Peptide Score, but (i) uses only site-determining ions, (ii) is performed at the earliest peak depth that represented the maximum difference between the two highest-scoring locations and (iii) is a difference score. The Ascore then is a metric that measures the likelihood of matching at least the difference in the number of site-determining ions measured by chance from the top two candidate sites. In this study, Ascore values ≥19 (P < 0.01) were considered localized with near certainty (>99%).
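
A sketch of the three steps is given below, under stated assumptions: site-determining ions are taken to be the b- and y-type fragments that contain exactly one of the two candidate residues, the logarithm is base 10, and the random-match probability at peak depth i is i/100. Residue positions are 0-based and all function names are hypothetical; this is not the authors' PHP implementation.

```python
from math import comb, log10


def binomial_tail(trials, successes, p):
    """Cumulative binomial probability P(X >= successes)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def site_determining_ions(length, site_a, site_b):
    """Ion numbers of b- and y-type fragments that separate two sites.

    b_k spans residues 0..k-1 and y_k spans the last k residues, so a
    fragment is site-determining when it contains only one candidate.
    """
    first, second = sorted((site_a, site_b))
    b_ions = list(range(first + 1, second + 1))
    y_ions = list(range(length - second, length - first))
    return b_ions, y_ions


def best_depth(scores_top, scores_second):
    """Earliest peak depth (1-based) with the maximum score difference."""
    diffs = [a - b for a, b in zip(scores_top, scores_second)]
    return diffs.index(max(diffs)) + 1  # index() returns the first maximum


def ascore(n_site_ions, matches_top, matches_second, depth):
    """Difference of -10 x log10(P) between the top two candidate sites."""
    p = depth / 100.0

    def score(matches):
        tail = binomial_tail(n_site_ions, matches, p)
        return -10.0 * log10(max(tail, 1e-300))  # guard against log10(0)

    return score(matches_top) - score(matches_second)
```

For example, if all four site-determining ions match for the best site and none match for the runner-up at peak depth 5, this sketch returns a score of about 52, well above the threshold of 19, whereas equal match counts for both sites return 0.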

If more than one phosphorylation site existed, an individual Ascore was calculated for each site. The initial testing of the Ascore and Peptide Score algorithms included the utilization of neutral loss events from b- and y-type fragment ions. This included losses of both water/ammonia and phosphoric acid. Although these additional ions are sometimes present, taking them into consideration greatly reduced the Peptide Score and generally reduced the Ascore values in the data sets examined because neutral loss events were not consistently observed from fragment ions and when observed were often at lower peak depths (data not shown).

Motif analysis.
Phosphopeptide sequences were submitted to the Motif-X algorithm15 (http://motif-x.med.harvard.edu/) for the discovery of phosphorylation motifs present in our data set. The Human IPI database was used as a background. Sequences were centered on each phosphorylation site and extended to 13 amino acids (±6 residues). Only those sites with Ascore > 15 were used. Sites that could not be extended because of N or C termini were excluded by the Motif-X algorithm. The significance threshold was set to P < 10^-6. The minimum number of motif occurrences was set to ≈2% of the entire number of phosphorylations found for each residue. The vast majority of phosphorylation sites in our data set contained the proline-directed motif, pTP or pSP, which was expected because many proline-directed kinases are active during prometaphase. Surprisingly, substantially fewer acidic phosphorylation sites were found when compared to other large-scale experiments3, 4.
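
Preparing the centered 13-residue sequences can be sketched as below. This illustrates the input preparation only and is not part of Motif-X; the function name and the representation of sites as (0-based position, Ascore) pairs are assumptions.

```python
def motif_windows(protein, sites, flank=6, min_ascore=15.0):
    """Return 13-mers centered on each qualifying phosphorylation site.

    protein: full protein sequence.
    sites:   list of (position, ascore) tuples, positions 0-based.
    Sites with Ascore <= min_ascore, or too close to a terminus to be
    extended by `flank` residues on both sides, are skipped.
    """
    windows = []
    for position, site_ascore in sites:
        if site_ascore <= min_ascore:
            continue
        start, end = position - flank, position + flank + 1
        if start < 0 or end > len(protein):
            continue  # cannot be extended because of the N or C terminus
        windows.append(protein[start:end])
    return windows
```

Each returned window has the phosphorylated residue at index 6, so positions in a discovered motif can be read directly as offsets from -6 to +6.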

Software.
Charge-state and monoisotopic mass were determined using a Perl script that runs on the Linux operating system. To access MS and MS/MS data, .RAW files were converted to the open-raw format using the program xr2or (http://arep.med.harvard.edu/OpenRaw/). The software to calculate the Ascore was written in PHP, also under the Linux operating system. This software allowed the batch calculation of entire data sets in a fully automated process using the methods described in the Results. Spectra for Ascore determination can also be submitted (http://Ascore.med.harvard.edu).

Note: Supplementary information is available on the Nature Biotechnology website.

Author contributions
S.A.B. conducted all experiments, carried out algorithm development and implementation, and data analysis. J.V. and S.A.G. provided analytical expertise for SCX chromatography and mass spectrometry. J.R. provided synthetic peptide libraries and immunoprecipitation data. S.P.G. provided overall experimental design and support.

Received 25 May 2006; Accepted 13 July 2006; Published online: 10 September 2006.

REFERENCES
  1. Kim, J.E., Tannenbaum, S.R. & White, F.M. Global phosphoproteome of HT-29 human colon adenocarcinoma cells. J. Proteome Res. 4, 1339–1346 (2005).
  2. Cantin, G.T., Venable, J.D., Cociorva, D. & Yates, J.R., III. Quantitative phosphoproteomic analysis of the tumor necrosis factor pathway. J. Proteome Res. 5, 127–134 (2006).
  3. Ballif, B.A., Villen, J., Beausoleil, S.A., Schwartz, D. & Gygi, S.P. Phosphoproteomic analysis of the developing mouse brain. Mol. Cell. Proteomics 3, 1093–1101 (2004).
  4. Beausoleil, S.A. et al. Large-scale characterization of HeLa cell nuclear phosphoproteins. Proc. Natl. Acad. Sci. USA 101, 12130–12135 (2004).
  5. Ficarro, S.B. et al. Phosphoproteome analysis by mass spectrometry and its application to Saccharomyces cerevisiae. Nat. Biotechnol. 20, 301–305 (2002).
  6. Gruhler, A. et al. Quantitative phosphoproteomics applied to the yeast pheromone signaling pathway. Mol. Cell. Proteomics 4, 310–327 (2005).
  7. Nuhse, T.S., Stensballe, A., Jensen, O.N. & Peck, S.C. Large-scale analysis of in vivo phosphorylated membrane proteins by immobilized metal ion affinity chromatography and mass spectrometry. Mol. Cell. Proteomics 2, 1234–1243 (2003).
  8. Rush, J. et al. Immunoaffinity profiling of tyrosine phosphorylation in cancer cells. Nat. Biotechnol. 23, 94–101 (2005).
  9. Collins, M.O. et al. Proteomic analysis of in vivo phosphorylated synaptic proteins. J. Biol. Chem. 280, 5972–5982 (2005).
  10. Trinidad, J.C., Specht, C.G., Thalhammer, A., Schoepfer, R. & Burlingame, A.L. Comprehensive identification of phosphorylation sites in postsynaptic density preparations. Mol. Cell. Proteomics 5, 914–922 (2006).
  11. MacCoss, M.J. Computational analysis of shotgun proteomics data. Curr. Opin. Chem. Biol. 9, 88–94 (2005).
  12. Peng, J., Elias, J.E., Thoreen, C.C., Licklider, L.J. & Gygi, S.P. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J. Proteome Res. 2, 43–50 (2003).
  13. Elias, J.E., Gibbons, F.D., King, O.D., Roth, F.P. & Gygi, S.P. Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nat. Biotechnol. 22, 214–219 (2004).
  14. DeGnore, J.P. & Qin, J. Fragmentation of phosphopeptides in an ion trap mass spectrometer. J. Am. Soc. Mass Spectrom. 9, 1175–1188 (1998).
  15. Schwartz, D. & Gygi, S.P. An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets. Nat. Biotechnol. 23, 1391–1398 (2005).
  16. Elias, J.E., Haas, W., Faherty, B.K. & Gygi, S.P. Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nat. Methods 2, 667–675 (2005).
  17. Pawson, T. & Scott, J.D. Protein phosphorylation in signaling: 50 years and counting. Trends Biochem. Sci. 30, 286–290 (2005).
  18. Ballif, B.A. et al. Quantitative phosphorylation profiling of the ERK/p90 ribosomal S6 kinase-signaling cassette and its targets, the tuberous sclerosis tumor suppressors. Proc. Natl. Acad. Sci. USA 102, 667–672 (2005).
  19. Stemmann, O., Zou, H., Gerber, S.A., Gygi, S.P. & Kirschner, M.W. Dual inhibition of sister chromatid separation at metaphase. Cell 107, 715–726 (2001).