pGlyco: a pipeline for the identification of intact N-glycopeptides by using HCD- and CID-MS/MS and MS3

Confident characterization of the microheterogeneity of protein glycosylation through identification of intact glycopeptides remains one of the toughest analytical challenges in glycoproteomics. Recently proposed mass spectrometry (MS)-based methods still have shortcomings, such as the lack of false discovery rate (FDR) analysis for glycan identification and insufficient fragmentation information for peptide identification. Here we propose pGlyco, a novel pipeline for the identification of intact glycopeptides using complementary MS techniques: 1) HCD-MS/MS followed by product-dependent CID-MS/MS was used to provide complementary fragments to identify the glycans, and a novel target-decoy method was developed to estimate the false discovery rate of the glycan identification; 2) data-dependent acquisition of MS3 for some of the most intense peaks of HCD-MS/MS was used to provide fragments to identify the peptide backbones. By integrating HCD-MS/MS, CID-MS/MS and MS3, intact glycopeptides could be confidently identified. With pGlyco, a standard glycoprotein mixture was analyzed on the Orbitrap Fusion, and 309 non-redundant intact glycopeptides were identified with detailed spectral information for both glycans and peptides.


(see Table S-1). These peaks were removed from all spectra before identification. Meanwhile, in HCD-MS/MS spectra at 40% NCE, we found that the peak at m/z 138.055 was always the most intense peak for glycopeptides. Our statistics showed that when the 138.055 peak exceeded a relative intensity of 30% in HCD-MS/MS (NCE = 40%), 99.4% of spectra contained the 204.087 ion and 93.5% contained the 366.140 ion, which implied that the 138.055 ion alone was sufficient to select glycopeptide precursors. We therefore used the 138.055 peak to trigger CID-MS/MS and MS3 for true glycopeptides in our dataset.

The spectrum-based decoy method and the FMM
Scores of decoy identifications could be used to construct the score distribution of incorrect matches, whereas the score distribution of target identifications is a mixture of the score distributions of correct and incorrect identifications. For the target-decoy approach in peptide identification, the basic assumption is "the number of incorrect identifications from target or decoy sequences are equally likely [2]," which cannot be guaranteed when decoy identifications are constructed for glycan identification. Therefore, to resolve this mixture in the target-decoy approach for glycan identification, we employed a finite mixture model to estimate the density functions of the correct and incorrect score distributions, denoted by f(x|+) and f(x|−), respectively. The mixture probability of incorrect identifications, denoted by π0, was estimated at the same time. f(x|−) was modeled by a finite gamma-mixture model, in which several gamma distributions were fitted to the decoy scores by the expectation-maximization (EM) algorithm and the number of gamma components was determined by the Bayesian information criterion (BIC) [3]. f(x|+) was then estimated by a finite Gaussian-mixture model.
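As a rough illustration of the mixture fit described above, the following Python sketch runs EM on target scores for f(x) = π0·f(x|−) + (1 − π0)·f(x|+), holding f(x|−) fixed (for example, a density fitted beforehand to decoy scores). As a simplifying assumption, f(x|+) is modeled here with a single Gaussian rather than a Gaussian mixture. The function names, the initialization, and the stopping rule are illustrative choices and are not taken from pGlyco.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a Gaussian N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_target_mixture(scores, f_neg, n_iter=500, tol=1e-8):
    """EM for f(x) = pi0 * f_neg(x) + (1 - pi0) * N(x; mu, sigma).

    f_neg is a fixed callable for f(x|-); only pi0, mu and sigma are updated.
    Returns (pi0, mu, sigma).
    """
    n = len(scores)
    neg = [f_neg(x) for x in scores]  # f(x|-) never changes during EM
    mean = sum(scores) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in scores) / n) or 1.0
    mu = sorted(scores)[(3 * n) // 4]  # assumed init: upper-quartile score
    pi0 = 0.5
    for _ in range(n_iter):
        # E-step: posterior probability that each score is a correct match.
        resp = []
        for x, fn in zip(scores, neg):
            a = pi0 * fn
            b = (1.0 - pi0) * normal_pdf(x, mu, sigma)
            resp.append(b / (a + b) if a + b > 0.0 else 0.0)
        w = sum(resp)
        if w == 0.0:
            break
        # M-step: re-estimate the mixing weight and the Gaussian parameters.
        new_pi0 = 1.0 - w / n
        new_mu = sum(r * x for r, x in zip(resp, scores)) / w
        var = sum(r * (x - new_mu) ** 2 for r, x in zip(resp, scores)) / w
        new_sigma = max(math.sqrt(var), 1e-6)
        change = abs(new_pi0 - pi0) + abs(new_mu - mu) + abs(new_sigma - sigma)
        pi0, mu, sigma = new_pi0, new_mu, new_sigma
        if change < tol:
            break
    return pi0, mu, sigma
```

Any callable density can be passed as `f_neg`, so a gamma-mixture fitted to decoy scores could be plugged in without changing the loop.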
The EM algorithm was also employed for this estimation.
One concern about a novel FDR estimation method is that it may underestimate the FDR. As one validation, we tested the finite mixture model (FMM) and the spectrum-based decoy method on routine peptide identification problems: if underestimation occurred in the glycan FDR estimation, it would probably occur in the peptide FDR estimation as well, because a saccharide residue is analogous to an amino acid residue. In our test, we used two public HCD datasets of HeLa cells generated on the Orbitrap Velos and Q-Exactive (QE), respectively [6], and an HCD dataset of 8 standard proteins generated on the LTQ-Orbitrap XL [7,8]. The RAW data files were converted into MGF files by pXtract and then searched by pFind 2.8. The protein sequence database was SwissProt (v12.05, Homo sapiens) for the Velos and QE datasets; for the standard-protein dataset, it consisted of the sequences of the 8 standard proteins mixed with yeast sequences, as described in [7].
The enzyme was trypsin, and the maximal number of missed cleavages was 2.
The key assumption of the target-decoy method is "the number of incorrect identifications from target or decoy sequences are equally likely [2]", which we call the "1:1 assumption". Under the "1:1 assumption", the FDR can be estimated as #decoy/#target. It is widely accepted that the "1:1 assumption" holds when the sequence-based decoy method is used. In the results for the Velos dataset, the number of sequence-based decoy identifications was 3,515, which implied that there were approximately 3,515 incorrect results among the target identifications. However, the number of spectrum-based decoy identifications was only 1,569, so the ratio was far from 1:1 (3,515 : 1,569 ≈ 2.24 : 1). In the QE dataset, the ratio was 9,119 to 3,691 (≈ 2.47 : 1). Therefore, there was no evidence that the "1:1 assumption" still held when the spectrum-based decoy method was used. This is why we used the FMM algorithm to model the bias relative to the "1:1 assumption", so that the FDR estimated by the spectrum-based decoy method would be similar to that of the sequence-based target-decoy method.
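To make the "1:1 assumption" concrete, the sketch below estimates the FDR as #decoy/#target above a score cutoff and finds the most permissive cutoff that keeps the estimate within a limit. It also recomputes the ratios of sequence-based to spectrum-based decoy counts from the numbers reported in the text. The function names and the (score, is_decoy) input layout are assumptions made for illustration, and tied scores are not treated specially.

```python
def fdr_cutoff(psms, max_fdr):
    """Return (cutoff_score, n_target, n_decoy) for the lowest score cutoff
    whose estimated FDR (#decoy / #target) does not exceed max_fdr.

    psms: iterable of (score, is_decoy) pairs; a higher score is a better match.
    Returns None if no cutoff satisfies the limit.
    """
    best = None
    n_target = n_decoy = 0
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            n_decoy += 1
        else:
            n_target += 1
        # Keep the last (lowest-scoring) position that still meets the limit.
        if n_target > 0 and n_decoy / n_target <= max_fdr:
            best = (score, n_target, n_decoy)
    return best


def decoy_ratio(n_sequence_decoy, n_spectrum_decoy):
    """Ratio of sequence-based to spectrum-based decoy identifications."""
    return n_sequence_decoy / n_spectrum_decoy


# Counts reported in the text for the Velos and QE datasets.
print(round(decoy_ratio(3515, 1569), 2))  # 2.24
print(round(decoy_ratio(9119, 3691), 2))  # 2.47
```

A ratio well above 1 means that dividing the spectrum-based decoy count by the target count would understate the number of incorrect target matches, which is the bias the FMM is meant to absorb.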
On the Velos dataset, we also tested the spectrum-based decoy method with different random mass ranges under the same search parameters, such as adding a fixed mass to each y ion, adding a random mass ranging from −15 to 15 Da, adding a random mass ranging from 1 to 20 Da, or adding a random mass ranging from 1 to 40 Da. We found that adding a random mass ranging from 1 to 30 Da was at least a sub-optimal spectrum-based decoy method, as illustrated in Figure S
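A minimal sketch of the mass-shift idea, assuming the decoy is built by adding an independent random offset to each theoretical fragment ion mass. How the offset is drawn (uniformly or otherwise, real-valued or integer) is not stated here, so the uniform real-valued draw and all names below are assumptions.

```python
import random


def make_decoy_ions(ion_masses, low=1.0, high=30.0, rng=None):
    """Shift every theoretical ion mass by its own random offset in [low, high] Da."""
    rng = rng or random.Random()
    return [m + rng.uniform(low, high) for m in ion_masses]


# Example with made-up masses, seeded for reproducibility.
target_ions = [500.25, 703.33, 865.38]
decoy_ions = make_decoy_ions(target_ions, rng=random.Random(42))
```

Changing `low` and `high` reproduces the other ranges mentioned above, for example −15 to 15 Da or 1 to 40 Da.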

Best parameter for the Y1 ion filtration
The trimannosyl core is a stable structure in N-glycans, and it is feasible to use this information to filter out unreliable Y1 ions. In pGlyco, a Y1 ion is considered reliable if at least 3 trimannosyl core ions are matched or if a (Y1, Y1*) ion pair with the same charge state is present. This parameter should be verified for optimality, which could be done based on the 765 manually checked GPSMs, as shown in Table S-3.
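The rule can be written as a small predicate. This is a sketch under assumptions: the inputs (a count of matched trimannosyl core ions and the charge states at which Y1 and Y1* were matched) and the function name are hypothetical and do not reflect pGlyco's actual data structures.

```python
def is_reliable_y1(n_core_matched, y1_charges, y1_star_charges, min_core=3):
    """Return True if a candidate Y1 ion passes the "core >= 3 or Y1/Y1*" rule.

    n_core_matched: number of matched trimannosyl core ions.
    y1_charges / y1_star_charges: charge states at which Y1 / Y1* were matched.
    """
    # A (Y1, Y1*) pair counts only if both ions share a charge state.
    has_pair = bool(set(y1_charges) & set(y1_star_charges))
    return n_core_matched >= min_core or has_pair
```

For example, `is_reliable_y1(2, [2], [2])` passes through the ion pair even though only two core ions are matched, whereas `is_reliable_y1(2, [2], [3])` fails because the charge states differ.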
Under the default Y1 filtration condition of pGlyco, the sensitivity and accuracy were 80.0% and 99.8%, respectively, at 1% glycan FDR (the peptide FDR of the MS3 identification was 1%); the remaining 20.0% of true-positive GPSMs did not score high enough to pass the filtration. When the Y1/Y1* ion pair was used alone, the sensitivity decreased to 53.5%. When "core ≥ 3" was used alone, the sensitivity and accuracy were quite acceptable, so using only "core ≥ 3" for the Y1 ion filtration, as described in previous work [9], is also reasonable. The accuracy increased from 84.7% to 90.4% with the parameter "core ≥ 3 or Y1/Y1*" compared with the parameter "core ≥ 3". When the required number of matched trimannosyl core ions was increased to 4, 5 or 6, the sensitivity dropped, although the accuracy was very high even without an FDR cutoff. To balance sensitivity and accuracy, pGlyco used the condition "core ≥ 3 or Y1/Y1*" to filter Y1 ions. The parameter "core ≥ 1 (or 2)" was not tested in Table S-3, because it was difficult to judge whether a GPSM is correct with only one or two trimannosyl core ions matched.
In Table S-3, "core ≥ x" means that the Y1 ion is filtered by requiring at least x matched trimannosyl core ions. "Sensitivity" and "accuracy" are measured after filtration at glycan FDR ≤ 1% by pGlyco, whereas "sensitivity (no FDR)" and "accuracy (no FDR)" are measured without the glycan FDR ≤ 1% filtration.
(Figure S-6a). But this peptide was well fragmented by HCD (Figure S-6b), in which the intensities of Y0, Y1* and Y1−H2O were much lower.
Furthermore, HCD fragmentation has overcome the low-mass cutoff problem [10], so it provides more complete b, y ions for the peptide identification. Although sometimes the parent ion of MS3 was not well fragmented in HCD, leaving highly intense Y1 and Y1* ions, such as the peptide "LVPVPITJ[+HexNAc]ATLDR", as shown in Figure S-