GproDIA enables data-independent acquisition glycoproteomics with comprehensive statistical control

Large-scale profiling of intact glycopeptides is critical but challenging in glycoproteomics. Data-independent acquisition (DIA) is an emerging technology with deep proteome coverage and accurate quantitative capability in proteomics studies, but it is still at an early stage of development in the field of glycoproteomics. We propose GproDIA, a framework for the proteome-wide characterization of intact glycopeptides from DIA data with comprehensive statistical control by a 2-dimensional false discovery rate approach and a glycoform inference algorithm, enabling accurate identification of intact glycopeptides using wide isolation windows. We further utilize a semi-empirical spectrum prediction strategy to expand the coverage of spectral libraries of glycopeptides. We benchmark our method for N-glycopeptide profiling on DIA data of yeast and human serum samples, demonstrating that DIA with GproDIA outperforms data-dependent acquisition-based methods for glycoproteomics in terms of capacity and data completeness of identification, as well as accuracy and precision of quantification. We expect that this work can provide a powerful tool for glycoproteomic studies.


Fission yeast SSL
A sample-specific spectral library of fission yeast generated from DDA data using a 6 h LC gradient with 3 repeat injections. An extra DDA injection with a 1 h LC gradient has been used for RT calibration. Data with the 1 h gradient have also been appended to the calibrated library.

Fission yeast LRL
A lab repository-scale spectral library of fission yeast generated by combining the SSL library and fission yeast data of previous projects in our labs [a]. RTs have been calibrated to the 1 h LC gradient.

Fission yeast EXL
An extended spectral library of fission yeast generated by combining the SSL library and a semi-empirical library generated from the SSL library.

Budding yeast
A lab repository-scale spectral library of budding yeast generated from budding yeast DDA data. RTs have been calibrated to the 1 h LC gradient using an extra injection of DDA.

Serum SSL
A sample-specific spectral library of human serum generated from DDA data using a 1 h LC gradient with 20 fractions. An extra DDA injection with a 1 h LC gradient without fractionation has been used for RT calibration. Data with the 1 h gradient and without fractionation have also been appended to the calibrated library. It contains 3518 precursors of 2402 glycopeptides, 2082 site-specific glycans, and 396 protein glycosites (excluding decoys).

Serum LRL
A lab repository-scale spectral library of human serum generated by combining the SSL library and serum data of previous projects in our labs [a]. RTs have been calibrated to the 1 h LC gradient. It contains 5734 precursors of 4011 glycopeptides, 3519 site-specific glycans, and 571 protein glycosites (excluding decoys).

Serum EXL
An extended spectral library of human serum generated by combining the SSL library and a semi-empirical library generated from the SSL library. It contains 5508 precursors of 3433 glycopeptides, 3009 site-specific glycans, and 396 protein glycosites (excluding decoys).

Yeast + serum SSL
A spectral library generated by combining the budding yeast library and the serum SSL library.

Yeast + serum LRL
A spectral library generated by combining the budding yeast library and the serum LRL library.

Synthetic
A spectral library of 14 sialylated synthetic glycopeptides generated from DDA data using a 1 h LC gradient with 3 repeat injections. The DDA data and data of fucosylated glycopeptides collected from previous projects of our lab have been used to generate the semi-empirical entrapment glycopeptides, which have been appended to the library. Peptide sequences and glycans are listed in Supplementary Table 2.

Both entrapment
A spectral library generated by combining the fission yeast SSL library and 500 precursors randomly sampled from the serum SSL library.

Peptide entrapment
A spectral library generated by combining the fission yeast SSL library and 500 entrapment precursors [b]. The entrapment precursors have been generated semi-empirically using the fission yeast and serum SSL libraries, and then randomly subsampled to 500. The peptide sequences of the entrapment precursors are from human, while the glycans are from yeast.

Glycan entrapment
A spectral library generated by combining the fission yeast SSL library and 500 entrapment precursors [b]. The entrapment precursors have been generated semi-empirically using the fission yeast and serum SSL libraries, and then randomly subsampled to 500. The peptide sequences of the entrapment precursors are from yeast, while the glycans are from human.

Serum + plant glycan entrapment
A spectral library generated by combining the serum SSL library and 3500 entrapment precursors [b]. The entrapment precursors have been generated semi-empirically using the serum SSL library and data of Arabidopsis thaliana glycopeptides collected from previous projects of our lab, and then randomly subsampled to 3500. The peptide sequences of the entrapment precursors are from human, while the glycans are from A. thaliana.

Both entrapment EXL
A spectral library generated by combining the fission yeast EXL library and 850 precursors randomly sampled from the serum SSL library.

Peptide entrapment EXL
A spectral library generated by combining the fission yeast EXL library and 850 entrapment precursors [b]. The entrapment precursors have been generated semi-empirically using the fission yeast and serum SSL libraries, and then randomly subsampled to 850. The peptide sequences of the entrapment precursors are from human, while the glycans are from yeast.

Glycan entrapment EXL
A spectral library generated by combining the fission yeast EXL library and 850 entrapment precursors [b]. The entrapment precursors have been generated semi-empirically using the fission yeast and serum SSL libraries, and then randomly subsampled to 850. The peptide sequences of the entrapment precursors are from yeast, while the glycans are from human.

[a] Libraries are combined by generating a "consensus" spectrum when multiple spectra for a glycopeptide exist in different libraries. During this procedure, some glycopeptides can be eliminated if the consensus spectrum does not meet the criteria described under "Spectral library building" in the Methods.

[b] As a special case, when generating entrapment libraries using the semi-empirical approach, variants with the same glycan monosaccharide composition but different isomeric glycan structures are regarded as different precursors, which can have different glycan fragment peaks. The same criteria are followed when counting entrapment identifications in DIA results.

"N" replaced by "J" denotes the glycosylation site. A glycan composition is represented in the form of "N-H-A-F", where "H" stands for Hex, "N" stands for HexNAc, "A" stands for NeuAc, and "F" stands for Fuc.

Supplementary Fig. 2. Contour plots of posterior error probability (PEP) of peak groups extracted from the fission yeast DIA data using the SSL library. (a) PEP that the peptide part of a peak group is a false identification. (b) PEP that the glycan part of a peak group is a false identification. (c) PEP that both the peptide and the glycan parts of a peak group are false identifications. (d) PEP that both/either the peptide and/or glycan parts of a peak group are false identifications. Green indicates target peak groups, yellow indicates peptide decoy peak groups, blue indicates glycan decoy peak groups, and red indicates both decoy peak groups. Source data are provided as a Source Data file.
[...] no outliers are shown. The dashed lines indicate theoretical fold changes of the organisms (1:0.9:0.8 (S10:S12:S15) for human and 1:1.1:1.2 (S10:S12:S15) for yeast).

[...] The run-specific context conducts separate error rate estimation for each run, whereas the global context only considers the best-scoring peak group per analyte across the entire experiment. The color gradient indicates the relative D-score.

Supplementary Note 1. Optimization of the search space of glycoform inference
When generating identification transitions for glycoform inference, only the top nbg background glycan structures were included in the spectral library to limit the search space. We tested nbg values from 20 to 70 on the fission yeast data using the entrapment approach. The results using glycan entrapment libraries are presented in Supplementary Fig. 12. After applying a 1% glycoform-level q-value filter, the entrapment percentage declined from 1.5% to 0.6% as nbg increased from 20 to 70. At nbg = 50, the entrapment percentage was ~1%. Therefore, we chose nbg = 50 as a trade-off between accuracy and the size of the search space.
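The selection described in this note can be sketched in a few lines. This is a minimal illustration and not GproDIA code: the function names are hypothetical, the only values used are the three percentages quoted above (with "~1%" entered as 1.0), and the 1% target is an assumption that mirrors the q-value filter.

```python
from typing import Dict


def entrapment_percentage(n_entrapment: int, n_reported: int) -> float:
    """Percentage of reported identifications that are entrapment hits."""
    if n_reported <= 0:
        raise ValueError("n_reported must be positive")
    return 100.0 * n_entrapment / n_reported


def smallest_nbg_within(percent_by_nbg: Dict[int, float], max_percent: float) -> int:
    """Smallest tested nbg whose entrapment percentage is at most max_percent."""
    for nbg, pct in sorted(percent_by_nbg.items()):
        if pct <= max_percent:
            return nbg
    raise ValueError("no tested nbg meets the target")


# Percentages quoted in the note; the value for nbg = 50 is approximate.
quoted = {20: 1.5, 50: 1.0, 70: 0.6}
chosen = smallest_nbg_within(quoted, max_percent=1.0)
print(chosen)
```

Choosing the smallest nbg that meets the target keeps the search space as small as possible while holding the entrapment percentage near the nominal error rate.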

Supplementary Note 2. Performance comparison of comprehensive statistical control by GproDIA with the peptide-only FDR control
We compared the statistical control of glycopeptide error rates by GproDIA with the peptide-only FDR control approach designed for peptide DIA analysis, using the fission yeast data and the entrapment strategy. Two glycan entrapment libraries were built: (i) one contains only peptide b/y, b-N1/y-N1 and b$/y$ fragments, excluding all Y ions, to follow the reported strategies in which spectral libraries were generated from deglycosylated peptides [1,2] or peptides with truncated glycans [3]; (ii) the other contains both peptide fragments and Y ions. The entrapment glycopeptides were built with peptide sequences from yeast and glycans from human. Only peptide decoys were generated and appended to the entrapment libraries; no glycan decoys were used.
The fission yeast data were analyzed using the OpenSWATH-PyProphet pipeline [4,5] for peptide analysis. When scoring the peak groups using PyProphet, both MS1 and MS2 features were considered by setting the parameter "--level=ms1ms2". The peak group q-value cutoff was 1%. Peptide/protein inference and subsequent steps were not performed. The results are presented in Supplementary Fig. 13. The peptide-only FDR approach cannot properly control error rates for glycopeptides, which again underscores the importance of comprehensive statistical control by the 2D FDR and glycoform inference.

Supplementary Note 3. Benchmarking using synthetic glycopeptides.
We further benchmarked the ability of GproDIA to differentiate glycoforms with near-identical masses on data of a synthetic glycopeptide sample. The sample contained 14 synthetic glycopeptides, comprising 7 peptide sequences with 2 sialylated glycans for each sequence (Supplementary Table 2).
DIA was performed with a 1 h LC gradient and 3 repeat injections. Since the same LC condition was used for DDA and DIA, retention time normalization was not performed and no anchor TraML file was specified. Multi-run alignment was not performed, and no global glycopeptide-level q-value filter was applied. After applying a 1% peak group-level q-value filter and a 1% glycoform-level q-value filter, 54 peak groups were reported, including 2 (~4%) entrapment peak groups. Across the 3 DIA replicate runs, 13 of the 14 glycopeptides (93% recall) were detected in total, while 1 entrapment glycopeptide was reported (Supplementary Data 10 and Supplementary Fig. 14).
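As a quick check of the rounded figures in this note, the two ratios can be recomputed directly from the reported counts. The variable names below are illustrative only.

```python
# Counts reported in this note for the synthetic glycopeptide benchmark.
reported_peak_groups = 54
entrapment_peak_groups = 2
expected_glycopeptides = 14
detected_glycopeptides = 13

entrapment_fraction = entrapment_peak_groups / reported_peak_groups
recall = detected_glycopeptides / expected_glycopeptides

print(f"entrapment peak groups: {entrapment_fraction:.1%}")  # 3.7%
print(f"recall: {recall:.1%}")  # 92.9%
```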

Supplementary Note 5. Examination of oxonium ions in DIA data
In DDA data analysis of glycopeptides, glycan oxonium ions are usually utilized as diagnostic peaks and aid in the identification of different types of glycans [7]. [...] Data 11 and 12).

MS/MS
Among the DIA identification results of the human serum data using the serum SSL library with glycoform inference enabled, there were 265 site-specific glycans (corresponding to 322 glycopeptide precursors) missed by DDA, considering the identifications shared in 2/3 runs (Fig. 4c). [...] Fig. 18).
It should be noted that we used targeted MS/MS to support the identification results obtained by DIA; the absence of a glycopeptide in the targeted MS/MS data does not indicate that its identification was wrong.

Supplementary Note 7. Limitations of the 2-dimensional FDR approach for small datasets.
For the peptide part of the 2-dimensional FDR approach, the reverse decoy approach was inherited from the DIA analysis method for non-glycosylated peptides. For the glycan part, random mass shifts were applied to the glycan fragment peaks, as initially reported for pGlyco [7]. Typically, the peptide FDR approach based on reverse decoys is applied to large datasets and may be biased for small datasets.
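The two decoy types described here can be illustrated with a short sketch. It is not the GproDIA or pGlyco implementation. It makes three assumptions: the C-terminal residue is kept fixed on reversal, each glycan fragment receives its own shift, and the shift is drawn uniformly from 1 to 30 Da. The example peptide is hypothetical.

```python
import random
from typing import List


def reverse_peptide_decoy(sequence: str) -> str:
    """Reverse the sequence while keeping the C-terminal residue in place (assumed rule)."""
    if len(sequence) < 2:
        return sequence
    return sequence[:-1][::-1] + sequence[-1]


def shift_glycan_fragments(
    mz_values: List[float],
    rng: random.Random,
    min_shift: float = 1.0,   # assumed lower bound of the shift, in Da
    max_shift: float = 30.0,  # assumed upper bound of the shift, in Da
) -> List[float]:
    """Add an independent random mass shift to each glycan fragment m/z."""
    return [mz + rng.uniform(min_shift, max_shift) for mz in mz_values]


rng = random.Random(0)  # fixed seed so the decoys are reproducible
print(reverse_peptide_decoy("AJGTEK"))  # "J" marks the glycosylation site; prints ETGJAK
print(shift_glycan_fragments([1000.5, 1162.6], rng))
```

Generating the peptide and glycan decoys independently is what allows each dimension of the 2-dimensional FDR to be estimated against its own null distribution.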
The debate on how decoy peptides should be assembled has been ongoing for more than ten years. The principle of decoy transitions in peptide-centric DIA analyses is conceptually related to, but different from, the decoy database approach used for DDA database searching. In contrast to the decoy database approach in DDA, which operates at the database searching level, the decoy transitions in DIA are introduced at the measurement level [12]. Notably, the decoy strategy in DIA analyses was initially proposed for selected reaction monitoring (SRM) feature scoring. In our study, the spectral libraries (except for the synthetic dataset) contain hundreds to thousands of glycopeptide precursors (more than 1000 if entrapments are included), which is not far fewer than the number of peptide assays in an ordinary SRM experiment.
In the statistical control step of DIA analyses, a semi-supervised learning algorithm is executed iteratively to optimize the weights of the individual scores in a combined score that separates targets from decoys [13]. A sufficient amount of training data is normally necessary for the semi-supervised algorithm. In addition, the decoy distribution should match the true-negative part of the target distribution. If the spectral library is too small, these conditions may not be fulfilled, and the machine learning and FDR estimation may be biased. In this study, however, we used the entrapment strategy to assess the quality of statistical control. Even for the synthetic dataset, which is the smallest dataset in the study, the false positives in the peptide part were still well controlled.
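To make the dependence on dataset size concrete, the following simplified sketch estimates q-values by plain target-decoy counting. It is only an illustration: it assumes equal numbers of target and decoy assays, omits the semi-supervised score learning and the pi0 estimation used in practice, and uses a hypothetical function name and made-up scores.

```python
from typing import List


def target_decoy_qvalues(target_scores: List[float], decoy_scores: List[float]) -> List[float]:
    """q-value for each target score, in input order (higher score = better)."""
    labeled = [(s, True, i) for i, s in enumerate(target_scores)]
    labeled += [(s, False, -1) for s in decoy_scores]
    # Best score first; on ties, decoys come first so the estimate is conservative.
    labeled.sort(key=lambda item: (-item[0], item[1]))

    n_targets = 0
    n_decoys = 0
    fdr_at = []  # (target index, estimated FDR when cutting at that target)
    for _score, is_target, idx in labeled:
        if is_target:
            n_targets += 1
            fdr_at.append((idx, min(1.0, n_decoys / n_targets)))
        else:
            n_decoys += 1

    # q-value: the minimum FDR over this cutoff and all more permissive cutoffs.
    qvalues = [0.0] * len(target_scores)
    running_min = 1.0
    for idx, fdr in reversed(fdr_at):
        running_min = min(running_min, fdr)
        qvalues[idx] = running_min
    return qvalues


# Made-up scores, purely for illustration.
print(target_decoy_qvalues([9.0, 8.0, 7.0, 3.0, 2.0], [6.0, 2.5, 1.0]))
# [0.0, 0.0, 0.0, 0.25, 0.4]
```

With only a handful of decoys, each additional decoy changes the estimate in large steps (here from 0 to 0.25), which is one way to see why small libraries give coarse and potentially biased error rate estimates.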
In summary, the current peptide FDR approach has limitations for small datasets. However, it is, to the best of our knowledge, the only choice for peptide-centric DIA analyses at present. Indeed, the emphasis of this work is on algorithms for glycopeptide identification, especially for the glycan part, and the performance of the current method remains acceptable, as validated by different methods, i.e., the entrapment libraries and targeted MS/MS analysis.