A streamlined pipeline for multiplexed quantitative site-specific N-glycoproteomics

Regulation of protein N-glycosylation is essential in human cells. However, large-scale, accurate, and site-specific quantification of glycosylation is still technically challenging. We here introduce SugarQuant, an integrated mass spectrometry-based pipeline comprising protein aggregation capture (PAC)-based sample preparation, multi-notch MS3 acquisition (Glyco-SPS-MS3) and a data-processing tool (GlycoBinder) that enables confident identification and quantification of intact glycopeptides in complex biological samples. PAC significantly reduces sample-handling time without compromising sensitivity. Glyco-SPS-MS3 combines high-resolution MS2 and MS3 scans, resulting in enhanced reporter signals of isobaric mass tags, improved detection of N-glycopeptide fragments, and lowered interference in multiplexed quantification. GlycoBinder enables streamlined processing of Glyco-SPS-MS3 data, followed by a two-step database search, which increases the identification rates of glycopeptides by 22% compared with conventional strategies. We apply SugarQuant to identify and quantify more than 5,000 unique glycoforms in Burkitt’s lymphoma cells, and determine site-specific glycosylation changes that occurred upon inhibition of fucosylation at high confidence.


Supplementary note 1 Selection of quantification methods for intact glycopeptides
Drawing on recent advances in bioanalytical mass spectrometry 1, researchers have employed a variety of methods for the quantitative analysis of intact glycopeptides. The heterogeneity of protein glycosylation, however, often makes quantification challenging. For instance, label-free methods using MS1 extracted ion currents (XICs) or spectral counts have been applied in glycoproteomics with the advantage of simple workflows and lower cost [2][3][4], but they require sophisticated normalization to account for variations in MS response between measurements and for the varied ionization efficiency of glycopeptides bearing diverse glycan compositions 4. This approach also suffers from severe missing values in large-scale glycoproteomics due to drastic differences in glycoform abundances, and from low identification rates of less abundant glycopeptides in data-dependent acquisition (DDA) analysis 3. More recently, data-independent acquisition (DIA) methods have shown the potential to quantify intact glycopeptides with higher sensitivity and fewer missing values 5,6. However, the lack of universal spectral libraries for N-glycopeptides still hampers the application of DIA methods in large-scale glycoproteomics.
Metabolic labelling, such as stable isotope labelling by amino acids in cell culture (SILAC) 7, allows different samples to be combined right after cell harvest, which minimizes possible quantification errors introduced during sample preparation. However, this method is not applicable to many biological materials, especially patient tissues, and only a limited number of conditions (usually up to three) can be compared in one measurement. In addition, heterogeneous glycoforms on one peptide core have closely related masses and do not separate well in standard C18 chromatography, which often makes the MS1 spectra of intact glycopeptides more complicated. SILAC inherently increases the MS1 complexity further, which interferes with SILAC pair determination and XIC extraction. The introduced SILAC pairs can also cause over-sequencing of the same glycopeptides and undersampling in DDA analysis.
Isobaric chemical labelling using TMT or iTRAQ reagents, on the other hand, can be applied to all types of samples 8. It enables sample multiplexing, which reduces overall LC-MS measurement time and the variation introduced between replicates. Importantly, by pooling all samples together, it boosts the signal of low-abundance species that are otherwise not detectable in any individual sample 9. TMT labelling itself also increases the ionization efficiencies of peptides or glycans 10. Since the majority of glycoforms are present at low stoichiometry, this feature enhances the sensitivity and depth of site-specific glycoproteomics. However, despite its successful applications in quantitative glycoproteomics [11][12][13][14], a systematic optimization of experimental parameters for chemically labelled glycopeptides is still missing (see main text). Also, the co-isolation interference that occurs in a standard DDA MS/MS analysis can impair the quantification accuracy of chemically labelled peptides and cause ratio compression 15. Considering the heterogeneity of glycosylation and the closely related masses of glycopeptides, such interference will strongly impact glycoproteomics. We thus developed Glyco-SPS-MS3 in the SugarQuant pipeline to minimize this interference, based on the strategy of multi-notch MS.
Another important factor that interferes with glycopeptide quantification is in-source dissociation (ISD), the loss of terminal glycan moieties in the ion source during LC-MS/MS analysis. ISD is problematic for accurate quantification of glycopeptides, especially for those bearing sialic acid. When a glycopeptide fragments and loses glycan moieties in the source, the resulting fragmented glycopeptide and its original "parent" glycopeptide are detected at the same retention time. If the chromatography does not separate glycopeptides that share the same peptide sequence but bear different glycans, the newly formed ISD glycopeptide can produce an ion chromatogram that overlaps with that of its natural, unfragmented isomers. MS1-based quantification, including label-free and SILAC, will be compromised by both the reduced intensity of the fragmented glycopeptide and the overlapping ion chromatograms. Isobaric labeling-based quantification, on the other hand, is not affected by the former because glycopeptides from all samples are pooled and should theoretically experience equal effects of ISD. However, the ISD glycopeptide and its natural isomers can be co-selected for MS2 and MS3 analyses, leading to inaccurate quantification of the natural glycopeptides.

Supplementary note 2 Development and optimization of sample preparation for quantitative glycoproteomics
Many glycoproteins of biomedical interest participate in processes such as cell surface recognition and are thus membrane-associated. Complete solubilization of these membrane-associated glycoproteins, assisted by detergents or chaotropic reagents, is a critical and necessary step of sample preparation in glycoproteomics. Sodium dodecyl sulfate (SDS) is commonly used because of its outstanding capacity for recovering membrane proteins from a variety of biological materials 16. However, SDS diminishes trypsin activity and causes severe ion suppression in MS analysis, necessitating an additional removal step prior to MS analysis. Conventionally, SDS is eliminated via protein precipitation. Many other methods have also been developed for detergent removal [17][18][19]. Unfortunately, those methods increase the risk of sample loss and make the entire procedure laborious and time-consuming. Alternatively, acid-labile detergents, such as RapiGest (Waters), have been demonstrated to improve the solubilization of hydrophobic proteins/peptides and to facilitate complete proteolysis 20. They undergo hydrolysis under acidic conditions and are compatible with MS analysis without the need for any extra clean-up steps. Similarly, urea is commonly used in proteomics as a substitute for detergents to denature proteins and enhance their solubility. A simple desalting step is sufficient to remove urea from the samples, although extensive dilution prior to proteolysis is often required because urea inhibits proteases at high concentrations.
The selection of detergents or chaotropic reagents directly affects not only the solubility of proteins but also the duration and complexity of the entire sample preparation procedure. We thus compared workflows using SDS, urea and RapiGest for protein extraction (see Online Methods). In our hands, SDS and urea outperformed RapiGest and allowed more proteins and glycoforms to be identified in the subsequent MS analyses (Supp. Fig. 1a), although the RapiGest workflow is more straightforward and faster. Considering that residual urea can negatively influence TMT labelling efficiency, SDS is the optimal choice for multiplexed quantitative glycoproteomics. However, the conventional approach of removing detergent via protein precipitation is too time-consuming, and re-dissolving the pellet takes extra time and effort. We thus sought to simplify the workflow and reduce handling time.
Recently, Hughes et al. described single-pot solid-phase enhanced sample preparation (SP3) and demonstrated its capacity for fast and efficient proteome sample preparation 21. The Olsen group further investigated the underlying mechanism and found that protein clean-up occurred irrespective of microparticle surface chemistry and instead proceeded via protein aggregation capture (PAC) 22. We removed SDS by PAC on magnetic beads followed by trypsin digestion, which shortened handling time and avoided the need to resolubilize hard protein pellets. We compared the capabilities of magnetic beads bearing various surface functional groups to retain protein and found that they all worked properly and differed by less than 5% in protein identifications (Supp. Fig. 1b). We further reduced the digestion time from overnight to 4 hours and observed no significant decrease in protein identifications (up to 5.3%). Instead, we identified up to 9.1% and 4.1% more glycoforms and glycosites, respectively.
To reduce the cost of TMT reagents, we optimized the ratio of TMT labeling reagent to peptides. In this study, we labeled the peptides at a TMT-to-peptide ratio (wt/wt) of 2:1, a four-fold reduction in TMT reagent relative to the vendor's recommendation (800 μg TMT to 100 μg peptides, 8:1). The labeling efficiency was above 99%. For every biological replicate of the fucosylation-inhibited samples, we labeled 400 μg of peptides per condition using only one set of TMT reagents.
We also introduced basic reverse-phase chromatography (bRP) to pre-fractionate the ZIC-HILIC-enriched glycopeptides and showed that it identified 53% more glycopeptides than repetitive injections (Supp. Fig. 1c-e). We further optimized chromatographic settings specifically for TMT-labelled glycopeptides to enhance sensitivity in LC-MS/MS. In summary, the optimized workflow involves SDS-assisted protein extraction, PAC clean-up and proteolysis, TMT labelling, ZIC-HILIC glycopeptide enrichment, and bRP pre-fractionation. The whole workflow can be finished in one day, a three-fold improvement in processing time compared with a conventional protein precipitation method (Supp. Fig. 1g).
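The reagent arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function name is hypothetical, and the only inputs are the ratios and amounts quoted in the text.

```python
def tmt_needed_ug(peptide_ug: float, tmt_to_peptide_ratio: float) -> float:
    """Mass of TMT reagent (ug) required for a given wt/wt TMT-to-peptide ratio."""
    return peptide_ug * tmt_to_peptide_ratio


# Vendor recommendation quoted in the text: 800 ug TMT per 100 ug peptides.
vendor_ratio = 800 / 100
# Ratio used in this study (wt/wt).
reduced_ratio = 2.0

fold_reduction = vendor_ratio / reduced_ratio
tmt_for_400_ug = tmt_needed_ug(400, reduced_ratio)

print(f"fold reduction in reagent: {fold_reduction:g}")
print(f"TMT needed for 400 ug peptides at 2:1: {tmt_for_400_ug:g} ug")
```

At the 2:1 ratio, 400 μg of peptides require the same 800 μg of reagent that the vendor figure assigns to 100 μg.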

Supplementary note 3 Development of Glyco-SPS-MS3 for confident glycopeptide identification and accurate quantification
Confident glycopeptide identification using tandem MS relies on matching a comprehensive series of both glycan and peptide fragments in the acquired spectrum. A myriad of fragmentation strategies have been developed for this purpose 23. Among others, stepped collision energy HCD (sNCE-HCD) and AI-ETD have recently proved their advantages for large-scale identification of intact glycopeptides in complex samples using rather simple MS acquisition methods and data-processing workflows 24,25. However, none of those methods have been applied to chemically labelled glycopeptides, e.g. for multiplexed quantification. Our results suggested the necessity of optimizing CE settings specifically for TMT-labelled glycopeptides (see main text). A multi-stage fragmentation method would cover a broader range of CEs for the different fragment ion series, including glycan Y ions, peptide b-/y-ions, and TMT reporter ions. Indeed, characterization of glycopeptides using MS3-based methods has shown clear advantages [26][27][28]. For example, by looking for and specifically selecting the Y1 ion (peptide core carrying a single HexNAc) for further fragmentation, additional peptide b-/y-ions could be detected in MS3 to support the peptide sequence identification 29. However, this requires prior knowledge of the targeted peptides to select for MS3 and sufficient parent ion intensity to generate useful information from the MS3 experiment, thus limiting its throughput and sensitivity. The lack of software tools to automatically interpret and combine information from different fragmentation stages further limits the applicability of MS3 approaches for large-scale analysis. To overcome the sensitivity issue of MS3 detection, Gygi and coworkers developed synchronous (or multi-notch) precursor selection (SPS) 30,31 to simultaneously select a pre-defined number of MS2 fragments (named notches) for high-NCE (65%) fragmentation to produce reporter ions more efficiently. SPS-MS3 has been shown to be more accurate in quantification than standard MS2 approaches because of reduced interference from co-isolated ions.
In this work, we developed Glyco-SPS-MS3 to combine the advantages of multi-stage fragmentation and multi-notch selection for both confident identification and accurate quantification of chemically labelled intact glycopeptides. Unlike the original SPS-MS3, which utilizes fast ion trap scans and parallelized CID fragmentation for high-speed MS2 peptide sequencing, Glyco-SPS-MS3 uses HCD and Orbitrap detection in both MS2 and MS3. Our results show that HCD fragmentation followed by high-resolution, high-accuracy Orbitrap detection provided more high-scoring N-glycopeptide identifications from IgM digests (Supp. Fig. 6). Glyco-SPS-MS3 applies different HCD NCEs for MS2 and MS3 fragmentation, resulting in the production of complementary sets of fragments. High-resolution and high-accuracy Orbitrap detection of both MS2 and MS3 fragments reduced misidentifications. Multi-notch selection not only enhanced the detection sensitivity of both reporter ions and peptide b-/y-ions, but also decreased co-isolation interference. We fine-tuned several key MS instrument parameters as detailed below (Supplementary Table 1). To broaden the applicability, we optimized the settings for both Fusion and Lumos tribrid instruments (Supplementary Table 2).

MS acquisition cycle
The original SPS-MS3 fragments isolated precursors with collision-induced dissociation (CID) followed by ion trap MS2 detection. It then selects multiple MS2 fragments for further HCD fragmentation with NCE 65 in the ion-routing multipole (IRM) and sends all resulting ions for Orbitrap MS3 detection. The MS2 and MS3 scans are parallelized to reduce the overall duty cycle. However, we require HCD in our Glyco-SPS-MS3 to generate more Y<5 ions in MS2. In addition, high-resolution and high-accuracy Orbitrap scans are required to reduce misidentifications of glycopeptides. Therefore, the entire operation cycle of our Glyco-SPS-MS3 works as follows: A Glyco-SPS-MS3 cycle starts with an MS1 survey scan detected in the Orbitrap. The most intense ions that fulfil our criteria are selected in the quadrupole, fragmented with HCD using the lower NCE, and sent to the Orbitrap for MS2 detection. Among the MS2 fragments, the top ten ions in the m/z range of 700-2000 are co-selected using the SPS technique in the ion trap, sent back to the IRM for HCD fragmentation with the higher NCE, and then detected in the Orbitrap.
Because the Orbitrap is used for both MS2 and MS3 detection, the operation of Glyco-SPS-MS3 cannot be parallelized. This results in a longer cycle time. We nevertheless deem it worthwhile because of the gained advantages in both identification quality and quantitation accuracy.
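The notch-selection rule described above (top ten MS2 fragments within m/z 700-2000) can be illustrated in software. The sketch below is not instrument control code and is not part of SugarQuant; the function name and the peak representation as (m/z, intensity) tuples are assumptions made for illustration.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def select_sps_notches(
    ms2_peaks: List[Peak],
    mz_min: float = 700.0,
    mz_max: float = 2000.0,
    n_notches: int = 10,
) -> List[Peak]:
    """Pick the most intense MS2 fragments inside the m/z window.

    Peaks outside the window (e.g. low-mass oxonium ions) are ignored,
    so they can never occupy a notch.
    """
    in_window = [p for p in ms2_peaks if mz_min <= p[0] <= mz_max]
    top = sorted(in_window, key=lambda p: p[1], reverse=True)[:n_notches]
    return sorted(top)  # report the notches in ascending m/z order


if __name__ == "__main__":
    spectrum = [
        (204.09, 1.0e6),   # oxonium ion, below the window
        (800.40, 5.0e4),
        (1500.70, 2.0e5),
        (2100.00, 9.0e5),  # above the window
    ]
    for mz, intensity in select_sps_notches(spectrum):
        print(f"{mz:.2f}\t{intensity:.3g}")
```

Restricting the window before ranking by intensity is what keeps an intense oxonium ion from displacing a Y ion among the notches.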

Collision Energy
We used 12 TMT-labelled IgM glycopeptides to evaluate the fragmentation patterns under various NCEs. The results suggested the need to apply a wider range of NCEs to cover all desired fragment ions (Fig. 2a and Supp. Fig. 3).
Stepped NCE helped to generate more glycan Y ions and peptide b-/y-ions, but the resulting reporter ions still had low intensities (Fig. 2c). Possibly the more labile glycosidic bonds broke first upon collision with gas molecules and absorbed most of the energy, which limited the generation of reporter ions 13. We thus reasoned that a second-stage fragmentation of the Y ions bearing smaller glycans would result in higher reporter signals, similar to the previous MS3 methods for better detection of peptide b-/y-ions. Instead of choosing only the Y1 ions, we employed the SPS technique to simultaneously select and fragment multiple Y ions for improved sensitivity. We fragmented the TMT-labelled glycopeptide precursor with NCE 25-30, which favoured the production of glycan Y ions, and selected only the MS2 fragments in the m/z region of 700-2000 for MS3 to exclude intense oxonium ions. We evaluated multiple NCE combinations in Glyco-SPS-MS3 (Supp. Fig. 10) and found that NCE 25 at MS2 plus NCE 35-40 at MS3 indeed resulted in better glycopeptide identification. The resulting total score, peptide score, and glycan score obtained from Glyco-SPS-MS3 were all higher than those from a standard MS2 method.

Ion target and injection time
pGlyco 2.0 considers both glycan and peptide fragments in its scoring algorithm with FDR control, so it demands high spectrum quality to identify a glycopeptide. Consistent with previous reports 24,26, our results showed that higher AGC targets helped to identify more glycopeptides with either the pGlyco or the Byonic search engine (Supp. Fig. 8). We reasoned that a higher AGC target benefited the identification of glycopeptides with low abundance and/or low ionization efficiency. However, it required a longer ion accumulation time (or ion injection time, IT) to reach the desired ion amounts, resulting in a prolonged cycle time during MS/MS analysis. We used an IT of 100 ms to analyze IgM digests and found that only 81%, 70%, and 5% of the spectra reached the pre-defined AGC targets of 5e4, 1e5, and 5e5, respectively. Prolonging the IT to 250 and 500 ms allowed 29% and 68% of the spectra, respectively, to reach the AGC target of 5e5. However, using 500 ms IT for both MS2 and MS3 in our Glyco-SPS-MS3 would make the duty cycle too long to couple with LC separation. We thus decided to use a maximum total IT of 500 ms per precursor and tested different ways of dividing it between the MS2 and MS3 scans. Interestingly, longer IT in MS3, or shorter IT in MS2, led to an increased peptide score along with a slightly decreased glycan score. This observation agrees with our suggestion that low-NCE MS2 provides more glycan Y ions, while high-NCE MS3 generates more peptide b-/y-ions. Of note, we did not observe any loss of mass accuracy caused by space-charge effects 32 with these settings.
We compared the required cycle time of our Glyco-SPS-MS3 with previously published acquisition methods, including the HCD-triggered AI-ETD, ETD and EThcD methods of Riley et al. 25. The MS1 acquisition parameters were essentially the same: a resolution of 120,000 at 200 m/z, a maximum injection time of 50 ms, and an automatic gain control target of 400,000 (Riley et al.) or 500,000 (SugarQuant). The cycle time was automatically controlled by the instrument and set to 3 s. Therefore, the main factor that limits acquisition speed and affects the overall cycle time is the total maximum injection time allowed for the MS2/MS3 or HCD-triggered (AI)ET(hc)D scan pairs. Riley et al. allocated 460 ms for an HCD/triggered (AI)ET(hc)D scan pair, including 60 ms for a survey HCD scan and 400 ms for a triggered (AI)ET(hc)D scan. In our suggested Glyco-SPS-MS3 settings, we allocated 500 ms in total for an MS2/SPS-MS3 scan pair, with 150 ms for the MS2 scan and 350 ms for the SPS-MS3 scan. This resulted in an increase of 40 ms (roughly 8-9%) in the total time required for an MS2/SPS-MS3 scan pair compared with the triggered ETD methods. The ion reaction time required for (AI-)ET(hc)D, however, is often longer than that for HCD, which makes the difference in overall cycle time between the methods negligible. Importantly, SugarQuant utilizes both MS2 and SPS-MS3 scans and thus brings additional advantages for reliable glycopeptide identification as well as accurate quantification.
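The injection-time comparison above is simple arithmetic and can be reproduced as follows. This is a minimal sketch using only the values quoted in the text; the function name is illustrative, and ion reaction times are deliberately not modelled.

```python
def scan_pair_budget_ms(first_scan_ms: float, second_scan_ms: float) -> float:
    """Total maximum injection time allowed for one scan pair."""
    return first_scan_ms + second_scan_ms


# HCD survey scan + triggered (AI)ET(hc)D scan, as quoted for Riley et al.
triggered_etd_ms = scan_pair_budget_ms(60, 400)
# MS2 scan + SPS-MS3 scan in the suggested Glyco-SPS-MS3 settings.
glyco_sps_ms3_ms = scan_pair_budget_ms(150, 350)

extra_ms = glyco_sps_ms3_ms - triggered_etd_ms
relative_increase = extra_ms / triggered_etd_ms

print(f"triggered ETD pair : {triggered_etd_ms:.0f} ms")
print(f"Glyco-SPS-MS3 pair : {glyco_sps_ms3_ms:.0f} ms")
print(f"difference         : {extra_ms:.0f} ms ({relative_increase:.1%})")
```

The difference is 40 ms per scan pair. Expressed relative to the 460 ms budget this is just under 9%, and relative to the 500 ms budget it is 8%.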

Supplementary note 4 Development and usage of GlycoBinder
GlycoBinder is written in R and is available on GitHub (https://github.com/IvanSilbern/GlycoBinder). It allows streamlined processing of multiplexed quantitative glycopeptide mass spectrometry data. It relies on external tools (see below) that are not distributed with the script; users need to request and install these tools themselves before working with GlycoBinder. To our knowledge, all tools are freely available upon request.

Setting up the processing environment
GlycoBinder was developed and tested on 64-bit machines running Windows 10. It requires R (version 3.5.0 or above) to be installed on your machine, including the data.table, dplyr, future.apply, and stringr packages. If those packages are not installed, GlycoBinder will attempt to install them. Note the location of the "Rscript.exe" file, which is needed to run R scripts from the command line (commonly in C:/Program Files/R/R-3.5.0/bin/x64/).
External tools should be installed and added to the system path of the machine. This allows a program to be called without specifying its exact path. To do so, open the Windows menu and search for "Edit environment variables for your account". Under "User Variables" select "PATH" and click the "Edit" button (make sure you are changing the "PATH" variable for the user account you will later be working with). Select "New" and then "Browse". Navigate to the directory where the executable of the tool is located (e.g. "C:\Program Files\RawTools-2.0.2\"). Repeat the same procedure for all tools. We also suggest adding the path to the folder containing the "Rscript.exe" file.
After the environment variables are configured, check that the programs can be accessed from the command line directly. For this, open the command line and type one by one: Rscript, rawtools, msconvert, pparse, pglyco. Hit Enter after each command. Make sure that the system can find each tool and returns help information to the console. A tutorial on how to configure the environment variables can be found here: https://github.com/kevinkovalchik/RawTools/wiki/Download-and-prepare-RawTools-for-Windows
Depending on the number of raw files and their size, GlycoBinder might require a large amount of RAM to process the data. By default, it will use (the number of available processors - 2) threads on your machine for processing the data (this number might vary for external tools). We recommend reserving at least 1 GB of free RAM per running process (e.g. for a machine with 8 cores, one should aim for at least 6 GB of free RAM). If you would like to restrict the number of processors used by GlycoBinder, please consult the following section regarding additional parameters to the script.
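The checks described above can also be scripted. The following Python sketch is not part of GlycoBinder; it assumes the tools are callable under the names listed in the text and uses the standard-library function shutil.which to look them up on the PATH. The helper names, and the lower bound of one worker on machines with very few processors, are assumptions made for illustration.

```python
import os
import shutil
from typing import Callable, Iterable, List, Optional

# Command names as listed in the text.
REQUIRED_TOOLS = ["Rscript", "rawtools", "msconvert", "pparse", "pglyco"]


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


def default_workers(n_processors: int) -> int:
    """Number of processors minus two (never below one: an assumption)."""
    return max(1, n_processors - 2)


def recommended_free_ram_gb(n_workers: int, gb_per_worker: float = 1.0) -> float:
    """At least 1 GB of free RAM per running process."""
    return n_workers * gb_per_worker


if __name__ == "__main__":
    absent = missing_tools(REQUIRED_TOOLS)
    print("missing tools:", ", ".join(absent) if absent else "none")
    workers = default_workers(os.cpu_count() or 1)
    print(f"workers: {workers}, free RAM to aim for: "
          f"{recommended_free_ram_gb(workers):g} GB")
```

For a machine with 8 processors this yields 6 workers and 6 GB, matching the example in the text.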

Processing steps
GlycoBinder is designed for processing .raw files acquired on Thermo Fisher Orbitrap Tribrid instruments. It combines MS2 scans with their dependent MS3 scans. It also extracts reporter ion intensities for subsequent quantification.
In brief, GlycoBinder processes the .raw files using the following steps:
1. RawTools extracts reporter ion intensities from Thermo .raw files and assigns MS3 scans to their parent MS2 scans.
2. msconvert converts .raw files into the .mgf file format and centroids the data by applying a vendor-specific peak-picking algorithm. Both MS2 and MS3 scans are preserved in the .mgf file.
3. pParse recalibrates the monoisotopic peaks of precursors and outputs an .mgf file containing MS2 scans.
4. GlycoBinder combines ion intensities of matching MS2 and MS3 spectra as reported by RawTools. MS2 and MS3 spectra are extracted from the msconvert-produced .mgf file and merged based on the specified mass tolerance window. GlycoBinder replaces the MS2 spectra in the pParse output with the combined MS2/MS3 spectra. The modified pParse output file is used as input for pGlyco 2.
5. pGlyco 2 uses the combined MS2/MS3 spectra to search for peptides and associated glycans. After the first pGlyco 2 search is finished, results are filtered based on a specified FDR cutoff.
6. Optionally, a second pGlyco 2 search is performed with a smaller protein database. For this, only glycoproteins that were identified in the first pGlyco 2 search and passed the FDR threshold are retained in the protein sequence database and used for the second round of glycopeptide search.
7. GlycoBinder combines the resulting GPSMs with the corresponding reporter ion intensities extracted from the same spectra by RawTools. Based on the combined pGlyco 2 and RawTools output, GlycoBinder organizes quantitative results at four levels: glycosylated peptides, glycoforms, glycosites, and glycans. A separate data table is reported for each level; it contains unique identifiers of the data entries, cross-references to the other levels, quantification information in the form of summed reporter ion intensities, and the necessary metadata. Because the quantitative information is structured at each level, changes at the level of glycopeptides, glycoforms, glycosites or glycans can be addressed directly.
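The database reduction in step 6 can be pictured with a short Python sketch. It is not GlycoBinder's actual R implementation: the function names are hypothetical, and it assumes that a protein is identified by the first whitespace-delimited token of its FASTA header.

```python
from typing import Iterable, List, Tuple


def read_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs, preserving order."""
    entries: List[Tuple[str, str]] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries


def entry_id(header: str) -> str:
    """First token of the header (assumed to be the protein identifier)."""
    parts = header.split()
    return parts[0] if parts else ""


def reduce_database(fasta_text: str, identified_ids: Iterable[str]) -> str:
    """Keep only proteins identified in the first search round."""
    keep = set(identified_ids)
    kept = [(h, s) for h, s in read_fasta(fasta_text) if entry_id(h) in keep]
    return "".join(f">{h}\n{s}\n" for h, s in kept)


if __name__ == "__main__":
    database = ">P1 example protein\nMKNAS\nTT\n>P2\nAAAA\n>P3 other\nNGTK\n"
    print(reduce_database(database, ["P1", "P3"]), end="")
```

Searching the reduced database in the second round shrinks the search space to proteins that already have glycopeptide evidence.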

Use of GlycoBinder
To execute GlycoBinder, follow these steps:
1. Prepare a working directory containing the .raw files to be processed and a .fasta file containing the amino acid sequences of proteins.
2. Open the command line.
3. Specify the path to Rscript.exe (or just "Rscript.exe" if the file path is set in the environment variables).
4. Specify the path to the GlycoBinder.R file.
5. Specify the path to the working directory using the --wd flag.
6. Specify the peptide labelling reagent after the --reporter_ion flag (values supported by RawTools are allowed: TMT0, TMT2, TMT6, TMT10, TMT11, iTRAQ4, iTRAQ8), e.g. --reporter_ion TMT6.
7. Specify additional arguments (see below).
Supposing that the .raw files, the .fasta file, and the GlycoBinder.R script are located in C:/data, and peptides were labelled using TMT6plex reagents, the minimum required input would look like:
C:/data>Rscript.exe "GlycoBinder.R" --wd "C:/data" --reporter_ion TMT6
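When GlycoBinder is launched from another script, the same command line can be assembled programmatically. The Python sketch below only builds the argument list from the flags named above; the helper name is hypothetical, and the list of reporter ion values is copied from the text.

```python
from typing import List, Sequence

# Values listed in the text as supported by RawTools.
SUPPORTED_REPORTER_IONS = {
    "TMT0", "TMT2", "TMT6", "TMT10", "TMT11", "iTRAQ4", "iTRAQ8",
}


def build_glycobinder_command(
    working_dir: str,
    reporter_ion: str,
    script: str = "GlycoBinder.R",
    rscript: str = "Rscript.exe",
    extra_args: Sequence[str] = (),
) -> List[str]:
    """Assemble the minimal GlycoBinder call as an argument list."""
    if reporter_ion not in SUPPORTED_REPORTER_IONS:
        raise ValueError(f"unsupported reporter ion type: {reporter_ion}")
    return [rscript, script, "--wd", working_dir,
            "--reporter_ion", reporter_ion, *extra_args]


if __name__ == "__main__":
    command = build_glycobinder_command("C:/data", "TMT6")
    print(" ".join(command))
```

The resulting list can be handed to subprocess.run(command, cwd="C:/data", check=True), which avoids quoting problems with paths that contain spaces.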

Additional parameters to GlycoBinder
The following parameters modify the default GlycoBinder behavior if added as command-line arguments: --seq_wind_size specifies the number of amino acids taken on each side of the modification site. It is used to extract a sequence window around the modification site from the protein sequence. Sequence windows are needed to combine quantitative information at the glycoform level. The default value is 7, e.g. --seq_wind_size 7. In this case, the seven amino acids before the modified site and the seven amino acids after it are extracted, resulting in a sequence window of 15 amino acids.
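The sequence-window extraction can be sketched in Python as follows. This is an illustration, not GlycoBinder's code: the function name is hypothetical, and padding with "_" when the site lies close to a protein terminus is an assumption, since the text does not say how such cases are handled.

```python
def sequence_window(protein_seq: str, site: int, window: int = 7, pad: str = "_") -> str:
    """Return the residues around a modified site.

    `site` is the 1-based position of the modified residue. The result always
    has length 2 * window + 1; missing flanking residues are padded (assumption).
    """
    index = site - 1
    if not 0 <= index < len(protein_seq):
        raise ValueError("site lies outside the protein sequence")
    left = protein_seq[max(0, index - window):index]
    right = protein_seq[index + 1:index + 1 + window]
    return left.rjust(window, pad) + protein_seq[index] + right.ljust(window, pad)


if __name__ == "__main__":
    # The alphabet is used instead of a real protein to make positions obvious.
    toy_sequence = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    print(sequence_window(toy_sequence, 10))
    print(sequence_window(toy_sequence, 2))
```

With the default window of 7, every returned string is 15 characters long, so the same site receives the same key regardless of which peptide it was identified on.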

Default parameters for external tools
By default, the external tools are run with the parameters listed below. Most of these parameters cannot be changed through GlycoBinder. However, one can execute those tools outside of GlycoBinder using a different parameter set and then place the output files in the respective folder within the GlycoBinder working directory (specified after the --wd flag when running the script). In this case, GlycoBinder skips execution of the respective tool.

RawTools
rawtools -parse -d [input directory] -out [output directory] -q -r [reporter ions type] -R -u
RawTools outputs one _Matrix.txt file per .raw file. Output file names are created by appending _Matrix.txt to the .raw file name including its extension (example: "raw_file.raw" becomes "raw_file.raw_Matrix.txt"). RawTools output files are located in the ./rawtools_output folder within the specified working directory. One can process raw files externally and then copy the resulting _Matrix.txt files into the ./rawtools_output folder. If every .raw file has a corresponding _Matrix.txt file, GlycoBinder will skip RawTools processing.
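The skip rule for RawTools can be expressed compactly. The Python sketch below mirrors the naming convention and folder layout described above; the function names are hypothetical, and treating the ".raw" extension case-insensitively is an assumption.

```python
from pathlib import Path


def expected_matrix_name(raw_file_name: str) -> str:
    """'raw_file.raw' -> 'raw_file.raw_Matrix.txt' (extension is kept)."""
    return raw_file_name + "_Matrix.txt"


def rawtools_can_be_skipped(working_dir) -> bool:
    """True if every .raw file already has its _Matrix.txt in ./rawtools_output."""
    wd = Path(working_dir)
    raw_files = [p for p in wd.iterdir()
                 if p.is_file() and p.suffix.lower() == ".raw"]
    if not raw_files:
        return False  # nothing to process, so there is nothing to skip
    out_dir = wd / "rawtools_output"
    return all((out_dir / expected_matrix_name(p.name)).is_file()
               for p in raw_files)


if __name__ == "__main__":
    print(expected_matrix_name("raw_file.raw"))
    print(rawtools_can_be_skipped("."))
```

The same pattern, with different folder and file names, would apply to the pParse and pGlyco 2 outputs described below.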

pParse
pParse.exe -D [file] -O [output directory] -p, 0
pParse output files are located in the ./pparse_output folder and are named after the original .raw files, with the .raw file extension replaced by _[Type of Detector, e.g. CDFT or ITFT].mgf. Similarly, pParse processing is skipped if all output files are found within the ./pparse_output folder. After merging of the MS2 and MS3 spectra, the MS2 spectra within the pParse output files are replaced by the combined MS2/MS3 spectra. The modified pParse output files are renamed to [base_raw_file_name]_pParse_mod.mgf and saved in the same ./pparse_output folder. If all _pParse_mod.mgf files are found in the ./pparse_output folder, both pParse processing and merging of the MS2 and MS3 spectra are skipped.

pGlycodb.exe [pglyco configuration file] && pGlycoFDR.exe -p [pglyco configuration file] -r [output file name] && pGlycoProInfer.exe
The pGlyco 2 workflow consists of three programs, pGlycodb.exe, pGlycoFDR.exe, and pGlycoProInfer.exe, which are executed one after another and rely on a configuration file that has to be created before the first program is called. If GlycoBinder does not find a file named pGlyco_task.pglyco in the working directory, it will create a configuration file with default parameters. One can create one's own configuration file, e.g. using the graphical user interface of pGlyco 2, name it pGlyco_task.pglyco and then copy it to the working directory of GlycoBinder. In this case, pGlyco 2 will use the existing parameter file for the glycopeptide search. The following parameters are used by default and can be changed by supplying a GUI-created pGlyco_task.pglyco file to the GlycoBinder working directory: When using a protease other than trypsin, it is important to ensure that a) the protease is configured in the "./pGlyco/2.2.1/bin/enzyme.ini" file and b) the pGlyco 2 configuration file is located in the working directory of GlycoBinder. Change "enzyme=Trypsin_KR-C" to "enzyme=[Name_of_enzyme]" in the configuration file and save it as "pGlyco_task.pglyco". A similar procedure applies when configuring the pGlyco 2 search with a different set of modifications. Other parameters in the configuration file will be overwritten irrespective of the origin of the configuration file. The same parameter file will be used in the second pGlyco 2 search.
The output file is pGlycoDB-GP-FDR-Pro.txt for the first pGlyco 2 search and pGlycoDB-GP-FDR-Pro2.txt for the second search, respectively. Both files are located in the ./pglyco_output folder. If the file pGlycoDB-GP-FDR-Pro.txt exists (or pGlycoDB-GP-FDR-Pro2.txt exists and --no_second_search flag was not used), GlycoBinder will skip the first (or first and second) pGlyco 2 search, respectively.

Special case: MS2 data
After processing with RawTools, files that were identified as not containing MS3 scans will not be subjected to msconvert processing. The MS2/MS3 spectra merging step is skipped as well. After pParse processing, original pParse output files are renamed to _pParse_mod.mgf files for consistency and used as input for pGlyco directly.

Merging of MS2/MS3 spectra
GlycoBinder combines MS2 and MS3 spectra based on the MS2/MS3 scan number pairs in the RawTools output files (MS2ScanNumber and MS3ScanNumber columns within the _Matrix.txt file). First, ions from each MS2/MS3 scan pair are roughly matched using a 1 Th tolerance window. The initially matched ions are then tested against the specified tolerance window (1 ppm by default; it can be changed with the --tol_unit and --match_tol arguments). If several ions match the same ion, the ions with the minimal absolute mass difference are considered the matching ion pair. Intensities of matched ions are summed. Remaining MS3 ions that do not have matching MS2 ions are simply added to the MS2 spectrum. The merged MS2/MS3 spectra are then written into the pParse .mgf file. GlycoBinder matches spectra in the pParse output file to the merged MS2/MS3 spectra based on the scan number. While the scan number is unique for merged MS2/MS3 spectra, several spectra in the pParse output can refer to the same scan number. For all of them, the spectrum will be replaced by the respective merged MS2/MS3 spectrum. Spectra that do not share a scan number with any merged MS2/MS3 spectrum are kept unchanged.
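The matching rules above can be sketched in Python. This is an illustrative reimplementation, not GlycoBinder's R code: peaks are assumed to be (m/z, intensity) tuples, the ppm window is computed relative to the MS2 m/z, merged peaks keep the MS2 m/z, and an MS3 ion that loses its match to a closer ion is appended as unmatched. All of these details, and the function name, are assumptions.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def merge_spectra(
    ms2: List[Peak],
    ms3: List[Peak],
    tol: float = 1.0,
    unit: str = "ppm",
    rough_tol_th: float = 1.0,
) -> List[Peak]:
    """Merge an MS3 spectrum into its parent MS2 spectrum."""
    candidates = []
    for i, (mz2, _) in enumerate(ms2):
        for j, (mz3, _) in enumerate(ms3):
            diff = abs(mz2 - mz3)
            if diff > rough_tol_th:          # rough 1 Th pre-filter
                continue
            limit = tol * mz2 * 1e-6 if unit == "ppm" else tol
            if diff <= limit:                # fine tolerance window
                candidates.append((diff, i, j))

    merged = [[mz, intensity] for mz, intensity in ms2]
    used_ms2, used_ms3 = set(), set()
    for diff, i, j in sorted(candidates):    # closest pairs are assigned first
        if i in used_ms2 or j in used_ms3:
            continue
        merged[i][1] += ms3[j][1]            # sum intensities of matched ions
        used_ms2.add(i)
        used_ms3.add(j)

    # MS3 ions without a matching MS2 ion are simply added to the spectrum.
    merged.extend([mz, inten] for j, (mz, inten) in enumerate(ms3)
                  if j not in used_ms3)
    return sorted((mz, inten) for mz, inten in merged)


if __name__ == "__main__":
    ms2_scan = [(500.0, 10.0), (1000.0, 20.0)]
    ms3_scan = [(126.1277, 100.0), (500.0004, 5.0), (1000.01, 7.0)]
    for mz, intensity in merge_spectra(ms2_scan, ms3_scan):
        print(f"{mz:.4f}\t{intensity:g}")
```

The nested loop compares every MS2 peak with every MS3 peak, which is acceptable for a sketch; sorted peak lists would allow a linear-time sweep if speed mattered.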

GlycoBinder output
GlycoBinder stores the output of the external tools in separate folders: rawtools_output, msconvert_output, pparse_output, and pglyco_output. After processing, the files in the rawtools_output, msconvert_output, and pparse_output folders can be removed. However, keeping them allows for faster data re-processing, as GlycoBinder can skip certain processing steps (e.g. .mgf generation by msconvert) if the output files generated by the external tool are already present. In contrast, the pglyco_output folder contains not only the pGlyco 2 output files from the first and the (optional) second glycopeptide search (pGlycoDB-GP-FDR-Pro.txt and pGlycoDB-GP-FDR-Pro2.txt, respectively) but also the result tables created by GlycoBinder:

pglyco_quant_results.txt
The table combines the pGlyco 2 output (pGlycoDB-GP-FDR-Pro.txt or pGlycoDB-GP-FDR-Pro2.txt) with the RawTools output files (_Matrix.txt files). Quantitative information from the RawTools output is merged with the pGlyco 2 output file based on the .raw file name and the MS2 scan number. Each row represents a spectrum identified by pGlyco 2 together with the reporter ion intensities extracted by RawTools. Column names from pGlyco 2 and RawTools are preserved; their descriptions can be found in the documentation of pGlyco 2 and RawTools.
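The merge by .raw file name and MS2 scan number corresponds to a keyed join. Below is a minimal Python sketch using plain dictionaries as table rows. MS2ScanNumber and Scan are column names mentioned in this note; the RawName column and the reporter_126 column are hypothetical names chosen for illustration, as is the function name.

```python
def attach_reporter_intensities(identifications, quant_by_file):
    """Join identification rows with quantification rows.

    identifications: list of dicts with 'RawName' and 'Scan' keys.
    quant_by_file: {raw file name: list of dicts with 'MS2ScanNumber'
                    and reporter ion intensity columns}.
    """
    # Index the quantification rows by (raw file name, MS2 scan number).
    index = {}
    for raw_name, rows in quant_by_file.items():
        for row in rows:
            index[(raw_name, int(row["MS2ScanNumber"]))] = row

    merged = []
    for ident in identifications:
        quant = index.get((ident["RawName"], int(ident["Scan"])), {})
        merged.append({**ident, **quant})  # column names are preserved
    return merged


quant_results = attach_reporter_intensities(
    [{"RawName": "IgM_TMT0", "Scan": 1500}],
    {"IgM_TMT0": [{"MS2ScanNumber": 1500, "reporter_126": 2.5e4}]},
)
```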

pGlyco_Scans.txt
This table is the pglyco_quant_results.txt table filtered according to the total FDR cutoff (less than 2% FDR by default; the cutoff can be changed with the --pglyco_fdr_threshold parameter). An id column is added for cross-referencing with the following tables.
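The filtering step can be sketched in Python as shown below. The sketch assumes that the FDR is stored as a fraction in a column named TotalFDR and that ids are consecutive integers starting at 1; both are assumptions made for illustration and are not confirmed by this note.

```python
def filter_scans(rows, fdr_threshold=0.02, fdr_column="TotalFDR"):
    """Keep rows strictly below the FDR cutoff and add an 'id' column."""
    kept = [dict(row) for row in rows if float(row[fdr_column]) < fdr_threshold]
    for new_id, row in enumerate(kept, start=1):
        row["id"] = new_id
    return kept


scans = filter_scans([
    {"Scan": 10, "TotalFDR": 0.001},
    {"Scan": 11, "TotalFDR": 0.02},   # not strictly below the cutoff
    {"Scan": 12, "TotalFDR": 0.5},
])
```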

pGlyco_modified_peptides.txt
The table is based on pGlyco_Scans.txt. Each row contains information about a modified peptide (glycopeptide), i.e. a peptide with a specific glycan composition. Scans belonging to the same modified peptide are combined and their reporter ion intensities are summed. Additional variable modifications of the peptide are not taken into account. Likewise, reporter ion intensities are combined if the glycopeptide is identified in different .raw files. Glycopeptides carrying a missed cleavage site are considered individual glycopeptides and are not merged with their fully cleaved counterparts. Precursor information from different scans is concatenated using the default pGlyco 2 separator ("/"). The pGlyco_ids column refers to the id column in the pGlyco_Scans.txt table and can be used to identify the original scans that contributed to a particular glycopeptide. The Leading_Protein and Leading_ProSite columns report a representative protein from the protein group and the corresponding glycosylated site, based on the selection criteria discussed below.
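The aggregation of scans into modified peptides is a group-and-sum operation. The Python sketch below assumes that the peptide sequence is stored in a column named Peptide and that the scan-level rows carry an id column; the reporter column name and the example values are hypothetical. The scan ids are kept as a list here rather than as a concatenated string.

```python
from collections import defaultdict


def combine_modified_peptides(scans, reporter_columns):
    """Sum reporter intensities over scans sharing peptide and glycan."""
    groups = defaultdict(list)
    for scan in scans:
        # Other variable modifications are ignored by grouping on the
        # bare peptide sequence plus glycan composition only.
        groups[(scan["Peptide"], scan["Glycan(H,N,A,G,F)"])].append(scan)

    table = []
    for new_id, ((peptide, glycan), members) in enumerate(groups.items(), start=1):
        row = {
            "id": new_id,
            "Peptide": peptide,
            "Glycan(H,N,A,G,F)": glycan,
            "pGlyco_ids": [m["id"] for m in members],
        }
        for col in reporter_columns:
            row[col] = sum(m[col] for m in members)
        table.append(row)
    return table


modified_peptides = combine_modified_peptides(
    [
        {"id": 1, "Peptide": "PEPNGTK", "Glycan(H,N,A,G,F)": "5 4 2 0 1", "reporter_126": 100.0},
        {"id": 2, "Peptide": "PEPNGTK", "Glycan(H,N,A,G,F)": "5 4 2 0 1", "reporter_126": 50.0},
        {"id": 3, "Peptide": "PEPNGTK", "Glycan(H,N,A,G,F)": "5 4 1 0 1", "reporter_126": 7.0},
    ],
    reporter_columns=["reporter_126"],
)
```

The same pattern, with a different grouping key, applies to the glycoform, glycosite, and glycan tables described in the following sections.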

pGlyco_glycoforms.txt
The table is based on pGlyco_modified_peptides.txt. Each row contains information about a glycoform, i.e. a glycan attached to a particular site on the protein sequence. The amino acids surrounding the glycosylated site form a sequence window. The sequence window in combination with the glycan composition is used to distinguish different glycoforms. Sequence windows are first extracted from the amino acid sequences of the corresponding proteins. By default, +/-7 amino acids are extracted around the modification site (the number can be changed with the --seq_wind_size parameter). Glycopeptides that share the same modification site are grouped together and form a peptide group. Sequence windows are extracted from the proteins that contain those peptides and ranked by the number of peptides each sequence window can explain. Ties are broken by protein ranking (see description below). Peptides shared among several sequence windows are assigned to the sequence window that encompasses the majority of the peptides within the peptide group. If there are peptides that cannot be explained by the leading sequence window, those peptides are distributed among the other sequence windows accordingly. Intensities of glycopeptides sharing the same sequence window (seq_win) and glycan composition (Glycan(H,N,A,G,F)) are summed. Descriptive information is concatenated using ";" as a separator. The modpept_ids column refers to the id column in the pGlyco_modified_peptides.txt table. It contains the ids of the glycopeptides that contributed to a particular glycoform.
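Extraction of a sequence window can be illustrated with a few lines of Python. The sketch assumes 1-based site positions and truncates the window at the protein termini without padding; how GlycoBinder handles sites close to the termini is not stated in this note, so this behaviour is an assumption. The function name and the example sequence are illustrative.

```python
def sequence_window(protein_sequence, site, flank=7):
    """Return the residues within +/- flank positions of a 1-based site."""
    start = max(0, site - 1 - flank)
    end = min(len(protein_sequence), site + flank)
    return protein_sequence[start:end]


# Toy sequence: the alphabet makes the positions easy to follow.
toy_sequence = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
central_window = sequence_window(toy_sequence, site=10)   # site residue: J
terminal_window = sequence_window(toy_sequence, site=2)   # truncated window
```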

pGlyco_glycosites.txt
The table is based on the pGlyco_modified_peptides.txt table. Each row contains information about a glycosite, i.e. a glycosylated site on a protein sequence irrespective of the particular glycan composition. Since the assignment of peptides to proteins can be ambiguous, sequence windows are used to define the glycosite in practice. Accordingly, the table contains a seq_win column with the sequence window information and a modpept_id column that refers to the id column in pGlyco_modified_peptides.txt. Intensities of glycopeptides sharing the same sequence window are summed. Descriptive information is concatenated using ";" as a separator. Leading_Protein and Leading_ProSite are selected according to protein rank. Proteins are ranked based on the number of unique peptides (highest priority), the number of all peptides, the number of glycoforms assigned to the protein, whether it is a Swiss-Prot entry, and whether it is a canonical sequence or an isoform (lowest priority). Proteins that have greater numbers of unique peptides, total peptides, or glycoforms, that are annotated in Swiss-Prot, and that represent a canonical sequence receive a higher rank. The highest rank is 1. The rank is unique; ties, if they occur, are broken by alphabetical order.
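The ranking criteria translate directly into a multi-level sort key. The following Python sketch uses hypothetical field names and made-up example entries; it also assumes that the alphabetical tie-break is applied to the protein accession, which is not specified above.

```python
def rank_proteins(proteins):
    """Return {accession: rank}, where rank 1 is the best protein.

    Each protein is a dict with the keys 'accession', 'unique_peptides',
    'peptides', 'glycoforms', 'swissprot' (bool) and 'canonical' (bool).
    """
    ordered = sorted(
        proteins,
        key=lambda p: (
            -p["unique_peptides"],   # highest priority
            -p["peptides"],
            -p["glycoforms"],
            not p["swissprot"],      # Swiss-Prot entries sort first
            not p["canonical"],      # canonical sequences before isoforms
            p["accession"],          # alphabetical tie-break
        ),
    )
    return {p["accession"]: rank for rank, p in enumerate(ordered, start=1)}


# Made-up entries for illustration only.
ranks = rank_proteins([
    {"accession": "B_ISO", "unique_peptides": 2, "peptides": 5,
     "glycoforms": 3, "swissprot": True, "canonical": False},
    {"accession": "A_CAN", "unique_peptides": 2, "peptides": 5,
     "glycoforms": 3, "swissprot": True, "canonical": True},
    {"accession": "C_TOP", "unique_peptides": 4, "peptides": 4,
     "glycoforms": 1, "swissprot": False, "canonical": True},
])
```

In this example, C_TOP receives rank 1 because the number of unique peptides has the highest priority, and A_CAN outranks B_ISO because it is the canonical sequence.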

pGlyco_glycans.txt
The table is based on the pGlyco_modified_peptides.txt table. Each row contains information about a unique glycan composition identified in the data set. The peptide sequence is not taken into account. The intensity information is combined based on the glycan composition only (Glycan(H,N,A,G,F) column). The modpept_id column refers to the id column in pGlyco_modified_peptides.txt and can be used to link the glycopeptides carrying a particular glycan composition. The columns pGlyco_ids, Scan, Leading_Protein, and Leading_ProSite are concatenations of the respective columns in pGlyco_modified_peptides.txt using ";" as a separator.

Special case: use of another search engine
Currently, pGlyco 2.0 is the only search engine supported by the GlycoBinder workflow. However, GlycoBinder reports the merged MS2/MS3 spectra in mgf format; these files are located in the ./pparse_output folder and marked with the "_mod.mgf" suffix. They can be used with any other search engine compatible with the mgf format. The output of that search engine then has to be integrated manually with the quantitative data from the RawTools output ("_Matrix.txt" files in the ./rawtools_output folder). Raw file names and scan numbers can be used to link the identifications to the quantitative information.

Demonstration data set
As a test data set, we provide the file IgM_TMT0.raw. It is a tryptic digest of a purified IgM sample labeled with the TMT0 reagent. The file is located in the "demo" folder together with a Human_IgM.FASTA file containing the amino acid sequences of two human proteins, IgM and IgJ. To test the performance of GlycoBinder, download the contents of the "demo" folder (e.g. into C:/data/Glycobinder/demo), copy the current version of GlycoBinder into it and execute it on the command line using the following parameters:
C:/data/Glycobinder/demo>Rscript.exe "GlycoBinder.R" --wd "C:/data/Glycobinder/demo" --reporter_ion TMT0 --no_second_search
If you download the files from GitHub using git bash, please first install git lfs (https://git-lfs.github.com/), which handles large files (e.g. the example raw file). If you use the web interface for downloading ("download ZIP"), you will only obtain a placeholder for the IgM_TMT0.raw file. To download the actual file, find it in the GitHub repository and click on "View raw". Save the file in your local "demo" folder.
The execution takes around 5 min on a desktop computer running Windows 10 and equipped with an Intel Core i7-6700 CPU (64 bit) and 32 GB of RAM. Note that the execution time increases with the complexity of the data set provided.