Analysis of DIA proteomics data using MSFragger-DIA and FragPipe computational platform

Liquid chromatography (LC) coupled with data-independent acquisition (DIA) mass spectrometry (MS) has been increasingly used in quantitative proteomics studies. Here, we present a fast and sensitive approach for direct peptide identification from DIA data, MSFragger-DIA, which leverages the unmatched speed of the fragment ion indexing-based search engine MSFragger. Different from most existing methods, MSFragger-DIA conducts a database search of the DIA tandem mass (MS/MS) spectra prior to spectral feature detection and peak tracing across the LC dimension. To streamline the analysis of DIA data and enable easy reproducibility, we integrate MSFragger-DIA into the FragPipe computational platform for seamless support of peptide identification and spectral library building from DIA, data-dependent acquisition (DDA), or both data types combined. We compare MSFragger-DIA with other DIA tools, such as DIA-Umpire based workflow in FragPipe, Spectronaut, DIA-NN library-free, and MaxDIA. We demonstrate the fast, sensitive, and accurate performance of MSFragger-DIA across a variety of sample types and data acquisition schemes, including single-cell proteomics, phosphoproteomics, and large-scale tumor proteome profiling studies.

1. There is a lack of comprehensive comparison with the current state-of-the-art software solutions. For comparison, it is necessary to show coverage, reproducibility, quantification accuracy, and computation time. It is also necessary to show the consistency among results, i.e., how many proteins are shared, how many are gained, and how many are lost. However, in the current work, the authors show only one or two aspects per dataset, compared with one or two other software solutions, which is not comprehensive enough.
2. For speed comparison, MSFragger-DIA starts from mzML while the others start from raw files, which is not a fair comparison.
3. The authors name MSFragger-DIA as a one-stop analysis pipeline, but it needs DIA-NN for quantification, R packages for normalization, and ProteoWizard for data conversion and demultiplexing, which is not one-stop at all.
4. Comparison with Spectronaut: the authors compared to different versions of Spectronaut, from version 13 to version 15. These versions are quite old. The newest version, Spectronaut 16 with directDIA2.0, performs much better than these versions. The current comparison is misleading.

5. The current work doesn't show any benchmarking on timsTOF Pro data, which is nowadays a main platform for proteomics research and should be supported by the software.

6. Comparison with MaxDIA: the manuscript mentions that on one dataset MaxDIA shows low numbers of quantified peptides, and on the other dataset MaxDIA shows a very high rate of missing quantification values. Is this common across different datasets, or is it specific to these two datasets? What could be the reason?
7. Comparison with DIA-NN: we tested DIA-NN and MSFragger-DIA on the ccRCC dataset, which is used for generating Figure 6. Without missing value filtering, DIA-NN with an in silico library identified 9585 unique genes while FP-MSF identified 8317 unique genes in our results. Even after 50% missing value filtering, DIA-NN with an in silico library identified 7700 protein groups, 10% more than shown in Figure 6. In our results, the missing values by DIA-NN and FP-MSF are comparable. Considering computation time, including that for data conversion, FP-MSF is faster than DIA-NN with an in silico library, but not significantly so when analyzing all 187 files. Our results are not consistent with those in the paper.
8. FDR evaluation: it is suggested to test the FDR using the cohort data (ccRCC) and using entrapment sequences, e.g., E. coli, yeast, A. thaliana. The current benchmarking is not enough. The manuscript mentions that "Replacing the experimental fragment peaks with the predicted peaks resulted in a lower actual FDR (Supplementary Table S2)." What is the reason? Normally, an experimental library is more specific to the DIA data and should perform better.
9. Single-cell proteomics: the current protein identification number is lower than many state-of-the-art reports (https://doi.org/10.1101/2022.06.28.498038), which of course is related to the dataset. The authors should choose timsTOF Pro datasets for testing, where > 2000 or even 3000 proteins were reported for single-cell proteomics.
10. Phosphoproteomics: a synthetic phosphopeptide dataset (e.g., Nature Communications 2020, 11, 787) should be adopted to demonstrate the false localization rate of phosphate. For phosphoproteomics, the tool should be compared to Spectronaut instead of DIA-NN, as the latter is not yet optimized for phosphoproteomics.
11. Different methods were suggested for MSFragger-DIA: those with and without a DDA data library, and with an in silico library. The identification consistency and quantification accuracy of the different methods should be compared, and it should be discussed which one is recommended under what conditions.
Minor: gas phase fractionation (GFP) should be GPF.
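Comment 1 above asks how many proteins are shared, gained, and lost between tools. As a minimal sketch of that bookkeeping (not the authors' or any tool's actual implementation), the counts can be obtained with set operations. The function name and the protein identifiers below are hypothetical and serve only as an illustration.

```python
def compare_identifications(reference, other):
    """Count proteins shared, gained, and lost when moving from `reference` to `other`.

    Both arguments are iterables of protein identifiers (hypothetical here).
    """
    reference, other = set(reference), set(other)
    return {
        "shared": len(reference & other),  # identified by both tools
        "gained": len(other - reference),  # identified only by `other`
        "lost": len(reference - other),    # identified only by `reference`
    }


# Made-up identifiers for illustration; real input would be the protein
# lists reported by two tools on the same dataset at the same FDR.
tool_a = ["P1", "P2", "P3", "P4"]
tool_b = ["P2", "P3", "P5"]
result = compare_identifications(tool_a, tool_b)
print(result)
```

For more than two tools, the same pairwise counts are what an upset plot summarizes across all intersections.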
Reviewer #2 (Remarks to the Author):
The reviewer would like to answer the following questions:
What are the noteworthy results? A fast and sensitive platform for DIA identification and quantification.
Will the work be of significance to the field and related fields? How does it compare to the established literature? If the work is not original, please provide relevant references. Yes, the work is of significance. Compared to in silico library-based DIA-NN, MSFragger-DIA is faster and has similar or better DIA quantification results.
Does the work support the conclusions and claims, or is additional evidence needed? Yes, it does.
Are there any flaws in the data analysis, interpretation and conclusions? Do these prohibit publication or require revision? Yes, please see the comments below for minor revision.
Is the methodology sound? Does the work meet the expected standards in your field? Yes, it is sound and meets the expected standards.
Is there enough detail provided in the methods for the work to be reproduced? Yes, there is.
The authors described the algorithm of MSFragger-DIA, compared it to other tools (DIA-Umpire, Spectronaut, DIA-NN, MaxDIA), and demonstrated the advantages of MSFragger-DIA (faster than, and similar to or better than, DIA-NN). The way of building libraries and the identification pipeline make MSFragger-DIA different from other tools. If the authors follow the minor suggestions, the manuscript will be improved.
Here are the minor suggestions.
1. Figure 2 shows that FP-MSF and FP-MSF hybrid outperform DIA-NN and Spectronaut at 1% FDR. How about different FDR cutoffs (e.g., 0.5%, 5%)? More generally, for one DIA run, draw the ROC curve (x-axis: FDR, y-axis: number of precursors) for each tool, which will show the global comparison. These figures can be put in the supplementary materials.
2. Supplementary Figure 1a is labeled as library-based analysis and 1b as library-free analysis. This is not correct (both are library-based). The correct labels are: Supplementary Figure 1a is FP-MSF hybrid; 1b is FP-MSF.
3. In the main text, lines 226-227: MaxDIA gave unreasonably low numbers of quantified peptides in these data and thus was excluded from plotting (see Supplementary Figure 2a and 2c). Is this a bug in MaxDIA?
4. In Table S1, the gradient is 90 min for 2020-Yeast; in the main text, lines 213-216, it is 115 min for 2020-Yeast. These are not consistent.
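Suggestion 1 above asks for precursor counts at several FDR cutoffs. A minimal sketch of how such a curve could be tabulated is given here, assuming each tool reports one q-value per precursor. The function name and the q-values are hypothetical, and this is not how any of the compared tools computes its output.

```python
from bisect import bisect_right


def identifications_at_cutoffs(q_values, cutoffs):
    """Return {cutoff: number of precursors with q-value <= cutoff}."""
    sorted_q = sorted(q_values)
    # bisect_right gives the count of sorted values that are <= the cutoff
    return {cutoff: bisect_right(sorted_q, cutoff) for cutoff in cutoffs}


# Made-up precursor q-values for illustration only
q_values = [0.001, 0.004, 0.008, 0.012, 0.03, 0.06]
curve = identifications_at_cutoffs(q_values, [0.005, 0.01, 0.05])
print(curve)
```

Plotting these counts against the cutoffs for each tool gives the kind of global comparison the reviewer describes.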

Reviewer #1:
In this work, the group of Prof. Nesvizhskii reports MSFragger-DIA for DIA proteomic data analysis. MSFragger-DIA starts with a direct search of MS/MS spectra against the entire protein sequence database to generate a list of peptide candidates. Then, MSFragger-DIA traces peaks, extracts ion chromatograms, and detects features of all candidate peptides for each spectrum determined. Finally, MSFragger-DIA generates output files compatible with PeptideProphet and Percolator for rescoring and FDR estimation. In general, it proposes a qualification and quantification pipeline for DIA data analysis, supporting DIA analysis without DDA, phosphoproteomics, single-cell proteomics, etc. However, the current benchmarking doesn't demonstrate the superior performance of MSFragger-DIA compared to the state-of-the-art software solutions.

Response: [...] and S3. We have also included a recently released tool, Spectronaut 17, to ensure our benchmarking remains up to date. Our updated experiments encompass comprehensive analyses from various perspectives, including sensitivity, false discovery rate control, overlap percentages among results from all tools, precision, and accuracy. Additionally, we have integrated another new tool, the latest version of EncyclopeDIA, into the benchmarking presented in Figures 3 and S3. Further evaluations of quantification precision have been conducted as well.

It is also not clear what the main innovation point of the current work is. Is it just to provide another software solution, or are there key advances in the pipeline? Most concepts of the current work, including fragment ion indexing, have been proposed by the group or others in the field.

Response: There are two aspects to the innovations. Firstly, we proposed a method and a new tool, MSFragger-DIA, to directly search the DIA spectra to identify the peptides. As outlined in the introduction section, current methods necessitate either spectral libraries or chromatography feature detection, both of which are time-consuming and costly. Our study demonstrates that it is feasible to obtain good results by directly searching the DIA spectra without relying on experimental or in silico spectral libraries. Secondly, we have created a comprehensive software suite designed to streamline and ensure the full reproducibility of DIA data analysis. Our new FragPipe, in conjunction with MSFragger-DIA, has already enabled researchers to perform DIA data analysis easily. We have highlighted the innovations in the abstract (line 23-29), introduction (line 110-114), and discussion (line 466-474).

The specific comments:

1. There is a lack of comprehensive comparison with the current state-of-the-art software solutions. For comparison, it is necessary to show coverage, reproducibility, quantification accuracy, and computation time. It is also necessary to show the consistency among results, i.e., how many proteins are shared, how many are gained, and how many are lost. However, in the current work, the authors show only one or two aspects per dataset, compared with one or two other software solutions, which is not comprehensive enough.
Response: Thanks for pointing this out. We have added more comparisons and analyses to Figures 2 and 3. In the new analysis, there are an upset plot (Figure 2a) to demonstrate the overlap among all results, a box plot (Figure 2b) to evaluate the false positives and sensitivity, a violin plot (Figure 2c) to compare the quantification precision among all tools, and an LFQbench-style plot (Figure 2d) to benchmark the quantification accuracy. There is also additional supporting evidence in Figure S2 and Table S2. In Figure 3b, we have added the boxplot to show the quantification precision from the overlapping peptides and unique peptides. In the phosphoproteomics data analysis (Figure 5), we also use the upset plot, which is used to evaluate the overlap, to replace the bar plot. All those new analyses and figures demonstrate the excellent performance of our tools in terms of quantification sensitivity, precision, accuracy, and false discovery control. We have also added the description to line 172-237, line 260-271, line 504-507, line 518-537, line 540-558, line 570-578.

2. For speed comparison, MSFragger-DIA starts from mzML while the others start from raw files, which is not a fair comparison.
Response: We apologize for any confusion caused. We used the same mzML files for DIA-Umpire, EncyclopeDIA, DIA-NN, and MSFragger-DIA for fair comparison. DIA-NN does not support the raw format in Linux, while EncyclopeDIA lacks raw format support in all operating systems. We used the raw files for MaxDIA and Spectronaut as they exhibited better performance with raw files. The time required for format conversion was minimal, at approximately 7 minutes in total, compared to the runtime for the fastest tool, which is around 100 minutes (Figure 6). We have added the clarification to line 530-534, line 319-321, and line 351-353.

3. The authors name MSFragger-DIA as a one-stop analysis pipeline, but it needs DIA-NN for quantification, R packages for normalization, and ProteoWizard for data conversion and demultiplexing, which is not one-stop at all.
Response: As emphasized in the title and main text, the one-stop analysis pipeline encompasses MSFragger-DIA and FragPipe. Users are not required to install DIA-NN, as the quant module is already integrated into FragPipe. The R script serves to filter the results and calculate the MaxLFQ intensities, ensuring fair comparisons. In real-world applications, users do not need to execute any R scripts while utilizing our tools. Furthermore, ProteoWizard is not necessary for un-staggered data. We have added clarifications to line 527-537. We have also changed the title by removing "one-stop".

4. Comparison with Spectronaut: the authors compared to different versions of Spectronaut, from version 13 to version 15. These versions are quite old. The newest version, Spectronaut 16 with directDIA2.0, performs much better than these versions. The current comparison is misleading.
Response: We appreciate your suggestion. We have incorporated the results from the most recent Spectronaut 17, which contains directDIA+, into Figure 2, Figure 3, and the runtime comparison (line 356-357, line 504-507, and line 683-685). Unfortunately, we do not have access to Spectronaut for the other datasets, as it is commercial software and is not available to us.

5. The current work doesn't show any benchmarking on timsTOF Pro data, which is nowadays a main platform for proteomics research and should be supported by the software.