DeepPhospho: Accelerate DIA phosphoproteome profiling by Deep Learning

Phosphoproteomics integrating data-independent acquisition (DIA) has enabled deep phosphoproteome profiling with improved quantification reproducibility and accuracy compared to data-dependent acquisition (DDA)-based phosphoproteomics. DIA data mining heavily relies on a spectral library that in most cases is built on DDA analysis of the same sample. Construction of this project-specific DDA library impairs the analytical throughput, limits the proteome coverage, and increases the sample size for DIA phosphoproteomics. Herein we introduce a novel deep neural network, DeepPhospho, which conceptually differs from previous deep learning models to achieve accurate predictions of LC-MS/MS data for phosphopeptides. By leveraging in silico libraries generated by DeepPhospho, we established a new DIA workflow for phosphoproteome profiling which involves DIA data acquisition and data mining with DeepPhospho predicted libraries, thus circumventing the need of DDA library construction. Our DeepPhospho-empowered workflow substantially expanded the phosphoproteome coverage while maintaining high quantification performance, which led to the discovery of more signaling pathways and regulated kinases in an EGF signaling study than the DDA library-based approach. DeepPhospho is provided as a web server to facilitate user access to predictions and library generation.


Introduction
Protein phosphorylation is a widespread post-translational modification (PTM) that regulates essentially all cellular signaling networks 1. Mass spectrometry (MS)-based phosphoproteomics has become the method of choice for the genome-wide study of protein phosphorylation and dynamic cell signaling 2.
However, conventional phosphoproteomics based on data-dependent acquisition (DDA) often suffers from limited throughput and low reproducibility due to current MS sequencing speed and semi-stochastic sampling of DDA 3. With the advent of data-independent acquisition (DIA) to enable proteome profiling of large cohorts of samples with superior quantification accuracy and reproducibility 4,5, DIA-based phosphoproteomics has emerged as a powerful technology for cell signaling study 6, proteogenomic characterization of clinical cancer tissues 7 and anti-viral drug discovery 8. Importantly, a benchmark study by Olsen J et al. has demonstrated that DIA phosphoproteomics achieves a larger dynamic range, higher reproducibility of identification, and improved sensitivity and accuracy of quantification than DDA phosphoproteomics 3.
However, the current DIA phosphoproteomic workflow faces a significant limitation, which is the need for a high-quality spectral library to be constructed prior to data processing. In almost all reported DIA phosphoproteomic analyses, a project-specific DDA library was built through DDA analysis of extensively pre-fractionated or repeatedly injected samples 3,6-9. Although project-specific DDA libraries afford a higher proteome coverage (i.e. covering a larger number of protein and peptide identifications) than other experimental libraries, they are built at the expense of time, sample, and considerable efforts with pre-fractionation 10. Alternatively, the previous elegant study showed it is feasible to build a direct DIA library [...] representation as each amino acid was enriched by the features of other amino acids in the same peptide. However, the effective context encoded by the bi-LSTM network is often limited due to loss of information in its recurrent updates 16. To capture long-range dependency in the peptide sequences, we then introduced the second module, a Transformer network that refines the peptide representation generated from the first module. The Transformer network used multi-head self-attention to update the features of all amino acids in parallel, and enabled the model to directly attend to multiple sites of the peptide even if they were far apart. Finally, the network output a new representation for the input peptide, which was then fed into a linear regressor network to generate predictions for RT or ion intensities.
To tailor DeepPhospho specifically to phosphopeptide prediction, we introduced a set of extra tokens to represent different phosphorylated amino acids, and learned their embeddings jointly with the base peptides. In addition, for the task of fragment ion intensity prediction, we designed a modified training loss that enforced the structural constraints of the corresponding peptides. In particular, we ignored the loss terms on the phosphate moiety that cannot exist and filtered out the model predictions on those ions.
To the best of our knowledge, DeepPhospho is the first work to utilize the Transformer for the prediction of peptide fragmentation patterns, though it has been extensively used in natural language processing 17,18. To demonstrate the advantage of our model design, we conducted an ablation study to compare our model with the bi-LSTM or the Transformer alone, and with a combination of CNN and the Transformer, using two phosphoproteomic datasets (Supplementary Table 1). Interestingly, our hybrid model consistently outperformed those alternative baselines, indicating that DeepPhospho is able to learn a better feature representation for phosphopeptides, and that the bi-LSTM and the Transformer are complementary in learning the peptide representation (Supplementary Figs. 1b-1d).
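The masking idea in the paragraph above can be sketched as follows. This is a minimal illustration under stated assumptions, not the DeepPhospho implementation: it assumes the "ions that cannot exist" are phosphate neutral-loss variants of b/y fragments that contain no phosphorylated residue, it uses hypothetical tokens `pS`, `pT` and `pY` for phosphorylated residues, and it substitutes a plain mean squared error for whatever loss the model actually uses.

```python
# Hypothetical tokens for phosphorylated residues (an assumption of this sketch).
PHOSPHO_TOKENS = {"pS", "pT", "pY"}


def neutral_loss_masks(tokens):
    """Mark which b and y fragments could carry a phosphate neutral loss.

    Returns (b_mask, y_mask); entry k-1 refers to fragment b_k or y_k.
    A fragment is allowed only if it contains a phosphorylated residue.
    """
    n = len(tokens)
    is_phospho = [t in PHOSPHO_TOKENS for t in tokens]

    b_mask, seen = [], False
    for i in range(n - 1):           # b_1 .. b_(n-1), growing from the N-terminus
        seen = seen or is_phospho[i]
        b_mask.append(seen)

    y_mask, seen = [], False
    for i in range(n - 1, 0, -1):    # y_1 .. y_(n-1), growing from the C-terminus
        seen = seen or is_phospho[i]
        y_mask.append(seen)

    return b_mask, y_mask


def masked_mse(predicted, target, mask):
    """Mean squared error over the positions where mask is True."""
    terms = [(p - t) ** 2 for p, t, m in zip(predicted, target, mask) if m]
    return sum(terms) / len(terms) if terms else 0.0


def filter_predictions(predicted, mask):
    """Zero out predicted intensities for ions that cannot exist."""
    return [p if m else 0.0 for p, m in zip(predicted, mask)]
```

For a peptide tokenized as `["A", "pS", "K"]`, `neutral_loss_masks` allows the neutral-loss variants of b2 and y2 but not of b1 or y1, so only those positions contribute to the loss and survive filtering.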

Accurate prediction of fragment ion intensity and retention time for phosphopeptides
After the model architecture test, DeepPhospho was pre-trained using four large-scale phosphoproteomic datasets (details in Methods; Supplementary Table 1). We then used DeepPhospho to make predictions for phosphopeptides in three new datasets acquired on Q Exactive HF-X and Orbitrap Fusion Lumos mass spectrometers from two laboratories (Supplementary Table 1). Two datasets (RPE1 DDA and RPE1 DIA), both collected from RPE1 cells 3, one by DDA and the other by DIA acquisition methods, were searched by MaxQuant and Spectronaut respectively to yield phosphopeptide identification results. These identification output files were then imported into Spectronaut to generate a DDA library and a direct DIA library, which both contained MSMS spectra, PTM site localization and iRT data for identified phosphopeptides in the same format. Data in each library was split at a ratio of 8:1:1 for model training, validation and testing, respectively. The trained DeepPhospho model achieved excellent overall agreement between the experimental and predicted fragment ion intensities for the test set (median Pearson correlation coefficient (PCC) = 0.968, median spectral angle (SA) = 0.881 for RPE1 DDA; median PCC = 0.903, median SA = 0.791 for RPE1 DIA) (Fig. 1b). Furthermore, DeepPhospho enabled accurate iRT prediction for both datasets after model training (median absolute error (MAE) = 1.74 units for RPE1 DDA, 1.86 units for RPE1 DIA) (Fig. 1c, Supplementary Fig. 2a). For the third dataset (U2OS DIA), which is another DIA library generated from phosphoproteome profiling of U2OS cells 9, DeepPhospho made equally accurate predictions of fragment ion intensity and iRT (Supplementary Figs. 2b, 2c).
We also compared the performance of DeepPhospho in phosphopeptide fragment ion intensity prediction with three recently reported models. pDeep2, built on an LSTM model, also allows for transfer learning with a training set 19. DeepMS2 initially predicts for non-phosphopeptides and generates in silico MSMS spectra for the modified peptides using a "budding" strategy 20. MS2PIP predicts fragmentation patterns directly from phosphopeptide sequences using an XGBoost machine learning algorithm 21. In all cases, DeepPhospho outperformed the reported models when tested with the same phosphoproteomic datasets (Fig. 1b, Supplementary Fig. 2b).
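The evaluation metrics and the 8:1:1 split mentioned above can be sketched in plain Python. This is an illustrative sketch, not the authors' evaluation code: the spectral angle is assumed to follow the normalized definition commonly used for predicted spectra, SA = 1 - 2·arccos(cosine similarity)/π, and the function names and the random seed are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spectral_angle(x, y):
    """Normalized spectral angle: 1 for identical direction, 0 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    cosine = max(-1.0, min(1.0, dot / norm))   # clamp against rounding error
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


def split_8_1_1(items, seed=0):
    """Shuffle and split items into train/validation/test at roughly 8:1:1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

In use, each metric would be computed per phosphopeptide between the experimental and predicted intensity vectors of the test split, and the median over all peptides reported.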
Because the correlation between experimental and predicted fragment ion intensities was apparently lower for DIA data than DDA data (Fig. 1b) and our major attempt was to enhance DIA data mining, we looked into the PCC histograms for two DIA datasets examined above. We first selected phosphopeptides with a low correlation (PCC < 0.3) which indicates significant disparity between their DIA library spectra and predicted spectra (Fig. 2a upper). To determine which spectrum is more likely to be correct for a given phosphopeptide, we obtained the reference spectrum from the profiling results of the same or very similar phosphoproteome samples analyzed by gold-standard DDA acquisition methods 1,9 (Supplementary Table 1). Notably, the majority of phosphopeptides (78.6% in RPE1 DIA data, 82.8% in U2OS DIA data) showed a stronger correlation between the reference and predicted spectra than between the library and predicted spectra (Fig. 2a lower). This result suggests that the fragmentation patterns for these phosphopeptides were more accurately predicted by DeepPhospho than experimentally assigned in the DIA library.
To verify this finding, we synthesized seven phosphopeptides and acquired the bona fide high-quality MSMS spectra by targeted MS analysis of the synthetic peptides. All predicted spectra were closely correlated with the bona fide spectra (PCC 0.79 ~ 0.97) whereas the DIA library spectra for the same phosphopeptide sequences showed much lower correlations (PCC -0.61 ~ 0.20) (Fig. 2b). Mirror plots for specific phosphopeptides also reflected strong agreement between our prediction and bona fide measurement yet considerable discordance with the DIA library spectra (Fig. 2c, Supplementary Fig. 3). Taken together, DeepPhospho enables accurate prediction of fragment ion intensity for phosphopeptides, which in some cases could pinpoint false identifications in the DIA library.

Constructing DeepPhospho predicted libraries for DIA phosphoproteomics data mining
In routine DIA data analysis, a project-specific DDA library has to be built based on peptide identifications from a separate DDA experiment 22. Raw DIA data are then processed for peptide identification and quantification using a peptide-centric scoring algorithm 23 against this experimental DDA library.
Particularly, in a DIA phosphoproteomic experiment, to construct a conventional DDA library, one would need to go through an extensive procedure of phosphopeptide preparation, enrichment, pre-fractionation and LC-MS/MS analysis which takes weeks to months to complete the data acquisition (Fig. 3a). Alternatively, a direct DIA library can be generated by searching the raw DIA data directly and exploited for DIA data mining. Construction of this DIA library typically requires single-injection DIA data acquisition which can be finished within days, thus largely saving instrument time and precious samples (Fig. 3a). However, up till now, the proteome coverage of a DIA library still lags behind that of an extensive DDA library, which greatly limits the depth of proteome profiling if using the DIA data alone 3,5.
To investigate whether and to what extent in silico spectral libraries can deepen DIA phosphoproteome profiling, we designed six types of predicted libraries or hybrid libraries to be assessed in parallel with the project-specific DDA library, namely Lib 1 (Fig. 3b): Lib 2, a predicted DDA library; Lib 3, a hybrid of the direct DIA library and the predicted DDA library; Lib 4, a hybrid of the direct DIA library and the predicted library from a public phosphoproteome database; Lib 5, a hybrid of the direct DIA library and the predicted library from a public phosphosite database; Lib 6, a hybrid of the predicted DIA library and the predicted DDA library; Lib 7, a hybrid of the predicted DIA library and the predicted library from a public phosphoproteome database. Of note, all predicted libraries in Lib 2 to Lib 7 were generated by DeepPhospho based on the phosphopeptide sequences and charge states recorded in the corresponding experimental libraries or databases (Fig. 3a). [...] 9. In this experiment, a project-specific DDA library was built from 20 DDA runs of the same phosphoproteome samples. After training DeepPhospho with U2OS DIA data, Lib 2 to Lib 7 were generated based on predictions for phosphopeptides identified in the DDA or direct DIA library, and phosphopeptide sequences registered in a human phosphoproteome database 24 (hPhosPepDB) or computed from a human phosphosite database 25 (hPhosSiteDB) (Fig. 3b and Supplementary Table 2). Specifically, for hPhosPepDB, which records 204,606 human phosphopeptides identified in various proteomics projects, we generated 21 predicted libraries using different combinations of precursor and fragment mass ranges, peptide length, PTM site number and charge state (details in Methods). Then the best combination giving rise to the highest phosphoproteome coverage was used to generate Lib 4 (Supplementary Fig. 4).
Meanwhile, 350,719 phosphopeptide sequences were computed through collecting human phosphosites registered in the EPSD database and in silico digestion of the human proteome, which were used to generate the predicted library in Lib 5. Notably, Lib 4, Lib 5 and Lib 7, each of which includes a predicted library from public databases, are larger than the others by one order of magnitude and contain a unique fraction of phosphopeptides not present in Lib 1 (Fig. 3b, Supplementary Figs. 5a, 5b). Analysis of the U2OS DIA data with each library by Spectronaut led to varying phosphoproteome coverages (Fig. 3c). The largest increase of coverage was attained with Lib 7, which yielded 32,511 phosphopeptide and 26,353 phosphosite identifications, compared to 25,814 phosphopeptides and 21,400 phosphosites originally identified with Lib 1. All phosphosites reported in our study required at least 0.75 localization confidence (Class I sites) as in previous analyses 3,26. Lib 7 was produced by merging a small predicted DIA library with a large one predicted from hPhosPepDB. It is noteworthy that Lib 7 gained even more identifications than Lib 6, a hybrid of the predicted DIA and the predicted DDA libraries, which also outperformed Lib 1 by covering 30,552 phosphopeptides and 24,954 phosphosites. Interestingly, data analysis with Lib 5, which is mainly composed of a predicted library from hPhosSiteDB, led to no increase of coverage, which underscores the importance of selecting an appropriate database and optimizing the library construction parameters for the performance of predicted libraries built from public databases.
DIA data analysis with two best-performing predicted libraries Lib 7 and Lib 6 led to identifications of 10,987 and 6,453 novel phosphopeptides as well as localizations of 9,177 and 5,390 novel phosphosites respectively that were absent in the analysis with Lib 1 (Supplementary Fig. 5c). The huge gains prompted us to assess the control of false discovery rate (FDR) even though < 1% FDR was automatically set at both peptide and protein levels by Spectronaut in all our data searches. For Lib 1 and three predicted libraries Lib 2, Lib 6 and Lib 7, we created a decoy library by predicting MSMS fragmentation pattern and iRT for the reverse sequence of each identified phosphopeptide in the target library using DeepPhospho. Searching the same dataset with a target library appended with the corresponding decoy library allowed us to assess library-specific FDRs. FDR turned out to be equivalent between Lib 1 (0.55%) and all three predicted libraries (0.40%-0.64%) (Fig. 3d, Supplementary Fig. 5d). Thus, significant increase of phosphoproteome coverages by using DeepPhospho predicted libraries did not compromise the FDR control. In addition, reproducibility of phosphoproteome quantification between replicates was comparable among Lib 1 and all six DeepPhospho predicted libraries (Fig. 3e).
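The target-decoy assessment described above can be illustrated with a short sketch. It rests on several assumptions that the text does not specify: decoys are formed by plain reversal of the residue tokens, the library is represented as a dictionary keyed by a hypothetical `(tokens, charge)` pair, and the FDR is estimated as the ratio of decoy hits to target hits. The predicted spectrum and iRT of each decoy would come from the trained model and are left as placeholders here.

```python
def reverse_decoy(tokens):
    """Return the reversed residue-token sequence of a target phosphopeptide."""
    return tuple(reversed(tokens))


def build_decoy_library(target_library):
    """Create decoy entries for every (tokens, charge) key of the target library.

    Decoys that collide with an existing target (e.g. palindromic sequences)
    are skipped. Spectrum and iRT are placeholders to be filled by a predictor.
    """
    decoys = {}
    for tokens, charge in target_library:
        key = (reverse_decoy(tokens), charge)
        if key not in target_library:
            decoys[key] = {"is_decoy": True, "spectrum": None, "irt": None}
    return decoys


def estimate_fdr(n_target_hits, n_decoy_hits):
    """Simple FDR estimate (assumed here): decoy hits divided by target hits."""
    return n_decoy_hits / n_target_hits if n_target_hits else 0.0
```

For example, 5 decoy identifications among 1,000 target identifications would correspond to an estimated FDR of 0.5% under this assumed estimator.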

Performance of DeepPhospho predicted libraries in a phosphosignaling study
Next we used the RPE1 DIA dataset from a cell signaling study 3 to evaluate whether the advantage of DeepPhospho predicted libraries in deepening phosphoproteome profiling can be translated to a more biological scenario. In this study, RPE1 cells were stimulated with EGF in the absence or presence of two MEK kinase inhibitors. DIA data from 18 runs were acquired from the phosphoproteome samples prepared under six conditions in biological triplicates (Fig. 4a, Supplementary Table 1). In addition, this study recorded a project-specific DDA library consisting of 89,416 unique phosphopeptides identified from 147 DDA runs to analyze extensively pre-fractionated samples 3. After training the DeepPhospho model with RPE1 DIA data, we created five predicted or hybrid libraries following the design of Lib 2, Lib 3, Lib 4, Lib 6, and Lib 7 as described above. Lib 5 was abandoned here because of its poor performance in the previous evaluation. For the construction of Lib 4, we also compared 21 different combinations of phosphopeptide and precursor features so as to yield a predicted library from hPhosPepDB with the highest phosphoproteome coverage (Supplementary Fig. 6). Both Lib 4 and Lib 7, which include the largest predicted library from hPhosPepDB, exceeded Lib 1 in size, with each having a unique fraction of phosphopeptide identifications (Supplementary Fig. 7a). Then the RPE1 DIA data was processed with different libraries to give rise to profiling results.
Given that the biological goal of this study was to map regulated phosphosites at different conditions so as to characterize EGF-dependent phosphosignaling in the context of MEK inhibition 3, we focused on quantifiable phosphopeptides and phosphosites with ratios measured between any two experimental conditions. Not surprisingly, all five DeepPhospho predicted libraries outperformed the extensive project-specific DDA library (Lib 1) by increasing the total number of quantifiable phosphopeptides and phosphosites (Fig. 4b). The winner was Lib 6 which gained 17.9% and 14.9% more quantifications of phosphopeptides and phosphosites relative to Lib 1 (Fig. 4b).
To further deepen the phosphoproteome coverage, especially for the quantifiable portion, we explored an iterative search strategy. Phosphopeptides identified from the initial search with a specific library were selected to build a focused library, which was used to search the raw DIA data again with the same parameters as the initial search (Fig. 4c). These focused libraries are 3- to 14-fold smaller than the initial libraries (Supplementary Fig. 7b). Remarkably, the iterative search with all focused libraries significantly increased the coverage of quantifiable phosphopeptides, whereas the coverages of totally identified phosphopeptides and non-phosphopeptides remained barely changed (Fig. 4d, Supplementary Fig. 7c). This suggests that the iterative search specifically expands the fraction of the quantifiable phosphoproteome within the entire profiled proteome. Relative to Lib 1, which quantified 14,274 phosphopeptides and 12,726 phosphosites, Lib 7 showed the best performance by quantification of 17,366 phosphopeptides (21.7% increase) and 14,994 phosphosites (17.8% increase). We speculate that the iterative search enhances the sensitivity of detecting phosphopeptides identified in the initial search, which results in fewer missing values to facilitate ratio measurement of more peptides.
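The library-narrowing step of the iterative search can be sketched as a simple filter. This is only a schematic of the bookkeeping, assuming a library stored as a dictionary keyed by a hypothetical `(sequence, charge)` precursor identifier; the searches themselves are performed by the DIA search software and are not reproduced here.

```python
def build_focused_library(initial_library, identified_precursors):
    """Keep only library entries whose precursor was identified in the initial search."""
    wanted = set(identified_precursors)
    return {key: entry for key, entry in initial_library.items() if key in wanted}


def fold_reduction(initial_library, focused_library):
    """How many times smaller the focused library is than the initial one."""
    return len(initial_library) / len(focused_library)
```

The focused library is then passed to a second search of the same raw DIA data with unchanged parameters, so any gain comes from the reduced query space rather than from relaxed thresholds.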
Importantly, we also assessed the FDR control of both initial and iterative searches on this dataset using the target-decoy strategy 27. Lib 6 and Lib 7 had even smaller error rates (0.29% and 0.41% for initial searches, 0.29% and 0.21% for iterative searches) than Lib 1 (0.71% for initial search, 0.88% for iterative search) (Supplementary Fig. 7d). Taken together, we demonstrated the application of an iterative search considerably promotes DIA profiling of the quantifiable phosphoproteome while strictly controlling the FDR in data mining. Moreover, Lib 7 which was built without relying on a project-specific DDA library achieved a marked gain of coverage over the DDA library (Fig. 4d).
Based on our DIA quantification results from the iterative search, we performed an ANOVA statistical test to identify significantly regulated phosphosites at EGF or any kinase inhibitor treatment. In concordance with higher coverages, DIA data analysis with any of the five DeepPhospho predicted libraries yielded more regulated sites than Lib 1, with Lib 6 and Lib 7 gaining an additional 235 and 212 sites relative to Lib 1 (Fig. 5a). To address one of the central biological questions of this study, we performed a Tukey's range test to identify EGF-regulated sites (significantly changed at EGF treatment vs control) with each library, which were further divided into MEK-dependent (changes at EGF treatment reversed by inhibitor treatment) and MEK-independent sites (changes at EGF treatment unaffected by inhibitor treatment). All regulated sites uncovered by different libraries are summarized in Supplementary Table 3. Again, all DeepPhospho predicted libraries outperformed Lib 1 in regard to the number of functionally regulated sites in each category (Fig. 5b). For example, data mining with Lib 6 and Lib 7 uncovered 128 and 122 novel EGF-regulated phosphosites that were not revealed with Lib 1 (Fig. 5c).
Next, we performed bioinformatics analysis based on the regulated sites to assess how much biological insight into the phosphosignaling network can be gained by data mining with DeepPhospho predicted libraries. For the two libraries of most interest to us (Lib 6 and Lib 7), unsupervised hierarchical clustering of the ANOVA significant sites identified by each was performed in parallel to Lib 1, which revealed very similar patterns of regulation among all stimulation conditions when comparing Lib 7 with Lib 1 (Fig. 5d) or Lib 6 with Lib 1 (Supplementary Fig. 7e). Interestingly, signaling pathway analysis based on EGF-regulated sites revealed 7 additional pathways significantly enriched in results from Lib 6 and Lib 7 compared with Lib 1, which only enriched one pathway (Fig. 5e). The additional pathways only enriched by DeepPhospho predicted libraries included mTOR, AKT, PKC and MAPK pathways, which are well-known signaling axes activated by EGF. Consistently, nine regulated kinases including AKT1, RPS6K, MAPK/MAP2K and PAK1 were significantly over-represented in results from DeepPhospho predicted libraries, in contrast to three kinases over-represented in the result from Lib 1 by the kinase-substrate pair enrichment analysis (Fig. 5f). In summary, bioinformatics analysis of EGF-regulated sites uncovered by DeepPhospho predicted libraries recapitulated the known EGF signaling network to a much larger extent than the project-specific DDA library.

Performance of DeepPhospho predicted libraries in a quantitative two-proteome model
The quality of large-scale phosphoproteomic studies depends on not only the proteome coverage but also the quantification accuracy and reproducibility. To evaluate the quantification performance of DIA analysis with our predicted libraries, we used another published dataset acquired from a standard two-proteome model 3 (Supplementary Table 1). In this model, phosphopeptides enriched from yeast were diluted at different ratios into a fixed background of HeLa phosphopeptides, and the mixed phosphoproteome samples at five serial dilution conditions were individually subjected to DIA data acquisition, each in six injection replicates (30 DIA runs in total) (Fig. 6a). As a result, the expected ratios of yeast phosphopeptides at four conditions relative to the control would be 0.25:1, 0.5:1, 1.5:1, and 2:1 while human phosphopeptides are expected to have no changes at any condition. As usual, the previous study built an extremely extensive project-specific DDA library consisting of 119,171 phosphopeptide identifications by acquiring 203 runs of DDA data from yeast or human pre-fractionated phosphopeptide samples 3. We trained DeepPhospho with the two-proteome DIA data to create six predicted or hybrid libraries based on predictions for phosphopeptides in the DDA library, direct DIA library or phosphopeptide sequences from two different resources (Fig. 6a). Driven by a major attempt to quantify the yeast phosphoproteome with a maximal coverage, we constructed Lib 4 and Lib 7 based on predictions for 36,954 yeast phosphopeptides reported in a deep yeast phosphoproteomic study using various extraction and enrichment approaches 6 (yPhosPepDB) (Supplementary Table 2). Meanwhile, Lib 8 was built to mainly contain a predicted library for human phosphopeptides registered in hPhosPepDB.
Iterative search of the two-proteome DIA data with four DeepPhospho predicted libraries (Lib 3, Lib 4, Lib 6 and Lib 7) resulted in over 20% increase of quantifiable yeast phosphopeptides and phosphosites relative to Lib 1 (Fig. 6b). Remarkably, compared to 4,593 phosphopeptides and 3,957 phosphosites quantified with Lib 1, Lib 6 and Lib 7 yielded quantifications of 6,815 and 6,597 phosphopeptides corresponding to 5,750 and 5,640 phosphosites respectively, both achieving more than 40% increase of coverage. This result suggested the predicted library built on a published phosphoproteomics dataset for the same species in Lib 7 performed very closely to the predicted large-scale DDA library in Lib 6. As a control, the predicted library built on a public human phosphoproteome database in Lib 8 failed to significantly increase the coverage of the yeast phosphoproteome (Fig. 6b). On the other hand, when comparing the coverage of the quantifiable human phosphoproteome in the two-proteome model, we observed 55.1% increase of phosphopeptides and 47.6% increase of phosphosites with Lib 8 relative to Lib 1, whereas Lib 7 showed no increase (Fig. 6c).
This two-proteome model allowed us to precisely assess the quantification accuracy for yeast phosphopeptides serially diluted into a complex phosphoproteome background. DIA data analysis with each DeepPhospho predicted library from Lib 2 to Lib 8 yielded as accurate ratio measurement as Lib 1, with their medians of measured ratios at four dilution conditions very close to the theoretical values (Fig. 6d upper). Furthermore, our assessment of the quantifiable population of human phosphopeptides as a fixed background revealed equivalent accuracy achieved by each predicted library and Lib 1, with their medians of measured ratios at four dilution conditions all around 1:1 (Fig. 6d lower). In addition, box plot analysis of relative errors between measured and expected ratios demonstrated equally sufficient quantification accuracy of both yeast and human phosphoproteomes across all tested libraries (median per cent relative errors of 7.56%-8.18% for predicted libraries and of 7.82% for Lib 1) (Supplementary Fig. 8a). Finally, quantification reproducibility of phosphopeptides between replicates was highly comparable across all tested libraries (median CVs of 9.5%-10.1% for predicted libraries and of 9.6% for Lib 1) (Supplementary Fig. 8b). Notably, the high accuracy and reproducibility of DIA quantification results from DeepPhospho predicted libraries underscores their excellent performance not only in quantification but also in identification of the mixed phosphoproteome, given that only correctly identified phosphopeptides can yield accurate ratio measurement.
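The two summary statistics quoted above, per cent relative error against the expected ratio and the coefficient of variation (CV) across replicates, can be computed as in the following sketch. The function names are hypothetical, and the use of the sample standard deviation for the CV is an assumption, since the text does not state which estimator was used.

```python
import statistics


def percent_relative_error(measured_ratio, expected_ratio):
    """Absolute deviation from the expected ratio, as a percentage of it."""
    return abs(measured_ratio - expected_ratio) / expected_ratio * 100.0


def median_percent_relative_error(measured_ratios, expected_ratio):
    """Median per cent relative error over all peptides at one dilution."""
    return statistics.median(
        percent_relative_error(m, expected_ratio) for m in measured_ratios
    )


def cv_percent(replicate_intensities):
    """Coefficient of variation across replicates (sample standard deviation)."""
    mean = statistics.mean(replicate_intensities)
    return statistics.stdev(replicate_intensities) / mean * 100.0
```

For the yeast peptides the expected ratio would be one of 0.25, 0.5, 1.5 or 2 depending on the dilution condition, while for the human background it is 1 at every condition.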

Discussion
In this study, we present a hybrid deep neural network DeepPhospho which conceptually differs from all previous deep learning models for unmodified or modified peptide predictions in regard to peptide representation learning. Our approach utilizes a multi-module network and self-attention mechanism to learn a highly expressive peptide representation, yielding more accurate predictions. When evaluated with multiple phosphoproteomics datasets acquired by DIA or DDA methods, DeepPhospho surpasses existing benchmarks and tools in the prediction of fragmentation patterns for phosphopeptides. In certain cases, the large variance between a DeepPhospho predicted MSMS spectrum and an experimentally assigned spectrum revealed the latter was a false identification while the predicted spectrum closely mimics the bona fide spectrum. Moreover, accurate prediction of chromatographic retention time for any phosphopeptide sequence is integrated into DeepPhospho, which allows for convenient construction of in silico spectral libraries to enhance DIA phosphoproteomics data mining.
Transfer learning is a powerful approach to train deep neural networks to learn specific features of the experimental conditions under which a proteomics dataset was acquired 19. To generate an in silico library suitable for DIA data analysis, researchers chose to train a model such as Prosit and pDeep using data from a DDA experiment which was performed under nearly identical conditions to the DIA experiment 11,19. Unlike previous studies, we trained DeepPhospho using the exact DIA data to be analyzed and showed the transfer learning model afforded high accuracy in prediction of fragment ion intensity and retention time for phosphopeptides in three separate datasets. In principle, a DIA data-trained model can precisely capture the DIA experiment-related parameters that determined the data structure. These parameters would reflect specific LC and MS conditions that typically shift more or less in DDA experiments performed in the same lab on the same instruments due to internal variations of instruments and the regular change of nanoLC columns. Therefore, an in silico library generated through predictions with a DIA data-trained model is expected to perfectly match the DIA data to be analyzed. As a result, Lib 6, a combination of two predicted libraries converted from a project-specific DDA library and a direct DIA library, enabled a substantial increase of the phosphoproteome coverage in all three datasets compared to the original experimental libraries.
In our study, we designed and evaluated DeepPhospho predicted libraries built on phosphopeptide identifications not only from the experimental DDA or DIA libraries but also from community resources (Fig. 3a). One evident advantage of training the model with DIA data alone and building a predicted library based on public data resources is that there is no need to perform laborious and time-consuming DDA experiments, which could be a pronounced improvement of the current DIA workflow. However, it has been recognized that in silico libraries built on public databases, especially from proteome-scale prediction, face the challenge of an extensive query space, which would cause reduced detection sensitivity and increased false positives 14. Indeed, Lib 5, generated by whole-proteome computation of phosphopeptides based on a human phosphosite database, had the largest library size yet the lowest identification rate. In contrast, Lib 4 and Lib 7, both of which include a predicted library built on phosphopeptides recorded in a human phosphoproteome database under an optimal condition, substantially expanded the human phosphoproteome coverages in three studies without compromising the FDR control. Most importantly, data analysis with Lib 7 yields a proteome coverage comparable to or even higher than Lib 6. Thus, our study established a new DIA workflow for human phosphoproteomics which circumvents the need of DDA experiments and reaches a maximal proteome coverage largely exceeding the state-of-the-art DDA library.
In a classical EGF signaling study, we further demonstrated iterative data search with the best-performing predicted libraries (Lib 6 and Lib 7) enhanced phosphoproteome profiling to a much greater depth (21.2% average increase at the phosphopeptide level and 17.4% average increase at the phosphosite level) than a high-quality extensive DDA library. Of note, we undertook an iterative search strategy to reduce the query space for large-size predicted libraries so as to increase the detection sensitivity and deepen the proteome coverage. Meanwhile, FDR control, quantification accuracy and reproducibility for data analysis with predicted libraries remained as good as, or even better than, the DDA library. Remarkably, more regulated phosphosites were identified with Lib 6 and Lib 7, which led to the significant enrichment of a higher number of EGF signaling pathways and activated kinases than the DDA library. This implies that more biological insights could be obtained from DIA phosphoproteomics analysis by applying our new data mining workflow empowered by DeepPhospho.
DeepPhospho is provided as a web server (http://shuilab.ihuman.shanghaitech.edu.cn/DeepPhospho) to facilitate user access to predictions, retention time calibration and library generation. As showcased in our study, the ability of DeepPhospho to make high-quality predictions for phosphopeptides has enabled a fundamentally new workflow for DIA phosphoproteomics. In this workflow, only single-shot DIA data is acquired for specific samples, and data mining relies completely on the DIA data itself and a public database, without the need to construct a project-specific DDA library. For human phosphoproteomics studies, we provide a complete phosphopeptide input table (hPhosPepDB in Supplementary Table 2) to build an in silico library which can be exploited in other projects. Given the novel architecture and high performance of DeepPhospho, we envision that it can be modified to make accurate predictions for non-phosphopeptides as well as peptides with diverse modifications so as to build efficient DIA workflows for global proteomics and PTM proteomics. Such workflows would accelerate current proteomics research by enhancing protein/peptide detection as well as reducing sample size and instrument time investment. In addition, we anticipate that DeepPhospho can be readily applied to validation of phosphopeptide identifications, targeted MS assay development for selected phosphopeptides in a complex background, as well as independent assessment of FDR control in DIA analysis. We believe DeepPhospho and our new DIA phosphoproteomics workflow will benefit proteomics and biological research in various ways.

Methods
Processing of external DDA/DIA MS data. For initial evaluation of the model architecture, the mouse brain DDA data 28 was downloaded from PRIDE with the identifier PXD006637 and its MaxQuant search output file was directly used. The yeast R2P2 DDA data 6 was downloaded from PRIDE with the identifier PXD013453, and the raw data were searched against the UniProt S. cerevisiae reference proteome (7,500 protein sequences, downloaded in 2020/09) with MaxQuant 29 . MaxQuant v1.6.14.0 was used in this work with the following settings: Phospho (STY), Oxidation (M), and Acetyl (Protein N-term) were set as variable modifications; Carbamidomethyl (C) was set as a fixed modification; the tolerances of the first search and main search were 20 p.p.m. and 4.5 p.p.m., respectively; FDR at the PSM level and protein level was set to 0.01; the minimum Andromeda score of modified peptides was 40. The yeast R2P2 DDA data was also used to evaluate the iRT prediction model.
For model pre-training, the mouse brain DDA data mentioned above was used again, together with three new datasets: Vero E6 DIA data 8 (downloaded from PRIDE with the identifier PXD019113), yeast DIA data 6 (downloaded from PRIDE with the identifier PXD013453), and the human phosphopeptide RT data downloaded from the supplementary data of a published work 24 (removing phosphopeptides with the phosphosite Ascore ≤13). Both DIA datasets were used to build direct DIA libraries using the Pulsar search engine in Spectronaut 30 v14.5 by searching against the UniProt C. sabaeus reference proteome (19,136 protein sequences, downloaded in 2020/04) or the UniProt S. cerevisiae reference proteome (7,500 protein sequences, downloaded in 2020/09), respectively. The procedure of building direct DIA libraries is described in the section "Spectral library generation". For evaluation of model prediction for phosphopeptides, we downloaded RPE1 DDA and RPE1 DIA data 3 from PRIDE with the identifier PXD014525 and U2OS DIA data 31 from PRIDE with the identifier PXD017476. RPE1 DDA data, initially downloaded in a Spectronaut-specific .kit library format, was transformed to a plain text file. Then we removed peptide entries with the modifications Deamidation (NQ) and Gln->pyro-Glu, which rarely occur and are not supported in current DeepPhospho models. RPE1 DIA data and U2OS DIA data were both searched with Pulsar to generate direct DIA libraries, and the UniProt human reference proteome (UP000005640, 84,823 protein sequences, downloaded in 2020/06) was used as the sequence database. Reference spectra of phosphopeptides were obtained from two DDA-based human phosphoproteomic studies: U2OS DDA data 31 downloaded from PRIDE with the identifier PXD017476 and U-87 DDA data 1 downloaded from PRIDE with the identifier PXD009227.
For evaluation of DeepPhospho predicted libraries, we used U2OS DDA and DIA data and RPE1 DDA and DIA data as described above, as well as DDA and DIA data from a human/yeast two-proteome model 3 downloaded from PRIDE with the identifier PXD014525. The yeast DDA library built from the human/yeast two-proteome model data was also provided in a .kit format and was processed in the same way as the RPE1 DDA data. Direct DIA libraries were generated from RPE1 DIA data and U2OS DIA data as described above. In addition, the human/yeast direct DIA library was generated by Spectronaut using the UniProt human reference proteome and the UniProt S. cerevisiae reference proteome as the sequence databases.
All phosphopeptides from the external datasets that were used for model training and evaluation were required to have a phosphosite localization score >0.75 in MaxQuant or Spectronaut output files (class I sites).
Details of the external data source, sample source, MS instrument condition and data processing are described in Supplementary Table 1.
Processing of the phosphoproteome and phosphosite databases. To construct Lib 4, Lib 5 and Lib 7 in Figure 3, Lib 4 and Lib 7 in Figure 4, and Lib 4, Lib 7 and Lib 8 in Figure 6, we created three databases for generation of predicted libraries: hPhosPepDB, hPhosSiteDB and yPhosPepDB. We built hPhosPepDB based on a published human phosphoproteome database 24 which recorded the sequences, PTM sites, charge states and calibrated RT for 204,606 label-free, trypsinized, confidently localized phosphopeptides (Ascore >13) from 12,228 proteins detected in large-scale phosphoproteomic experiments from various sources. To find the best condition for generating a predicted library from hPhosPepDB, we restricted precursor and fragment mass ranges, peptide length, PTM site number and charge state in specific predicted libraries for performance testing. This was performed to generate the optimized Lib 4 and Lib 7 in both Figure 3 and Figure 4, and Lib 8 in Figure 6.
We then built hPhosSiteDB based on the human protein phosphosites registered in the EPSD database 25 and in silico digestion of the whole human proteome. Specifically, in silico phosphopeptide sequences were computed using these criteria: trypsin specificity in digestion; peptide length from 7 to 30; no missed cleavage; adding phosphosites that are registered for specific proteins in EPSD; and a maximum of one phosphosite per peptide. As a result, hPhosSiteDB contained 350,719 unique phosphopeptide sequences. Their charge states were defined as 2, 3 and 4.
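The digestion criteria above can be illustrated with a minimal Python sketch. It is not the authors' script: the function names, the input format (a protein string plus a set of 1-based phosphosite positions) and the choice not to apply a proline rule at K/R cleavage sites are all assumptions made for illustration.

```python
def tryptic_peptides(protein, min_len=7, max_len=30):
    """Return (start_index, sequence) for fully tryptic peptides, no missed cleavage.

    Cleavage is placed after every K or R. Whether the original workflow
    suppressed cleavage before proline is not stated, so that rule is
    deliberately left out here (assumption).
    """
    peptides = []
    start = 0
    for i, aa in enumerate(protein):
        if aa in "KR" or i == len(protein) - 1:
            pep = protein[start:i + 1]
            if min_len <= len(pep) <= max_len:
                peptides.append((start, pep))
            start = i + 1
    return peptides


def singly_phosphorylated(protein, known_sites, min_len=7, max_len=30):
    """Emit one (peptide, site_in_peptide) pair per registered phosphosite.

    known_sites is a hypothetical input: 1-based protein positions that a
    phosphosite resource lists for this protein. Each output entry carries
    exactly one phosphosite, matching the "maximum of one phosphosite" rule.
    """
    entries = []
    for start, pep in tryptic_peptides(protein, min_len, max_len):
        for offset, aa in enumerate(pep):
            if aa in "STY" and (start + offset + 1) in known_sites:
                entries.append((pep, offset + 1))
    return entries
```

For example, `singly_phosphorylated("MASPTEEEKAAAAAAASKGR", {3, 17})` keeps the two 9-residue peptides, drops the 2-residue tail, and yields one entry per registered site. Charge states 2, 3 and 4 would then be assigned to every entry.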
The yeast phosphoproteome database yPhosPepDB was built based on 36,954 yeast phosphopeptides detected in a yeast R2P2 phosphoproteomic study using various extraction and enrichment approaches 6 . The original charge states assigned in MaxQuant output files were kept for all phosphopeptides in yPhosPepDB.
Notations and Data Representation. Each input peptide is represented by a sequence of amino acid tokens denoted as L, K, M, etc., typically 7-50 in length. For phosphopeptides, we use 1 to represent the oxidation of methionine (M), and 2, 3, 4 to represent the phosphorylation of serine (S), threonine (T) and tyrosine (Y), respectively. In addition, DeepPhospho supports peptides with an N-terminal acetyl modification. We use the * symbol to indicate this N-terminal modification and @ to indicate its absence.
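The token scheme described above can be sketched as follows. The function name and the input format (plain sequence plus a position-to-modification dictionary) are hypothetical, and the sketch assumes that a digit token replaces the residue letter rather than following it, which the text implies but does not state outright.

```python
# Token table taken from the text: 1 = oxidized M, 2/3/4 = phospho S/T/Y.
MOD_TOKENS = {
    ("M", "Oxidation"): "1",
    ("S", "Phospho"): "2",
    ("T", "Phospho"): "3",
    ("Y", "Phospho"): "4",
}


def encode_peptide(sequence, mods=None, n_term_acetyl=False):
    """Encode a peptide as a token string.

    mods maps 1-based residue positions to a modification name
    (hypothetical input format). The first token is '*' for an
    N-terminal acetyl group and '@' otherwise.
    """
    mods = mods or {}
    tokens = ["*" if n_term_acetyl else "@"]
    for pos, aa in enumerate(sequence, start=1):
        if pos in mods:
            key = (aa, mods[pos])
            if key not in MOD_TOKENS:
                raise ValueError(f"unsupported modification {mods[pos]} on {aa}{pos}")
            tokens.append(MOD_TOKENS[key])  # assumption: digit replaces the letter
        else:
            tokens.append(aa)
    return "".join(tokens)
```

Under these assumptions, `encode_peptide("AMSLTYK", {2: "Oxidation", 3: "Phospho"})` gives `@A12LTYK`, and an N-terminally acetylated peptide starts with `*` instead of `@`.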

More details of the three databases are provided in Supplementary
For the task of fragment ion intensity prediction, we denote the model input as $(m, a_1, \dots, a_n, +q)$, where $m$ is the token of * or @, $a_i$ denotes the amino acids, $n$ is the peptide length, and $+q$ is the peptide precursor charge. The output spectrum, or the peptide fragmentation pattern, is represented by a matrix whose size is set by the maximum peptide length in the dataset, and each row is a set of intensity values for the different combinations of b/y ions, two charge states (+1 or +2), and with or without loss of phosphate (-1,H3PO4 or -noloss). Fragment ion intensity values at impossible dimensions are set to −1 while the rest are normalized.
For the task of iRT prediction, we use the peptide without charges as our input, denoted as $(m, a_1, \dots, a_n)$, where $m$ and $a_i$ are described above. The output is the retention time, normalized to [0, 1] for each dataset.
Model Architecture. The DeepPhospho model consists of three main modules: an embedding network, a sequence modeling network and a regression network. The embedding network encodes the input tokens into feature vectors while the regression network generates output predictions. As the fragment ion intensity and iRT prediction have different forms of input and output, we adopt separate designs for the embedding and regression network in those two tasks.
1) Embedding network. For the fragment ion intensity prediction, we first embed each amino acid and the charge into vectors of 192 and 64 dimensions, respectively, and then concatenate them as inputs to the sequence modeling module. For the RT prediction, we directly embed each amino acid into a vector of 256 dimensions.
2) Sequence modeling network. We adopt a hybrid network for the main module of our model, which consists of a bidirectional Long Short-Term Memory (biLSTM) 32 subnet and a Transformer 33 subnet. Our biLSTM subnet comprises two stacked bidirectional LSTM layers with hidden dimensions of 512. This module aims to compute an initial representation of the peptide sequence, which is then fed into the second module, the Transformer subnet. The Transformer aims to capture long-range dependency in the peptide sequences with a more effective attention mechanism. Our Transformer subnet stacks multiple Transformer encoders, each of which has 8 self-attention heads. We also use the standard sine and cosine functions as the position encoding 33 . More specifically, for the fragment ion intensity prediction, the Transformer subnet comprises 8 layers of Transformer encoders. For the iRT prediction, we use an ensemble of networks with 4 to 8 layers of Transformer encoders.
3) Regression network. We use a simple linear layer to project the features at each amino acid site to a vector of 8 dimensions as our output in the task of fragment ion intensity prediction. For the iRT prediction, we introduce a linear layer to generate an instance-specific weight for sequence features and use a weighted average to produce the RT prediction.
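For readers unfamiliar with the sine and cosine position encoding used in the Transformer subnet, the following pure-Python sketch reproduces the standard formulation from the original Transformer paper. The function name and the example dimensions are illustrative choices; the feature dimension actually fed to the Transformer in DeepPhospho is not stated here and is left as a parameter.

```python
import math


def sinusoidal_position_encoding(n_positions, d_model):
    """Standard sinusoidal encoding: even columns use sine, odd columns cosine.

    PE[pos][2k]   = sin(pos / 10000 ** (2k / d_model))
    PE[pos][2k+1] = cos(pos / 10000 ** (2k / d_model))
    """
    pe = [[0.0] * d_model for _ in range(n_positions)]
    for pos in range(n_positions):
        for i in range(0, d_model, 2):  # i = 2k walks over the even columns
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe
```

The resulting matrix has one row per sequence position and is added to the token features before the first encoder layer, so that attention can distinguish residues by their position in the peptide.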
Model Training. We adopt a transfer learning strategy to train our models. For the task of fragment ion intensity prediction, we first pre-train our model on three datasets (Supplementary Table 1), which provides a good initialization. We then fine-tune the pre-trained model on each of three target datasets (Supplementary Table 1). Specifically, for the pre-training datasets, we split each of them into a training and a validation set with a 9 : 1 ratio; for the three target datasets, we split each of them into training, validation and test sets with an 8 : 1 : 1 ratio. We use the mean squared error (MSE) as our loss function and the Adam update 34 to optimize the loss, with a learning rate of 1e-3 on the first pre-training phosphoproteome dataset and 1e-4 on the other datasets. We decay the learning rate by a factor of 0.1 after a predefined number of epochs. We tune the model hyper-parameters and select the best model on the validation set. We report the results on the test set of the target datasets. For the task of iRT prediction, we use the same pre-training procedure, in which the RT values are normalized into [0,1] on the three datasets. For the target datasets, we manually set min(RT) and max(RT) to -100 and 200, respectively. We use the Root Mean Square Error (RMSE) loss and take the same training strategy as in the fragment ion intensity task.
Metric. For the task of fragment ion intensity prediction, we compute the Pearson Correlation Coefficient (PCC) between the prediction and the ground truth of each peptide and select the median of those PCCs as the final evaluation metric. In addition, we follow Prosit 11 and use the normalized spectral angle (SA) as another metric, and also report the median of those SAs. The normalized spectral angle is defined as follows.
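$$\mathrm{SA} = 1 - \frac{2 \arccos(\hat{y} \cdot y)}{\pi}$$
where $\hat{y}$ and $y$ are the predicted and observed intensity vectors, each with an L2 norm equal to 1. We select the model by the median PCC metric.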
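As an illustration of the two spectral metrics, the sketch below computes a per-peptide PCC and normalized spectral angle and then takes the medians over peptides. It assumes the Prosit-style definition SA = 1 − 2·arccos(ŷ·y)/π on L2-normalized vectors. Dropping positions whose observed value is −1 (the marker for impossible fragments) before scoring is an assumption of this sketch, and all function names are hypothetical.

```python
import math
import statistics


def valid_pairs(pred, obs):
    """Keep only positions where the observed intensity is not the -1 mask (assumption)."""
    kept = [(p, o) for p, o in zip(pred, obs) if o != -1]
    return [p for p, _ in kept], [o for _, o in kept]


def pearson(x, y):
    """Pearson correlation coefficient; undefined if either vector is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spectral_angle(pred, obs):
    """Normalized spectral angle: 1 for identical direction, 0 for orthogonal vectors."""
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_o = math.sqrt(sum(o * o for o in obs))
    dot = sum(p * o for p, o in zip(pred, obs)) / (norm_p * norm_o)
    dot = max(-1.0, min(1.0, dot))  # guard against rounding just outside [-1, 1]
    return 1 - 2 * math.acos(dot) / math.pi


def median_metrics(spectra):
    """spectra: iterable of (predicted, observed) intensity lists, one pair per peptide."""
    pccs, sas = [], []
    for pred, obs in spectra:
        p, o = valid_pairs(pred, obs)
        pccs.append(pearson(p, o))
        sas.append(spectral_angle(p, o))
    return statistics.median(pccs), statistics.median(sas)
```

Reporting medians rather than means keeps a handful of badly predicted peptides from dominating the summary.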
For the task of iRT prediction, we adopt the ∆t 95% metric as the main metric, which represents the minimal time window containing the deviations between observed and predicted RTs for 95% of the peptides:
$$\Delta t_{95\%} = 2 \cdot \left| \mathrm{RT}_{\mathrm{obs}} - \mathrm{RT}_{\mathrm{pred}} \right|_{95\%}$$
The subscript 95% means the 95% rank of the deviations.
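A small sketch of the ∆t 95% computation follows. It assumes the window is symmetric around zero, so that its width is twice the 95th-percentile absolute deviation, and it uses a nearest-rank percentile; the percentile convention used by the authors is not stated, and the function name is hypothetical.

```python
def delta_t95(observed, predicted):
    """Width of the window that contains the RT deviations of 95% of peptides.

    Nearest-rank percentile is used (assumption): the k-th smallest absolute
    deviation with k = ceil(0.95 * n), computed with integer arithmetic to
    avoid floating-point rounding in the rank.
    """
    abs_dev = sorted(abs(o - p) for o, p in zip(observed, predicted))
    n = len(abs_dev)
    if n == 0:
        raise ValueError("no peptides given")
    k = (95 * n + 99) // 100  # ceil(0.95 * n)
    return 2 * abs_dev[k - 1]
```

With 100 peptides whose absolute deviations are 1, 2, ..., 100 minutes, the 95th-ranked deviation is 95 and the function returns a window of 190 minutes.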
Model architecture validation. To validate our model design, we conduct a set of ablation studies on the model architecture. Specifically, we compare our model with several alternative model designs, including the biLSTM module only, the Transformer module only, and replacing biLSTM with a CNN module (CNN+Transformer). We use a variant of ResNet34 35 as the CNN module in our setting. We perform the comparisons on both the fragment ion intensity and iRT prediction benchmarks (Supplementary Table 1). We split each dataset into training : validation : test = 8 : 1 : 1, and after model selection on the validation set, we report the results on the test set. PCC and SA are used to validate the prediction of fragment ion intensity and median absolute error (MAE) is used to validate the iRT prediction.
Comparison of DeepPhospho with other models. We compared our method with several published models, including pDeep2, DeepMS2, and MS2PIP, on three datasets: RPE1 DDA, RPE1 DIA, and U2OS DIA. For pDeep2, the source code and pre-trained model parameters pretrain-180921-modloss.ckpt were downloaded from their website https://github.com/pFindStudio/pDeep/tree/master/pDeep2, and transfer learning was performed with the default hyper-parameters. For DeepMS2 (https://github.com/lmsac/DeepDIA), we directly used the pre-trained model parameters epoch_035.hdf5 for precursor charge of value 2 and epoch_034.hdf5 for precursor charge of value 3 for prediction. Then we followed the budding strategy described by the authors 20 to generate in silico spectra for phosphopeptides with scripts stored in https://github.com/lmsac/DeepMS2-phospho. For MS2PIP, we directly used the MS2PIP server https://iomics.ugent.be/ms2pip for prediction with the model HCD (including b++ and y++ ions).
Analysis of synthetic phosphopeptides. Seven phosphopeptides were synthesized by GenScript (Nanjing, China). The peptide powders were dissolved in ultrapure water or DMSO to prepare 5-10 mg/ml stocks.
The stock solution was diluted to 100 ng/μl using 0.1% FA and seven phosphopeptides were mixed together before injection into the nanoLC-MS system for DDA and PRM data acquisition.
The nanoLC-MS/MS analysis was conducted on an EASY-nLC 1200 connected to a QE HF mass spectrometer (Thermo Fisher Scientific, USA) with a nano-electrospray ionization source. A peptide mixture of 10 ng was loaded in each replicate and separated on an analytical column (200 mm x 75 μm) packed in-house with C18-AQ 1.9 μm resin (Dr. Maisch, GmbH, Germany) over a 60-min gradient from 4% to 45% mobile phase B (0.1% FA in acetonitrile) at a flow rate of 300 nl/min. In DDA data acquisition, the resolution of the Orbitrap analyzer was 60,000 for MS1 and 15,000 for MS2. The AGC target was set to 3e6 in MS1 and 1e5 in MS2, with a maximum ion injection time of 120 ms in both MS1 and MS2. The isolation window was set to 1.6 m/z, with stepped collision energy at 25%, 27% and 30%. In PRM data acquisition, the resolution of the Orbitrap analyzer was 120,000 for MS1 and 30,000 for MS2. The AGC target was set to 3e6 in MS1 and 5e5 in MS2, with a maximum ion injection time of 20 ms in MS1 and 120 ms in MS2. The isolation window was set to 1.0 m/z, with stepped collision energy at 25%, 27% and 30%. The inclusion list contained the precursor m/z and RT windows for the seven phosphopeptides that were detected in the DDA experiment.
Acquired DDA raw data was analyzed using MaxQuant (v1.6.17.0) against the seven phosphopeptide sequences appended with a contaminant sequence database. The following search parameters were used: no fixed modification, Phospho (STY) as variable modification, and trypsin as specific enzyme. The first search tolerance was set to 20 ppm and the main search tolerance to 4.5 ppm, and results were filtered for PSM and protein FDR of 1%. Then the msms.txt file exported from MaxQuant was imported into Skyline 36 (v20.2.0.343) to build a library. PRM data was analyzed by Skyline with the major settings: precursor charges 2, ion charges 1 and 2, ion types p, b, y, product ion selection from m/z > precursor to last ion, library pick product ions 25, use scans within 30 min. All XICs of selected fragments were manually inspected and adjusted to ensure proper peak picking and peak integration. Precursors with an "idotp" value >0.9 were accepted.
For the generation of predicted libraries, a list of phosphopeptide sequences collected from the DDA library, the direct DIA library or a specific phosphoproteome or phosphosite database was input to the trained DeepPhospho models for the prediction of fragment ion intensity and iRT. In some cases, a direct DIA library needs to be merged with a predicted library to yield a hybrid library. For any redundant peptides present in both the direct DIA library and the predicted library, their experimental MSMS spectra and iRT values in the direct DIA library were retained in the hybrid library. All predicted libraries and hybrid libraries generated in-house were written to a tab-separated value (TSV) file to be processed by Spectronaut in DIA data analysis.
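The merging rule for hybrid libraries can be sketched in a few lines of Python. The in-memory representation is an assumption: each library is a dictionary keyed by a (modified peptide, precursor charge) tuple, and each value is a hypothetical record holding the spectrum and iRT. Real libraries are Spectronaut-compatible TSV files with many more columns.

```python
def build_hybrid_library(direct_dia, predicted):
    """Merge two libraries; experimental direct DIA entries win on conflicts.

    Both arguments map (modified_peptide, charge) -> entry. The key choice
    and the entry layout are illustrative assumptions.
    """
    hybrid = dict(predicted)   # start from the predicted entries
    hybrid.update(direct_dia)  # overwrite redundant ones with experimental data
    return hybrid
```

Because `dict.update` replaces values for existing keys, any precursor present in both libraries ends up with its experimental spectrum and iRT, while precursors unique to either library are kept unchanged.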
In the iterative search, we created a focused library corresponding to each complete library used in the initial search. For Lib 1 and Lib 2 in Figure 4 and Figure
DIA data analysis and library-specific FDR estimation. Raw DIA data were processed using Spectronaut v14.5 with default settings. In brief, PTM localization was activated and the site probability score cutoff was set to 0.75, data filtering was set to Q-value, and Normalization Strategy was set to Global Normalization.
Decoy generation was set to mutated. Interference correction was enabled and the number of minimum inferenced ions was 2 and 3 for MS1 and MS2, respectively. Peptide and protein level Q-value cutoff was set to 1%. In each analysis, a specific experimental library, predicted library or hybrid library generated earlier was imported to Spectronaut. For U2OS DIA data and RPE1 DIA data analysis, the UniProt human reference proteome (UP000005640, 84,823 protein sequences, downloaded in 2020/06) was used as the protein sequence database. For the analysis of the two-proteome model, the UniProt human reference proteome and the UniProt S. cerevisiae reference proteome (7,500 protein sequences, downloaded in 2020/09) were used. After DIA data processing, the peptide and protein reports were exported for further statistics and bioinformatics analysis.
Although Spectronaut automatically set FDRs to be <1% at both peptide and protein levels, we created decoy libraries for independent assessment of the library-specific FDR. For each library to be assessed (i.e. a target library), the sequences of all peptides in this library were reversed except the C-terminal residue, and the original charge states were kept. The resulting reversed peptides were imported to the trained DeepPhospho models for prediction of their fragment ion intensities and iRT values to generate a decoy library with a size identical to that of the target library. Then the target library was appended with the corresponding decoy library to generate a target-decoy library which was used to process the DIA data. The library-specific FDR is defined as
$$\mathrm{FDR} = \frac{\mathrm{Hits}_{\mathrm{decoy}}}{\mathrm{Hits}_{\mathrm{target}}}$$
where $\mathrm{Hits}_{\mathrm{decoy}}$ indicates the number of peptides identified from the decoy portion, and $\mathrm{Hits}_{\mathrm{target}}$ indicates the number of peptides identified from the target portion of the target-decoy library.
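The decoy construction can be illustrated with the sketch below. It operates on a token string in which each modified residue is a single character, so modifications travel with their residues when the sequence is reversed; that handling, the function names, and the use of the conventional decoy-to-target ratio for the FDR are assumptions of this sketch.

```python
def reverse_decoy(tokens):
    """Reverse a peptide token string while keeping the C-terminal residue in place.

    tokens must not include an N-terminal marker; one character per residue
    (modified residues encoded as single tokens) is assumed.
    """
    if len(tokens) < 2:
        return tokens
    return tokens[:-1][::-1] + tokens[-1]


def library_specific_fdr(hits_decoy, hits_target):
    """Conventional target-decoy estimate: decoy hits divided by target hits (assumption)."""
    if hits_target == 0:
        raise ValueError("no target identifications")
    return hits_decoy / hits_target
```

Keeping the C-terminal residue fixed preserves the tryptic K/R terminus, so decoy precursors retain the mass and y-ion characteristics typical of real tryptic peptides. For instance, `reverse_decoy("ASPEK")` returns `"EPSAK"`.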
Statistics and Bioinformatics analysis. In the phosphosignaling study with RPE1 DIA data, Spectronaut reports were first modified to be compatible with PerseusR 37 , and then transformed into modification-specific peptide-like reports using Peptide Collapse 3 , with the target PTM site as the collapse level, localization cutoff 0.75, and variable PTMs in the order of Phospho (STY), Oxidation (M) and Carbamidomethyl (C). The reported intensities were log2-transformed and z-scored. Quantifiable phosphopeptides and phosphosites were selected if their intensities were measured in all three replicates for at least two different treatments. A one-way ANOVA test was applied to the quantifiable phosphosites to identify significantly regulated sites (p <0.05) at EGF or any kinase inhibitor treatment vs control (Figure 5a). The Tukey's range test implemented in statsmodels was then applied to ANOVA-significant phosphosites to identify the EGF-regulated sites which were significantly changed at EGF treatment vs control (adjusted p <0.05). The EGF-regulated sites were further divided into two classes: one is the MEK-dependent phosphosite, which was also significantly changed according to the Tukey's range test at one of the kinase inhibitor treatments with the opposite trend of regulation to the EGF treatment; the other is the MEK-independent phosphosite, which showed no significant regulation at any kinase inhibitor treatment (Figure 5b).
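The quantifiability criterion (measured in all three replicates for at least two treatments) can be expressed as a short filter. The data layout, a dictionary from treatment name to a list of replicate intensities with `None` marking a missing value, and the treatment labels in the example are hypothetical.

```python
def is_quantifiable(intensities, n_replicates=3, min_treatments=2):
    """Return True if enough treatments have a complete set of replicate values.

    intensities maps treatment name -> list of replicate intensities,
    where None marks a missing measurement (illustrative layout).
    """
    complete = 0
    for values in intensities.values():
        if len(values) == n_replicates and all(v is not None for v in values):
            complete += 1
    return complete >= min_treatments
```

A site measured three times under control and three times under one inhibitor condition passes the filter even if its EGF replicates are incomplete; a site with only one fully measured treatment does not.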
Hierarchical clustering was applied to the ANOVA-significant phosphosites identified at EGF or high-dose inhibitor treatment using Scipy 38 , with the metric set to correlation and the method set to average.
To fill in the expression matrix for clustering and heatmap, NA values were imputed by randomly sampling values from a normal distribution with a mean of -1.5 and a standard deviation of 0.5.
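The imputation step can be sketched with the standard library alone. The matrix layout (rows of z-scored values with `None` for NA), the function name and the fixed seed are illustrative assumptions; only the distribution parameters come from the text.

```python
import random


def impute_missing(matrix, mean=-1.5, sd=0.5, seed=0):
    """Replace None entries with draws from N(mean, sd); observed values are untouched.

    A new matrix is returned so the input is not modified. The seed makes
    the sketch reproducible and is not part of the described method.
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(mean, sd) if value is None else value for value in row]
        for row in matrix
    ]
```

Because the intensities are z-scored before this step, drawing around -1.5 places imputed values in the low-abundance tail, which is the usual rationale for this kind of down-shifted imputation.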
Heatmaps of all significantly regulated phosphosites were generated by unsupervised hierarchical clustering.
The kinase-substrate pair enrichment was performed based on the kinase-substrate relationship downloaded from PhosphoSitePlus 39 (access date: 2021/01) using the Fisher exact test implemented in Scipy 38 . We used all EGF-regulated phosphosites as the input and all identified phosphosites as the background. Signaling pathway enrichment was then performed based on the Reactome pathway data 40 (access date: 2021/01) using the Fisher exact test. EGF-regulated phosphosites were first collapsed to the protein level and used as the input while the background was all identified phosphoproteins. A total of 14 significantly enriched pathways and 13 significantly enriched kinases (adjusted p <0.05) were initially identified using any of the six tested libraries. Eight enriched non-redundant pathways and nine overrepresented kinases that were discovered using at least two spectral libraries were kept and shown in Figures 5e, 5f.
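To make the enrichment calculation concrete, the sketch below computes a one-sided Fisher exact p-value from the hypergeometric distribution using only the standard library. It is a stand-in for the SciPy routine used in the study; the choice of the one-sided (over-representation) alternative and the function and argument names are assumptions, and the multiple-testing adjustment is not shown because the method is not specified.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher exact p-value for over-representation.

    k: regulated sites annotated to the kinase (or pathway)
    n: all regulated sites (the input)
    K: all background sites annotated to the kinase (or pathway)
    N: all background sites
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    if not (0 <= k <= min(n, K) and n <= N and K <= N):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total
```

For the 2x2 table [[3, 1], [1, 3]], which corresponds to k = 3, n = 4, K = 4, N = 8, the function returns 17/70, about 0.243.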
In the quantitative two-proteome model study, the peptide precursor intensities exported from Spectronaut were first de-normalized by dividing the reported intensity values by their normalization factors. The measured ratio of a phosphopeptide identified at any of the four dilution conditions (0.25:1, 0.5:1, 1.5:1 and 2:1) was calculated by dividing its intensity measured at that condition by the intensity measured at the control condition (1:1). Then quantifiable yeast or human phosphopeptides and phosphosites were selected if they had at least one ratio measured at any dilution condition relative to control.
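The de-normalization and ratio steps amount to two divisions, sketched below. The function names and the dictionary layout keyed by dilution label are hypothetical; `None` marks a condition in which the phosphopeptide was not measured.

```python
def denormalize(reported_intensity, normalization_factor):
    """Undo the global normalization by dividing by the run's factor, as described above."""
    return reported_intensity / normalization_factor


def ratios_to_control(intensity_by_condition, control="1:1"):
    """Ratio of each dilution condition to the control condition.

    Returns an empty dict when the control intensity is missing, in which
    case the phosphopeptide has no measurable ratio and is not quantifiable.
    """
    base = intensity_by_condition.get(control)
    if not base:
        return {}
    return {
        condition: value / base
        for condition, value in intensity_by_condition.items()
        if condition != control and value is not None
    }
```

A phosphopeptide counts as quantifiable in this scheme when `ratios_to_control` returns at least one entry.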
Boxplots were created with boxes marking the first and third quartiles, a dash marking the median, and whiskers marking the minimum/maximum value within 1.5 interquartile range. Outliers are not displayed.
MSMS spectra prediction by DeepPhospho pinpoints false identifications in the DIA library. a, Distribution of PCC between spectra predicted by DeepPhospho vs spectra assigned in the DIA library for two datasets (upper). For phosphopeptides of low spectral similarity (PCC within -1 to 0.3), their PCC distributions are calculated between predicted spectra vs reference spectra, and between predicted spectra vs DIA library spectra, and plotted around the diagonal (lower). Reference spectra were obtained by gold-standard DDA analysis of the same phosphopeptide samples. b, Correlation between the predicted spectra and the high-quality spectra of the synthetic peptide, and between the predicted spectra and the DIA library spectra, for seven selected phosphopeptides. c, Spectra mirror plots for phosphopeptides show much higher similarity between the predicted spectra and the synthetic peptide spectra than between the predicted spectra and the DIA library spectra. Relative fragment ion intensities in the predicted spectra, the DIA library spectra and the synthetic peptide spectra are annotated by purple, orange and blue lines. * indicates the loss of a phosphate.
Generation of DeepPhospho predicted libraries for DIA phosphoproteomics data mining. a, An experimental DDA library or direct DIA library can be converted to a predicted DDA library or a predicted DIA library by DeepPhospho. A predicted library can also be generated from public phosphoproteome or phosphosite databases, or external phosphoproteomics data. b, Design of seven spectral libraries for the U2OS DIA data analysis. DDA/dDIA, experimental DDA or direct DIA library; predDDA/predDIA, predicted libraries converted from the DDA and DIA library; hPhosPepDB/hPhosSiteDB, predicted libraries built from public human phosphoproteome and phosphosite databases. Lib 3 to Lib 7 are each composed of two separate libraries. c, Number of phosphopeptides and phosphosites identified using each library. Percentage of the total phosphopeptide or phosphosite number is shown for each predicted library relative to the project-specific DDA library (Lib 1). The proportions of shared identifications (IDs), gained IDs, lost IDs and gap IDs yielded by Lib 2 to Lib 7 compared to Lib 1 are indicated in different colors. Gap IDs are those present in Lib 1 yet absent in the DeepPhospho predicted libraries, thus they cannot be identified with the latter. d, Library-specific FDR assessed using the target-decoy strategy. e, % coefficient of variation (CV) of all phosphopeptide quantification between 10 replicates at each condition.

Figure 4
DIA data analysis with DeepPhospho predicted libraries in a phosphosignaling study. a, Experimental design of a biological study of EGF-dependent signaling in the context of MEK inhibition. b, Number of phosphopeptides and phosphosites that were quantified from the initial search using each library. c, Procedure of building a focused library to be used for an iterative search. d, Number of phosphopeptides and phosphosites that were quantified from the iterative search using each library. Percentage of the total quantifiable phosphopeptide or phosphosite number is shown for each predicted library relative to Lib 1. The proportions of shared identifications (IDs), gained IDs, lost IDs and gap IDs yielded by Lib 2 to Lib 7 compared to Lib 1 are indicated in different colors.