In silico spectral libraries by deep learning facilitate data-independent acquisition proteomics

Data-independent acquisition (DIA) is an emerging technology for quantitative proteomic analysis of large cohorts of samples. However, sample-specific spectral libraries built by data-dependent acquisition (DDA) experiments are required prior to DIA analysis, which is time-consuming and limits the identification/quantification by DIA to the peptides identified by DDA. Herein, we propose DeepDIA, a deep learning-based approach to generate in silico spectral libraries for DIA analysis. We demonstrate that the quality of in silico libraries predicted by instrument-specific models using DeepDIA is comparable to that of experimental libraries, and that these libraries outperform those generated by global models. With peptide detectability prediction, in silico libraries can be built directly from protein sequence databases. We further illustrate that DeepDIA can overcome the limitation that DDA imposes on peptide/protein detection, and that it enhances DIA analysis of human serum samples compared to the state-of-the-art protocol using a DDA library. We expect this work to expand the toolbox for DIA proteomics.


Supplementary
Three human serum samples without depletion of high-abundance proteins (HAP), each with 2 DIA runs, without or with the PQ500 Reference Peptides Kit.

Supplementary Note 1. Evaluation of reproducibility of peptide MS/MS spectra and RT across instruments and labs
We collected DDA data of HeLa cells from five Orbitrap mass spectrometers in different labs. HeLa1 1 (ProteomeXchange identifier PXD005573, Supplementary precursors and 11,555 triply-charged precursors) with HeLa1, respectively. Dot products 5 (DP) were computed between peak intensities of b/y/neutral loss product ions of the same precursors in HeLa1 and the other datasets (Supplementary Fig. 1a).
Similarities of peptide MS/MS spectra acquired on Q Exactive HF across different labs (HeLa2-HeLa1 and HeLa3-HeLa1) were higher than those across different types of Orbitrap mass spectrometers (HeLa4-HeLa1 and HeLa5-HeLa1), but were lower than those acquired on the same instrument (HeLa1). Pearson correlation coefficients (r) of iRT and iRT differences were also computed between HeLa1 and the other datasets (Supplementary Fig. 2a).
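
As an illustration of the spectral similarity measure described above, the following is a minimal sketch of a dot product between the fragment-ion intensities of the same precursor in two datasets. It assumes a normalized (cosine-style) dot product and a dictionary keyed by fragment-ion label; the function name, the ion labels, and the intensities are hypothetical and are not taken from the DeepDIA code.

```python
import math


def normalized_dot_product(spectrum_a, spectrum_b):
    """Cosine-style dot product between two fragment-intensity maps.

    Each spectrum is a dict mapping a fragment-ion label (e.g. "y5+1")
    to its peak intensity. Ions missing from one spectrum count as 0.
    Normalizing by the vector norms is an assumption of this sketch.
    """
    ions = sorted(set(spectrum_a) | set(spectrum_b))
    a = [spectrum_a.get(ion, 0.0) for ion in ions]
    b = [spectrum_b.get(ion, 0.0) for ion in ions]
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Hypothetical intensities for one precursor observed in two datasets.
hela1 = {"y3+1": 100.0, "y4+1": 60.0, "b2+1": 30.0}
hela2 = {"y3+1": 90.0, "y4+1": 70.0, "b2+1": 20.0, "y5+1": 5.0}
dp = normalized_dot_product(hela1, hela2)
```

Under this convention, a value of 1 means identical relative fragment intensities and 0 means no shared fragment ions.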

for peptide MS/MS spectrum prediction
The performance of peptide MS/MS prediction of DeepDIA was compared with  . 2c). We also performed a comparison on a dataset of mouse tissue 1 (Mouse1 DeepDIA, Pearson correlation coefficients (r) of predicted and experimental iRT were higher than those of Prosit and SSRCalc, while the interquartile ranges (IQR) of the differences between predicted and experimental iRT were smaller than those of Prosit and SSRCalc. Currently, the Prosit website does not offer an iRT refinement service, so prediction was performed with the existing Prosit model trained on the ProteomeTools data. We believe that prediction accuracy would improve if refinement were performed by retraining the model, as reported in the Prosit publications. Indeed, training instrument-specific models is exactly what we suggest.
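
The two iRT metrics used in this comparison, the Pearson correlation coefficient (r) and the interquartile range (IQR) of the prediction errors, can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the iRT values are hypothetical, and the quartile method ("inclusive") is an assumption, since the text does not state which definition was used.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def iqr(values):
    """Interquartile range (Q3 - Q1); the quartile method is an assumption."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical experimental and predicted iRT values for six peptides.
experimental_irt = [-10.0, 5.0, 22.0, 48.0, 75.0, 101.0]
predicted_irt = [-8.0, 4.0, 25.0, 45.0, 78.0, 99.0]

differences = [p - e for p, e in zip(predicted_irt, experimental_irt)]
r = pearson_r(predicted_irt, experimental_irt)
spread = iqr(differences)
```

A higher r together with a smaller IQR of the differences indicates a more accurate retention time predictor.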

Supplementary Note 4. Performance comparison of DeepDIA and Prosit on the datasets of HeLa cells and mixed proteome samples
The performance of DeepDIA was compared with Prosit for DIA analysis on a dataset of HeLa cells (HeLa1). In silico spectral libraries were generated using DeepDIA (HeLaPredicted, see Supplementary Table 2 for details) and Prosit (HeLaProsit,  Fig. 3a and 3b). The numbers of detected peptides and protein groups were maximized when CE = 30. HeLaPredicted detected 29% more peptides (excluding those with length > 30, which are not supported by Prosit) and 6% more protein groups than HeLaProsit (Fig. 3a). The median coefficients of variation (CVs) of peptide precursor and protein group quantities across three technical replicates were smaller with the HeLaPredicted library than with the HeLaProsit library (Fig. 3b).
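
For readers who want to reproduce the reproducibility metric, the median CV across technical replicates can be sketched as below. This is an illustrative sketch only: it assumes the CV is the sample standard deviation divided by the mean, and the identifiers and quantities are hypothetical.

```python
import statistics


def coefficient_of_variation(quantities):
    """CV of replicate quantities: sample standard deviation / mean."""
    return statistics.stdev(quantities) / statistics.fmean(quantities)


def median_cv(quantities_by_id):
    """Median CV over all precursors (or protein groups).

    `quantities_by_id` maps an identifier to its replicate quantities.
    Entries with fewer than two values or a non-positive mean are skipped.
    """
    cvs = [
        coefficient_of_variation(values)
        for values in quantities_by_id.values()
        if len(values) >= 2 and statistics.fmean(values) > 0
    ]
    return statistics.median(cvs)


# Hypothetical quantities of three precursors in three technical replicates.
replicate_quantities = {
    "PEPTIDEA_2": [10.0, 10.0, 10.0],
    "PEPTIDEB_2": [9.0, 10.0, 11.0],
    "PEPTIDEC_3": [80.0, 100.0, 120.0],
}
result = median_cv(replicate_quantities)
```

Computing this value once per library on the same replicate runs allows the two libraries to be compared directly.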

DeepDIA and Prosit were also compared on a dataset of mixed proteome samples 1 containing peptides from Homo sapiens, Caenorhabditis elegans, Saccharomyces cerevisiae and Escherichia coli at different abundances (see Supplementary Table 1 for details). In silico spectral libraries were generated using DeepDIA (MixPredicted, see Supplementary  Fig. 5). Percent changes of detected precursors and protein groups of each organism between the two samples were computed from the mean quantities in three replicates of each sample. The percent changes estimated using the MixPredicted library were more accurate than those estimated using the MixProsit library, at both the precursor and the protein group level (Fig. 3d).
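
The percent change calculation described above can be sketched as follows. This is a minimal illustration, assuming that the percent change is expressed relative to the mean quantity in the first sample; the function name and the quantities are hypothetical.

```python
import statistics


def percent_change(sample_a, sample_b):
    """Percent change of the mean quantity from sample A to sample B.

    Each argument holds the replicate quantities of one precursor or
    protein group. Using sample A as the reference is an assumption.
    """
    mean_a = statistics.fmean(sample_a)
    mean_b = statistics.fmean(sample_b)
    if mean_a == 0:
        raise ValueError("mean quantity in sample A is zero")
    return (mean_b - mean_a) / mean_a * 100.0


# Hypothetical quantities of one protein group in three replicates per sample.
quantities_a = [100.0, 95.0, 105.0]
quantities_b = [150.0, 155.0, 145.0]
change = percent_change(quantities_a, quantities_b)
```

The estimated change for each precursor or protein group can then be compared with the change expected from the known mixing ratio of its organism.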

Supplementary Note 5. Prediction of peptide detectability by mass spectrometry using deep learning
Peptides of proteins identified with sequence coverage ≥ 25% were collected from a dataset of HeLa and HEK-293 cells 1 (HeLa&HEK, Supplementary Table 1  However, in many cases, peptide-level prior knowledge is lacking, and spectral libraries can only be built from protein sequences. By predicting peptide detectability by mass spectrometry, target peptides can be selected from proteins with an appropriate detectability score threshold as discussed above. Strategies for large-scale database searching such as iterative searching 13 , two-step methods 14 , and sectioning approaches 15 are worth exploring. We attempted to adapt a two-step approach for DIA analysis using spectral libraries generated from proteome-scale databases, e.g. SwissProt. On the HeLa1 dataset, we downsized the spectral library generated from SwissProt H. sapiens (HumanProt50) to the proteins detected in the first search. Protein inference was re-performed on the in silico digested peptides of the proteins detected in the first search with ≤ 2 missed cleavages and with detectability score ≥ 0.5, and consequently the library contained 5,831 proteins ( level. The numbers of identified protein groups and peptides also increased after the second search (Supplementary Fig. 8). Similar results were obtained on the Mouse1 dataset. During the second search, the library was smaller and more specific to the sample, which could lead to better performance in peptide and protein identification.
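
The library downsizing step of the two-step approach can be sketched as below. This is an illustrative sketch under stated assumptions, not the DeepDIA implementation: the `DigestedPeptide` record, the `downsize_library` function, the protein names, and the peptide sequences are all hypothetical. Only the two thresholds (≤ 2 missed cleavages, detectability score ≥ 0.5) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DigestedPeptide:
    """One in silico digested peptide (hypothetical record layout)."""
    sequence: str
    protein: str
    missed_cleavages: int
    detectability: float


def downsize_library(peptides, detected_proteins,
                     max_missed_cleavages=2, min_detectability=0.5):
    """Keep peptides of first-search proteins that pass both thresholds."""
    detected = set(detected_proteins)
    return [
        p for p in peptides
        if p.protein in detected
        and p.missed_cleavages <= max_missed_cleavages
        and p.detectability >= min_detectability
    ]


# Hypothetical digest of three proteins; only two were detected in search 1.
digest = [
    DigestedPeptide("AAAGK", "PROT_A", 0, 0.91),
    DigestedPeptide("AAAGKLLR", "PROT_A", 1, 0.42),   # low detectability
    DigestedPeptide("CCDEK", "PROT_B", 0, 0.77),      # protein not detected
    DigestedPeptide("GGHKIIKMMKNNR", "PROT_C", 3, 0.88),  # too many missed cleavages
    DigestedPeptide("QQSTR", "PROT_C", 0, 0.50),
]
second_search_library = downsize_library(digest, ["PROT_A", "PROT_C"])
```

In this example, only the peptides `AAAGK` and `QQSTR` remain for the second search, which illustrates how the downsized library becomes smaller and more specific to the sample.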