Deep neural network for detecting arbitrary precision peptide features through attention based segmentation

A promising technique for discovering disease biomarkers is to measure the relative protein abundance in multiple biofluid samples through liquid chromatography with tandem mass spectrometry (LC-MS/MS) based quantitative proteomics. The key step involves detecting peptide features in the LC-MS map, along with their charge and intensity. Existing heuristic algorithms suffer from inaccurate parameters and human errors. As a solution, we propose PointIso, the first point cloud based arbitrary-precision deep learning network to address this problem. It consists of an attention-based scanning step for segmenting the multi-isotopic pattern of 3D peptide features along with the charge, and a sequence classification step for grouping those isotopes into potential peptide features. PointIso achieves 98% detection of high-quality MS/MS identified peptide features in a benchmark dataset. Next, the model is adapted to handle the additional 'ion mobility' dimension and achieves 4% higher detection than existing algorithms on the human proteome dataset. Besides contributing to the proteomics study, our novel segmentation technique should serve the general object detection domain as well.

Technique                                      z=0   z=1   z=2   z=3   z=4
Sliding window with 50% overlapping            86%   50%   72%   57%   36%
Skip link inserted in above model              85%   56%   76%   66%   41%
Bi-directional 2D RNN                          85%   51%   77%   67%   49%
Attention mechanism                            84%   62%   85%   71%   50%
Attention mechanism with higher resolution     90%   64%   85%   81%   61%

Supplementary Table S 2. Different techniques of absorbing surrounding information and the corresponding class sensitivity of the IsoDetecting module. We define the class sensitivity of a scanning window as the number of datapoints from class z (0 to 9) detected correctly out of the total number of datapoints in a scanning window. To evaluate candidate solutions we use the class sensitivity of high abundant features (charge z = 1, 2, 3, and 4) in an average case scenario. Average case means the scanning window might contain any number of features, they may appear at any location of the window, they might be partially or fully seen, and might be overlapping as well. We see that the DANet-inspired attention mechanism works better than the other techniques.
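The class sensitivity computation can be sketched as follows (a minimal illustration; the array layout and the per-class denominator shown here are assumptions, not the exact implementation):

```python
import numpy as np

def class_sensitivity(true_labels, pred_labels, z):
    # Fraction of the class-z datapoints in a scanning window that the
    # model labels correctly (one reading of the definition above).
    mask = (true_labels == z)
    if mask.sum() == 0:
        return None  # class z absent from this window
    return float((pred_labels[mask] == z).mean())

# Toy window: 3 datapoints of charge 2, of which 2 are detected correctly
true = np.array([2, 2, 2, 0, 0])
pred = np.array([2, 2, 0, 0, 0])
```

Averaging this quantity over many scanning windows gives the per-charge percentages reported in the table.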
• First, we just calculated the global features of the surrounding regions and diffused them by addition and concatenation with the global features of the target window. Then we repeated the 50% overlapping technique, which did not bring any significant change. Adding some skip links along with that brings a little improvement, as shown in the second row of the table.
• Then we used a bi-directional two-dimensional RNN network to flow information from all directions into the target window. The corresponding class sensitivities are provided in the third row.
• Finally, we applied the attention mechanism proposed by DANet, which works better than the first two techniques, as reported in the fourth and fifth rows. So we chose the PointNet segmentation network combined with DANet to develop our IsoDetecting module.
Supplementary Figure S 1. (a) Here we see a peptide feature with three isotopes and charge 2. But if this is passed to the IsoGrouping module with the wrong frames (dotted rectangles) because of the wrong charge (z = 4) predicted by the IsoDetecting step, it results in discarding this whole group of frames (by predicting it as noise) due to the inconsistency (blank frames) observed. (b) We see the adjacent feature problem. (c) We see one training sample prepared for the IsoGrouping module to solve the adjacent feature problem. (d) In the topmost rectangle we see the isotopes of a peptide feature in black traces, and some secondary signals as well. The middle rectangle shows the true labeling of the datapoints using two colors: grey means negative class and black means positive class. The primary signals are detected by PointIso as shown in the bottom rectangle. However, the secondary signals are also predicted as positive class by PointIso, which is wrong: datapoints in those regions should be predicted as negative class by the IsoDetecting module. So we select such peptide features for fine tuning.

Fine Tuning Using Misclassified Features
In this section we discuss the approach of fine tuning. Fine tuning the primary model by feeding back the misclassified data played an essential role in the overall improvement. After we train and validate the IsoDetecting and IsoGrouping modules, we run a full scan on the same LC-MS maps used for training (e.g., in the 3D dataset, the LC-MS maps of dilution samples 9 to 12 for the first fold). That means we scan the full LC-MS map with the IsoDetecting module in a column by column order, which gives a list of sequences of equidistant isotopes. Those are then passed to the IsoGrouping module for generating the final list of peptide features. This peptide feature list is then matched against the MS/MS identified peptide features reported for the respective LC-MS map by MASCOT. Based on the matching result, we select the misclassified peptide features. By the term misclassified we mean four particular cases: peptide features not detected due to a very long retention time range, peptide features detected with the wrong charge, multiple peptide features grouped into one feature, and false positives due to secondary signals. We then generate training samples from those misclassified cases as mentioned in the training data generation subsections of the Method section in the main text and retrain both modules using those. We call this approach fine tuning by misclassified peptide features. After that we run the testing on the test set. How we detect those four types of misclassified features, extract them from the LC-MS map, and label them for fine tuning PointIso is explained below in more detail. While providing examples, we will be using LC-MS maps from fold 1.
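The matching step described above can be sketched as follows (a minimal illustration; the feature representation and the exact tolerance values are assumptions for demonstration):

```python
def select_misclassified(detected, identified, mz_tol=0.01, rt_tol=0.2):
    """Compare detected features with MS/MS identified ones and flag
    wrong-charge detections and unmatched candidates.

    Each feature is a dict with 'mz', 'rt', 'z'; this layout and the
    tolerances are hypothetical placeholders, not the authors' code.
    """
    flagged = []
    for d in detected:
        hit = next((f for f in identified
                    if abs(f['mz'] - d['mz']) <= mz_tol
                    and abs(f['rt'] - d['rt']) <= rt_tol), None)
        if hit is None:
            flagged.append((d, 'unmatched'))
        elif hit['z'] != d['z']:
            flagged.append((d, 'wrong_charge'))
    return flagged

detected = [{'mz': 500.24, 'rt': 30.1, 'z': 4},
            {'mz': 600.50, 'rt': 10.0, 'z': 2}]
identified = [{'mz': 500.235, 'rt': 30.0, 'z': 2}]
```

Features flagged this way are converted into new training samples for retraining, as described in the bullet points below.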
• First, some identified peptide features were not detected, and when we inspected them visually in the LC-MS map using PEAKS Studio, we discovered that they have a comparatively longer retention time range, about 0.30 minutes or longer. Please note that those features were provided during initial training, but such cases were not learned well. So we provide such samples twice in the dataset (i.e., they are given more weight, or emphasized, by duplication) and retrain the model. This improves the detection rate by about 2% (fourth row in Table 3).
• Second, we compare the clusters or sequences of isotopes returned by the IsoDetecting module with the MS/MS identification result for the respective LC-MS map and select those sequences which match the identification result in terms of m/z range and retention time range but DO NOT match in terms of charge, and are also rejected later by the IsoGrouping module. Such features are selected and fed back for further learning by the IsoDetecting module. These sequences represent the features which are detected by the IsoDetecting module but with the wrong charge. This usually happens when the feature appears in the middle region of the target window (like 'F' in Figure 4(a) of the main manuscript). As a result, the wrong set of frames is passed to the IsoGrouping module, as shown in Figure S1(a). This causes complete rejection of the feature by the IsoGrouping module, and we miss a feature although it was detected in the first module. Please note that it is possible that these features went through the usual training before. However, since these selected ones are not learned well, we are just asking the IsoDetecting module to learn them again with greater emphasis (ensured by duplicating these samples during retraining, as done in the previous point). Retraining the IsoDetecting module using such samples improves the detection rate by about 1.5% (sixth row in Table 3 of the main manuscript).
• Third, adjacent features as shown in Figure S1(b) were not correctly separated by the IsoGrouping module. Here, both features have the same charge, the same or a very close RT value at the peak intensity point, and the distance between the two features along the m/z axis is equal to the distance between their own isotopes. We select such features by comparing the sequences of isotopes returned by the IsoDetecting module with the MS/MS identified features for the respective LC-MS map and finding those sequences where the 2nd/3rd/4th/5th frame matches the identification result in terms of m/z, RT, and charge, but the 1st frame does not. This means that the monoisotope appears on the 2nd/3rd/4th/5th frame and the isotopes preceding it come from some other feature that precedes and is adjacent to this matched feature. So we select such sequences and label them in a way that breaks the feature into two. For example, the sequence shown in Figure S1(c) is labeled as 1, so that the IsoGrouping module learns to output '1' for this sequence, which breaks it into two features as shown by the dotted rectangles.
• Finally, we fine tune the model with false positives, which are essentially feature-like noisy or secondary signals appearing very close to the main isotopic signal (Figure S1(d), topmost rectangle). Our initial trained model reports those feature-like signals as peptide features (Figure S1(d), bottom rectangle), and as a result we get about 200,000 peptide features in total. We visually traced some of the false positive features and realized that they are actually secondary signals. Although random noise removal is learned easily by PointIso, separating feature-like noisy traces and removing those secondary peaks is difficult for it. This task has to be performed by the IsoDetecting module, since it has access to the whole context. So it has to see through all the signals and decide which ones are to be reported and which are to be ignored. In order to collect those wrong reports, we match the peptide feature list returned by PointIso (for the 4th replicate of dilution sample 9, in fold 1) with the peptide feature lists produced by all other tools (i.e., OpenMS, MaxQuant, Dinosaur, and PEAKS) with an error tolerance of 0.01 m/z and 0.2 min RT. The features from PointIso which match neither the other tools nor the identification result by MASCOT are selected for fine tuning (about 70,000 features). As mentioned in the Method section, a training sample for the IsoDetecting module consists of a target window and its four surrounding regions. So we cut out those false positive features from the respective LC-MS map, keeping the feature in the target window. All the datapoints in the target window which do not belong to a primary signal are labeled as '0'. Then these newly generated training samples (about 30,000) are used for retraining the already trained model, by running a few epochs with a 0.0001 learning rate and keeping the best model state. This same set of features is also used for fine tuning the IsoGrouping module.
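The selection of false positives by cross-matching against the other tools can be sketched as follows (a minimal illustration; the feature representation is an assumption):

```python
def select_false_positives(pointiso_features, other_tool_lists, identified,
                           mz_tol=0.01, rt_tol=0.2):
    # Keep the features reported by PointIso that match neither any other
    # tool's list nor the MS/MS identification result, within the stated
    # tolerances of 0.01 m/z and 0.2 min RT.
    def matched(f, lst):
        return any(abs(f['mz'] - g['mz']) <= mz_tol and
                   abs(f['rt'] - g['rt']) <= rt_tol for g in lst)
    reference = list(identified) + [g for lst in other_tool_lists for g in lst]
    return [f for f in pointiso_features if not matched(f, reference)]

pointiso = [{'mz': 500.240, 'rt': 30.10},   # also found by another tool
            {'mz': 700.100, 'rt': 5.00}]    # found by no one else
others = [[{'mz': 500.245, 'rt': 30.05}]]
```

The surviving features are then cut out of the LC-MS map and labeled '0' outside the primary signal, as described above.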
To make a training sample for fine tuning the IsoGrouping module, we pick a false positive (whose isotopes are secondary signals) and cut a sequence of 5 frames, each holding the respective secondary signals. Then we label the sequence as '0', indicating that no feature exists in that sequence. We can actually balance the true positive detection rate against the false positive detection rate using the amount of such training samples. Without any fine tuning we get 99.50% detection, with about 200,000 total features. After fine tuning with 30,000 samples we have 98% detection with 100,000 features (which is reported in the main result). If we fine tune with 50,000 samples then the total number of features drops to 80,000, but the detection percentage also falls to 97.53%. So it seems that if we are too strict about discarding the secondary signals, we may also miss some true lower intensity peptide features which are close to other higher intensity peptide features. It should be an interesting research scope to see whether we can solve this problem without a reduction in detection percentage. Besides that, we believe the necessity of this step also depends on the dataset and the mass spectrometer used, because high-resolution mass spectrometers usually produce narrower signals without those secondary peaks; thus we might avoid this step and obtain over 99% detection. In our manuscript we have used the fine tuned model of the IsoDetecting module, but the other models are also uploaded to the GitHub repository. Users can choose according to their needs, and can also fine tune further if necessary.

Supplementary Note D Basic Properties of Peptide Feature
The model is supposed to learn the following basic properties of a peptide feature, besides many other hidden characteristics, from the training data.
1. In the LC-MS map, the isotopes in a peptide feature are equidistant along the m/z axis; the spacing is approximately the ~1.00335 Da isotope mass difference divided by the charge. For charge z = 1 to 9, the isotopes are respectively about 1.00 m/z, 0.50 m/z, 0.33 m/z, 0.25 m/z, 0.20 m/z, 0.17 m/z, 0.14 m/z, 0.13 m/z, and 0.11 m/z apart from each other.
2. The intensities of the isotopes form a bell shape over their retention time (RT) range, as shown in the zoomed-in view of Figure 1 of the main manuscript.
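Property 1 follows from the roughly 1.00335 Da mass gap between successive isotopes; a minimal sketch of the expected spacings:

```python
NEUTRON_MASS_DA = 1.00335  # approximate mass difference between adjacent isotopes

def isotope_spacing(z):
    # Expected m/z distance between neighboring isotopic peaks
    # of a peptide feature with charge z
    return NEUTRON_MASS_DA / z

# spacings for z = 1..9, rounded to 2 decimal places
spacings = [round(isotope_spacing(z), 2) for z in range(1, 10)]
# spacings → [1.0, 0.5, 0.33, 0.25, 0.2, 0.17, 0.14, 0.13, 0.11]
```

This rule is what lets the IsoDetecting module decide a charge class from the observed gap between isotopes.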

Cross-Validation Technique
Our model is trained and evaluated through 2-fold cross-validation. That is, we divide the LC-MS maps in the dataset into two groups or folds. Then we train on one group and test on the other, and vice versa. In order to save the best model state during training, we keep one LC-MS map from the training set as a validation set. Then we use that model state for testing. Whenever we say train or validate on an LC-MS map, we mean training/validation on the features cut from that LC-MS map (explained later under the subsections regarding training data generation for the IsoDetecting and IsoGrouping modules). However, when we say test on an LC-MS map, we actually mean scanning the full LC-MS map as shown in the block diagram of Figure 1 in the main text. That is, during testing (or the real application phase) we actually scan the whole LC-MS map in a bottom up, left to right fashion (in other words, column by column). Now, we discuss how the LC-MS maps are divided into different folds for cross-validation as follows:
• In the 3D dataset (downloaded from ProteomeXchange with accession number PXD001091), there are 12 dilution samples for the LC-MS analysis, each sample having a different concentration of spike proteins. Each sample goes through the physical LC-MS instrument and produces an LC-MS map. Usually each experiment is repeated multiple times, and each resultant LC-MS map is called a replicate. Dilution sample 1 and samples 5 to 12 have 4 replicates each (e.g., 130124_dilA_1_01, 130124_dilA_1_02, 130124_dilA_1_03, 130124_dilA_1_04 are the four replicates for sample 1; these names are also mentioned in the final result shown in Supplementary Table S3). Dilution samples 2, 3, and 4 have 7 replicates each. Therefore, we have 57 LC-MS maps in total. We divide these samples or LC-MS maps into 2 groups, X

and Y. Group X contains the LC-MS maps from samples 9, 10, 11, and 12. Group Y contains the LC-MS maps from samples 1 to 8. We first train the model on Group X (i.e., samples 10, 11, 12 are used for training and sample 9 is used as a validation set to save the best state of the model) and test the trained model on Group Y. Then we do the opposite: we train the model on Group Y (samples 5, 6, 7, 8 are used for training and sample 1 is used for validation) and test the trained model on Group X. The result of testing on all the samples is provided in Supplementary Table S3. Please note that we report the result separately for each replicate. The detection percentage for a sample is the average detection percentage over all the replicates of that sample. We have already reported the detection percentage for the 12 samples in Figure 2(a) and Table 1 of the main text.
• Next, we discuss how the LC-MS maps in our 4D dataset (downloaded from ProteomeXchange with accession number PXD010012) are distributed into different folds. For the 4D dataset, we have samples from two mobile phases: A and B. A provides 12 LC-MS maps (denoted as 2042 to 2053) and B provides 4 LC-MS maps (2054 to 2057). In total we have 16 LC-MS maps, so we simply divide them into two groups of equal size. Group X has 8 LC-MS maps: 2042 to 2049. Group Y has the remaining 8 LC-MS maps: 2050 to 2057. Just like before, we train on Group X (2044 is used for validation and the rest for training) and test on Group Y. Then we train on Group Y (2054 is used for validation and the rest for training) and test on Group X. The test result for each LC-MS map is provided in the plot shown in Figure 2(c) of the main text.
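The fold construction described above can be sketched as follows, using the 4D dataset's map identifiers (the helper function and dictionary layout are illustrative assumptions):

```python
def two_fold_splits(group_x, group_y, val_x, val_y):
    # Each fold trains on one group (holding one map out for validation)
    # and tests on the entire other group; the second fold is the reverse.
    fold1 = {'train': [m for m in group_x if m != val_x],
             'val': val_x, 'test': list(group_y)}
    fold2 = {'train': [m for m in group_y if m != val_y],
             'val': val_y, 'test': list(group_x)}
    return fold1, fold2

group_x = [str(i) for i in range(2042, 2050)]  # maps 2042 to 2049
group_y = [str(i) for i in range(2050, 2058)]  # maps 2050 to 2057
fold1, fold2 = two_fold_splits(group_x, group_y, '2044', '2054')
```

The same scheme applies to the 3D dataset, with dilution samples 9 to 12 as one group and samples 1 to 8 as the other.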

Supplementary Method
Therefore, Left_Impact_on_Target(j, i) presents the attention of the i-th point of the left region over the j-th point of the target window. The higher the value, the higher the correlation (feature similarity) between those two points. So the j-th row tells us which datapoints from the left region have higher attention over, or are highly correlated with, the j-th datapoint of the target window. Next, we want to fetch those significant point features from the left region. So we apply another round of matrix multiplication (the 4th operation) between Left_Impact_on_Target and Point_Feature_left. We denote the resultant product as Filtered_Left, since it essentially gives us Point_Feature_left scaled/filtered according to the aforementioned correlation or attention. Then again, we have to know how much of those filtered features should be incorporated with Point_Feature_target while segmenting the

datapoints of the target window. So we use a weight matrix, namely Attention_Weight_left, and multiply it with Filtered_Left, producing Attention_left, which is finally passed forward to be diffused (by addition) with Point_Feature_target. This Attention_Weight_left is learned through training.
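In matrix form, the sequence of operations above might be sketched as follows (a minimal illustration; the shapes, the softmax normalization of the attention map, and the variable names are assumptions for demonstration, not the exact implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diffuse_left_attention(pf_target, pf_left, attention_weight_left):
    # pf_target: (Nt, C) per-point features of the target window
    # pf_left:   (Nl, C) per-point features of the left surrounding region
    # attention_weight_left: (C, C) learned weight matrix
    # Row j of impact: attention of each left-region point over target point j
    impact = softmax(pf_target @ pf_left.T, axis=1)         # (Nt, Nl)
    filtered_left = impact @ pf_left                         # (Nt, C)
    attention_left = filtered_left @ attention_weight_left   # (Nt, C)
    return pf_target + attention_left                        # diffuse by addition
```

The same pattern is repeated for the other three surrounding regions, each with its own learned weight matrix.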

Supplementary Method B Resolution Degradation in IsoGrouping Module
The IsoDetecting module generates sequences of potential isotopes, which are sent to the IsoGrouping module for final detection of peptide features. Now, the m/z values of isotopic signals are real numbers with up to 4 decimal places. We degrade each signal to a real number with up to 2 decimal places. For example, let us have two sequences of isotopes, A and B, where the first isotopes of the sequences are denoted as A1 and B1 respectively, with m/z values A1_mz = 500.2351 and B1_mz = 500.2443. During resolution degradation we filter out the signal intensity from the background within a ±2 ppm m/z range, where the half-width of the range is calculated as A1_mz × 2.0/10^6 ≈ 0.001 m/z. So we have A1_mz = 500.24, and the intensity is set as the maximum of the intensities of the datapoints within the range 500.2341 to 500.2361. Similarly, B1_mz = 500.24, but the intensity is set as the maximum of the intensities of the datapoints within the range 500.2433 to 500.2453. So we see that, although both have the same m/z value in lower resolution, they definitely belong to different sequences and have different intensities (and thus might have different patterns as well).
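This degradation step can be sketched as follows (a minimal illustration; the datapoint representation is an assumption):

```python
def degrade_resolution(mz, datapoints, ppm=2.0):
    # Collapse a 4-decimal m/z signal to 2 decimals; its intensity becomes
    # the maximum intensity among the datapoints lying within ±ppm of the
    # original m/z value.
    half_width = mz * ppm / 1e6          # e.g., 500.2351 * 2e-6 ≈ 0.001 m/z
    in_range = [i for m, i in datapoints
                if mz - half_width <= m <= mz + half_width]
    return round(mz, 2), max(in_range)

# (m/z, intensity) datapoints around isotope A1
points = [(500.2350, 10.0), (500.2355, 12.0), (500.2440, 8.0)]
```

Applying this to A1 (500.2351) and B1 (500.2443) yields the same degraded m/z of 500.24 but different intensities, as described above.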

Supplementary Method C Scanning Procedure by IsoGrouping Module
This method is intentionally kept similar to DeepIso. Although we have upgraded the internal architecture significantly, for the sake of user convenience we do not change the scanning procedure. Therefore, we request our readers to refer to the corresponding material from DeepIso.

Peaks connected by the black line are detected as a peptide feature by all algorithms. However, the feature having a lower peak, connected by the orange line, is missed by other tools because it has the same m/z as the one with the higher peak; therefore, most of the tools merge it with the bigger one during pre-processing steps. (c) The monoisotope of the feature enclosed in the orange rectangle is missed by other tools due to merging with A. Here, features A and C are detected by PointIso, but the other tools detect A and, instead of C, mistakenly report B, missing the monoisotope (merging it with A). (d) Very closely residing and overlapping features, like feature B in the blue and C in the orange rectangle, are sometimes missed by other tools as well, although detected by PointIso. (e) Features with broken signals are detected by our model but discarded by other algorithms.