Protein identification by 3D OrbiSIMS to facilitate in situ imaging and depth profiling

Label-free protein characterization at surfaces is commonly achieved using digestion and/or matrix application prior to mass spectrometry. We report the assignment of undigested proteins at surfaces in situ using secondary ion mass spectrometry (SIMS). Ballistic fragmentation of proteins induced by a gas cluster ion beam (GCIB) leads to peptide cleavage, producing fragments for subsequent Orbitrap™ analysis. In this work we annotate 16 example proteins (up to 272 kDa) by de novo peptide sequencing and illustrate the advantages of this approach by characterizing a protein monolayer biochip and the depth distribution of proteins in human skin.


Supplementary Table 1 Amino acid fragments used in ToF-SIMS of proteins, first assigned by Wagner and Castner 1 . The listed ions were used in the Bi₃⁺ ToF-SIMS image (Figure 1a).
Supplementary Figure 1 Positive mode 3D OrbiSIMS spectrum of lysozyme (blue) and positive mode LMIG ToF-SIMS spectrum of lysozyme (red). The argon gas cluster ion beam (GCIB) produces large multi-amino-acid fragments, which can be detected and assigned with high accuracy by the Orbitrap™ analyser. Spectral intensities have been normalised to total ion count.
Supplementary Figure 2 Positive mode 3D OrbiSIMS spectrum of lysozyme highlighting example sections of amino acid sequence (red, blue, purple), explained by the peptide fragments assigned in the spectrum. In the example section of the protein sequence, colour-coded bold amino acid residues are a result of direct observations and coloured non-bold adjacent residues are inferred from the database protein sequence. Values in brackets show the deviation of each residue assignment. Amino acid neutral losses can be assigned with confidence due to the high mass accuracy of the Orbitrap™ analyser. The total ion dose per measurement was 1.63 × 10¹¹.
Supplementary Table 2 List of analysed proteins from the smallest (insulin, 51 amino acid sequence) to the largest (fibronectin, 2446 amino acid sequence). Sequence coverage achieved with the 3D OrbiSIMS is presented as the number of assigned amino acids and the fraction of the whole protein sequence.
Supplementary Figure 5 Proposed structures of observed ions dissimilar to CID MS of tryptic peptides, similar to HCD or MALDI-ISD fragmentation. (a) Internal ion ya is presented on an example NAWVA sequence observed in the lysozyme spectrum. (b) C-terminal ion z+1 is presented on an example RGCRL sequence observed in the lysozyme spectrum. The elemental composition and theoretical m/z of each structure were generated in ChemDraw.

Supplementary Table 3 Examples of detected disulphide bonds in the 16 analysed proteins. The referenced Supplementary Tables provide full assignments of the relevant sequences.

Supplementary Table 4 Peak list exported from SurfaceLab negative mode depth profile through human skin, consisting of ions detected in the spectrum and assigned as sequence PGE of collagen. The m/z values represent the experimentally observed center mass of each peak. The deviation (dev.) represents the parts per million (ppm) accuracy of the assignment. The colours represent the labelling in the depth distribution of the assigned peaks in Supplementary Figure 10.
Supplementary Table 5 Peak list exported from SurfaceLab negative mode depth profile through human skin, consisting of ions detected in the spectrum and assigned as sequence QHGSG of corneodesmosin. The m/z values represent the experimentally observed center mass of each peak. The deviation (dev.) represents the parts per million (ppm) accuracy of the assignment. The depth distribution of the assigned peaks is presented in Supplementary Figure 11.
Supplementary Table 6 Peak list exported from SurfaceLab negative mode depth profile through human skin, consisting of ions detected in the spectrum and assigned as sequence SFGGGG of keratin. The m/z values represent the experimentally observed center mass of each peak. The deviation (dev.) represents the parts per million (ppm) accuracy of the assignment. The depth distribution of the assigned peaks is presented in Supplementary Figure 12.

Supplementary Note 1: Analysis of a protein biochip
A practical application of the developed method is demonstrated by the detection of information-rich protein fragments on a protein biochip. The spectra of the biochip contain information from the first two scans of the surface, after which the signal rapidly declines, as the monolayer of the protein is [...] The actual number of protein molecules is smaller than the number of cyclodextrin molecules and is determined by the size of the protein. Therefore, the maximum number of lysozyme molecules in the analysed area is 2.18 × 10¹⁰ (40 femtomoles). This amount of lysozyme allows detection of seven distinct lysozyme fragments (Supplementary Table 7).

Supplementary Table 7
Lysozyme fragments visible in the spectra obtained from a protein monolayer sample. The protein monolayer was obtained by immobilisation of the protein on a self-assembled monolayer (SAM) on a gold slide. Three areas on the sample were analysed, and a SAM sample without the protein was analysed as a control.
Sensitivity and limit of detection of the peptidic fragments were also calculated for the lysozyme film.
In reference lysozyme film samples, the material removed during sputtering is assumed to be pure protein. The number of protein molecules can be calculated by Equation 2, where ρ is the density of the protein (1.37 g/mL) 4 , A is the analysed area, 200 × 200 μm (40000 μm²), and d is the depth of material consumed during the analysis, estimated by the SurfaceLab software and confirmed by profilometry (300 nm, Supplementary Table 8).

Equation 2
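The calculation described for Equation 2 can be sketched numerically. The sketch below assumes the usual form N = ρ · A · d · N_A / M, where the Avogadro constant N_A and the molar mass M are not defined in the text above; the value M ≈ 14.3 kDa for lysozyme is an assumption made here for illustration, and the function name is hypothetical.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def molecules_in_sputtered_volume(density_g_per_ml, area_um2, depth_nm,
                                  molar_mass_g_per_mol):
    """Return (molecules, moles) for a sputtered volume assumed to be pure protein.

    Assumed form of Equation 2: N = rho * A * d * N_A / M.
    """
    area_cm2 = area_um2 * 1e-8        # 1 um^2 = 1e-8 cm^2
    depth_cm = depth_nm * 1e-7        # 1 nm = 1e-7 cm
    volume_ml = area_cm2 * depth_cm   # 1 cm^3 = 1 mL
    mass_g = density_g_per_ml * volume_ml
    moles = mass_g / molar_mass_g_per_mol
    return moles * AVOGADRO, moles


# Values from the text: rho = 1.37 g/mL, A = 40000 um^2, d = 300 nm.
# The molar mass of 14300 g/mol is an assumption, not a value from the text.
n_molecules, n_moles = molecules_in_sputtered_volume(1.37, 40000, 300, 14300)
print(f"{n_molecules:.2e} molecules, {n_moles * 1e12:.2f} pmol")
```

Under these assumptions the result is on the order of 10¹¹ to 10¹² molecules, that is, about one picomole of lysozyme.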
The profilometry results allow calculation of the depth resolution of the instrument with the chosen settings. In reference protein samples, with a total depth of 300 nm after 30 scans, the depth per scan was 10 nm. This is consistent with the SurfaceLab estimate based on the primary ion current. In the cryogenic skin depth profile, depth was estimated from knowledge of the skin layers, with the stratum corneum making up approximately the outermost 20 μm of the skin. 5 Based on the profile of the phosphate marker PO₃⁻, the underlying epidermis was reached after 1500 scans (Supplementary Figure 9); the calculated depth resolution is therefore 13 nm per scan.
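The depth-per-scan figures quoted above follow from dividing the total sputtered depth by the number of scans. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def depth_per_scan_nm(total_depth_nm, n_scans):
    """Average depth removed per scan, assuming a constant sputter rate."""
    return total_depth_nm / n_scans


# Reference protein film: 300 nm after 30 scans.
reference_nm = depth_per_scan_nm(300, 30)

# Cryogenic skin profile: about 20 um of stratum corneum crossed in 1500 scans.
skin_nm = depth_per_scan_nm(20_000, 1500)

print(reference_nm, round(skin_nm, 1))
```

The constant sputter rate is an assumption of this sketch; the skin value also inherits the uncertainty of the approximate 20 μm stratum corneum thickness.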
The number of lysozyme molecules that enabled the protein fragment assignment in the analysed sample was 6.8742 × 10¹¹ (1 picomole). The amount of lysozyme analysed from the biochip monolayer sample (40 femtomoles) is sufficient for the detection of seven diagnostic peaks, but does not enable direct primary structure analysis of this protein.
An analysis of the human proteins in the UniProt sequence database was undertaken to determine the degree to which proteins can be identified from sequences deduced from their mass spectrum. For each protein sequence of length n in the database of 21417 proteins, all n-l sequences of length l, from 2 to 20 residues, were searched for, with isoleucine substituted for leucine, as they have identical mass. In this analysis, the first 3 residues were considered to be of unknown composition, because experimentally they could not be assigned through sequencing. The composition of the initial tripeptide can be assigned using the MS/MS capability of the 3D OrbiSIMS instrument. The number of proteins that contained a match for each of its n-l fragment sequences was counted. Supplementary Figure 14 shows the number of proteins that can be uniquely identified from the sequence from residue 4 to residue 3+l. The total number of proteins that shared at least one sequence with the protein tested is shown in Supplementary Figure 15a. The smallest number of proteins that shared a sequence with the test protein is shown in Supplementary Figure 15b. 89% of the proteins can be identified from a known N-terminal sequence of only 8 amino acids (Supplementary Figure 14).
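The N-terminal uniqueness test can be illustrated on a toy database. The sketch below is not the authors' analysis code: the function name, the three toy sequences and the choice to call a window unique when it occurs nowhere in any other protein are all assumptions made for illustration.

```python
def uniquely_identified(proteins, l, skip=3):
    """Return ids whose residues skip+1 .. skip+l occur in no other protein.

    Isoleucine is replaced by leucine first, since the two are isobaric.
    Proteins shorter than skip + l residues are ignored.
    """
    norm = {pid: seq.replace("I", "L") for pid, seq in proteins.items()}
    unique = []
    for pid, seq in norm.items():
        if len(seq) < skip + l:
            continue
        window = seq[skip:skip + l]
        shared = any(window in other
                     for oid, other in norm.items() if oid != pid)
        if not shared:
            unique.append(pid)
    return unique


# Hypothetical toy sequences, not real database entries.
toy_db = {
    "P1": "MKVLAAGIVGLCAR",
    "P2": "MKVIAAGLTTPQ",
    "P3": "MSTNPKPQRKTKRNT",
}

for length in (3, 6):
    ids = uniquely_identified(toy_db, length)
    print(length, ids, len(ids) / len(toy_db))
```

In this toy case P1 and P2 share residues 4 to 8 once isoleucine is read as leucine, so they only become distinguishable when the window is extended to six residues.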
The capability to readily identify an unknown protein from an N-terminal sequence is limited by the presence of mid-sequence ions in the spectrum; however, the described method provides information about amino acid sequences of sufficient length to enable assignment of 89% of human proteins, provided the protein spectra databases are suitably adapted or devised. The method as it is, [...]
Supplementary Figure 14 Statistical analysis of the fraction of the proteome that can be identified by N-terminal sequences. 89% of the human proteins in the UniProt sequence database (28th July 2019) can be confidently identified if the smallest fragment is a tripeptide and is followed by five amino acid residues.
Supplementary Figure 15 Analysis of human proteins in the UniProt sequence database (28th July 2019) that share 5, 8 or 11-residue long sequences with any other. Depending on the composition of the sequence found, the sequence may be diagnostic or common to many proteins. (a) 93% of the proteins share an 8-residue sequence with at least one other protein. (b) 95% of the proteins, however, contain an 8-residue sequence that is unique to the particular protein. 89% of the proteins contain a 5-residue unique sequence.

Supplementary Note 3: Automated sequence search
Conventional software for protein identification, including programmes focusing on de novo sequencing such as PEAKS, Novor, PepNovo, MSNovo, UniNovo and others 6 , is not appropriate for the data produced by the combination of GCIB and Orbitrap™ because of the different mode of fragmentation. These tools also do not allow identification of a protein from an image. To enable automatic high-throughput identification of intact proteins directly from a surface, a de novo sequencing script was developed in MATLAB (Figure 3) and the code is available online (https://github.com/guferraz/simsdenovo/). The input to the script is a peak list exported from data analysis software, here IONTOF SurfaceLab 7.1, as a text file containing peak m/z and intensity values (Figure 3b). Chemical filtering is based on matching the masses of the input peak list to the exact masses of possible peptidic fragments of each ion type (all combinations of up to 10-membered a, b, c and a-NH3 ions). Intensity filtering is done on transformed values of intensity × mass (I × m) to retain the information-rich high-mass peaks. Filtering shortened peak lists from 2450 peaks to 250 peaks.
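The intensity filtering step can be sketched as follows. This is a Python illustration rather than the published MATLAB script; the two-column whitespace-separated layout of the exported peak list, the function names and the example threshold are assumptions, and the chemical filtering step is not reproduced.

```python
def parse_peak_list(text):
    """Parse lines of 'm/z intensity' into (mz, intensity) tuples.

    Lines that do not start with two numeric fields (headers, blanks)
    are skipped.
    """
    peaks = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            peaks.append((float(parts[0]), float(parts[1])))
        except ValueError:
            continue
    return peaks


def filter_by_intensity_mass(peaks, threshold):
    """Keep peaks whose intensity x m/z product reaches the threshold."""
    return [(mz, inten) for mz, inten in peaks if mz * inten >= threshold]


# Hypothetical peak list; real exports may use a different layout.
example = "m/z\tintensity\n100.0\t500\n500.0\t400\n900.0\t200\n"
kept = filter_by_intensity_mass(parse_peak_list(example), threshold=1e5)
print(kept)
```

Weighting intensity by mass lets the weak peak at m/z 900 pass while the more intense peak at m/z 100 is removed, which is the stated purpose of the I × m transform.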
Residue identification is done by calculating a matrix of differences between all pairs of masses in the filtered peak list and checking for matches to amino acid residues within a given tolerance in ppm. All differences without a match are discarded from the matrix. As a result, each residue has a "beginning" ion (matrix rows) and an "end" ion (matrix columns), the distance between which is unique to the given residue. In Figure 3d, each residue is given a different colour. Sequences are searched by sequentially connecting the "end" ion of a given residue to the "beginning" ion of the next residue (Figure 3d inset). Sequences are elongated until no further residue can be connected, and the search is done recursively using all found residues as seed points. Sequences of 3 to 8 members are retained and ordered from longest to shortest. Protein identification is done by comparing the found and rated sequences to the UniProt database for a given taxonomy. Ranking is done iteratively, starting from the longest sequence: (i) find all the database protein sequences where the sequence matches a terminal fragment of a protein (first or last 15 members); (ii) for each protein found, count the number of times it appears; (iii) search for the next longest sequence (back to i) and continue until all the sequences have been used. The full list of proteins is ranked from the highest to the lowest number of counts.
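The residue matching, sequence chaining and ranking steps can be sketched in Python. This is an illustration under stated assumptions, not the published MATLAB script: the function names, the choice to take the ppm tolerance relative to the heavier ion, the synthetic b-type peak list and the toy database entries are all hypothetical.

```python
from itertools import combinations

# Monoisotopic residue masses (Da); "L" stands for both leucine and isoleucine.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residue_edges(mz_values, tol_ppm=5.0):
    """Map each 'beginning' peak index to its (end index, residue) matches."""
    mz = sorted(mz_values)
    edges = {}
    for i, j in combinations(range(len(mz)), 2):
        diff = mz[j] - mz[i]
        tol = tol_ppm * 1e-6 * mz[j]  # assumption: ppm relative to heavier ion
        for residue, mass in RESIDUE_MASS.items():
            if abs(diff - mass) <= tol:
                edges.setdefault(i, []).append((j, residue))
    return edges


def chain_sequences(edges, min_len=3, max_len=8):
    """Elongate from every seed until no residue connects; longest first."""
    found = set()

    def extend(node, seq):
        successors = edges.get(node, [])
        if not successors or len(seq) == max_len:
            if len(seq) >= min_len:
                found.add(seq)
            return
        for nxt, residue in successors:
            extend(nxt, seq + residue)

    for seed in edges:
        extend(seed, "")
    return sorted(found, key=lambda s: (-len(s), s))


def rank_proteins(sequences, database, terminal=15):
    """Count, per protein, how many sequences match its first or last residues."""
    counts = {}
    for seq in sequences:
        for name, protein in database.items():
            p = protein.replace("I", "L")
            if seq in p[:terminal] or seq in p[-terminal:]:
                counts[name] = counts.get(name, 0) + 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Synthetic b-type ions (sum of residues plus a proton) for KVFGR.
PROTON = 1.007276
peaks, total = [], PROTON
for aa in "KVFGR":
    total += RESIDUE_MASS[aa]
    peaks.append(total)

sequences = chain_sequences(residue_edges(peaks))
toy_db = {"toy_A": "KVFGRCEAAAAAAAAAAAAAAA", "toy_B": "DTHKSELAAAAAAAAAAAAAAA"}
print(sequences, rank_proteins(sequences, toy_db))
```

The first residue of the synthetic peptide is not recovered, because a mass difference needs a lighter ion to start from; this mirrors the observation above that the initial residues cannot be assigned by sequencing alone. Isobaric combinations such as two glycines and one asparagine would produce ambiguous matches in real data, which this sketch does not resolve.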
The script was tested against the 16 analysed proteins. The pre-processing filtering parameters have to be set individually for each sample because of differing spectral quality, and a diagnostic tool was developed for this purpose (Supplementary Figure 16). For each protein, the peak lists generated by applying 13 chemical filters to the original peak list were subsequently filtered by I × m, with the threshold varied in equal increments in logarithmic space between 10⁵ and 10⁸ (Supplementary Figure 16).
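Thresholds spaced in equal increments in logarithmic space can be generated as in the sketch below. The number of steps is not stated in the text, so the value of 7 used here is an assumption, as is the function name.

```python
import math


def log_spaced_thresholds(low=1e5, high=1e8, steps=7):
    """Return `steps` thresholds equally spaced in log10 between low and high."""
    lo, hi = math.log10(low), math.log10(high)
    return [10 ** (lo + (hi - lo) * k / (steps - 1)) for k in range(steps)]


thresholds = log_spaced_thresholds()
for t in thresholds:
    print(f"{t:.3g}")
```

Each threshold would then be applied to every chemically filtered peak list, and the resulting rank of the known protein compared across the grid.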
Supplementary Figure 16: Diagnostics of the optimal chemical and intensity filters for identification of proteins from unknown spectra. Chemically filtered peak lists for each protein were subsequently filtered by intensity. The rank shows the relative position of the detected protein against all proteins in the database. Detailed rank for each protein is listed in Supplementary Table 9.

Supplementary Table 29
Peak list exported from SurfaceLab spectrum of horse cytochrome C, consisting of ions detected in the spectrum and assigned as internal fragment of the amino acid sequence NKGI. The m/z values represent the experimentally observed center mass of each peak. The deviation (dev.) represents the parts per million (ppm) accuracy of the assignment. The colour corresponds to the presence of the observed sequence in horse cytochrome C presented in Supplementary Figure 18.
Supplementary Figure 19 Horse lysozyme (a) cartoon exported from PDB entry 1AZF 9 and (b) amino acid sequence exported from the UniProt database. The highlighted colours correspond to assigned segments of the amino acid sequence, presented in Supplementary Tables 31-36.

Supplementary Table 31
Peak list exported from SurfaceLab spectrum of chicken egg lysozyme, consisting of ions detected in the spectrum and assigned as sodium adducts of N-terminal sequence KVFGRCE. The sequence is observed as a, b, c and a-NH3 ions. The m/z values represent the experimentally observed center mass of each peak. The deviation (dev.) represents the parts per million (ppm) accuracy of the assignment. The colour corresponds to the presence of the observed sequence in chicken egg lysozyme presented in Supplementary Figure 19.

Sequences specific to bovine serum albumin (BSA):
Supplementary Table 97 Peak list exported from SurfaceLab spectrum of bovine serum albumin, consisting of ions detected in the spectrum and assigned as sodium adducts of N-terminal sequence DTHK. The sequence is observed as a, b and a-NH3 ions. The m/z values represent the experimentally observed center mass of each peak. The deviation (dev.) represents the parts per million (ppm) accuracy of the assignment. The colour corresponds to the presence of the observed sequence in bovine serum albumin presented in Supplementary Figure 26.

Sequences present in both BSA and HSA:
Supplementary Table 104 Peak list exported from SurfaceLab spectrum of bovine serum albumin, consisting of ions detected in the spectrum and assigned as fragments of N-terminal sequence DTHKSEI. The m/z values represent the experimentally observed center mass of each peak. The deviation (dev.) represents the parts per million (ppm) accuracy of the assignment. The colour corresponds to the presence of the observed sequence in human serum albumin presented in Supplementary Figure 26.
Supplementary Table 107 Peak list exported from SurfaceLab spectrum of bovine serum albumin, consisting of ions detected in the spectrum and assigned as fragments of N-terminal sequence DTHKSEI. The sequence is observed as a-NH3 ions. The m/z values represent the experimentally observed center mass of each peak. The deviation (dev.) represents the parts per million (ppm) accuracy of the assignment. The colour corresponds to the presence of the observed sequence in human serum albumin presented in Supplementary Figure 26.