A multi-protease, multi-dissociation, bottom-up-to-top-down proteomic view of the Loxosceles intermedia venom

Venoms are a rich source for the discovery of molecules with biotechnological applications, but their analysis is challenging even for state-of-the-art proteomics. Here we report on a large-scale proteomic assessment of the venom of Loxosceles intermedia, the so-called brown spider. Venom was extracted from 200 spiders and fractionated into two aliquots relative to a 10 kDa cutoff mass. Each of these was further fractionated and digested with trypsin (4 h), trypsin (18 h), pepsin (18 h), and chymotrypsin (18 h), then analyzed by MudPIT on an LTQ-Orbitrap XL ETD mass spectrometer fragmenting precursors by CID, HCD, and ETD. Aliquots of undigested samples were also analyzed. Our experimental design allowed us to apply spectral networks, thus enabling us to obtain meta-contig assemblies, and consequently de novo sequencing of practically complete proteins, culminating in a deep proteome assessment of the venom. Data are available via ProteomeXchange, with identifier PXD005523.

L. intermedia, the total protein content of this venom has remained unclear. A previous study from our group applied two-dimensional immunoblots and zymograms on the venom of L. intermedia, L. laeta, and L. gaucho, and revealed several spots with differential volume containing proteins having gelatinolytic activity corresponding to astacin-like proteases 33 . These results corroborate that venoms from these species present a broad astacin-like family with many isoforms 22,33,34 .
The lack of genomic data from this arachnid prevents employing the PSM approach in full, so most of the heavy lifting must be accomplished through de novo sequencing. Mainstream de novo sequencing, however, cannot efficiently handle unanticipated post-translational modifications and is far more prone to generating sequencing errors. This is because various molecules fail to provide enough mass spectral peaks during fragmentation to enable the sequencing of full peptides. To overcome these limitations, our dataset was acquired with multiple dissociation strategies applied to the same precursor, namely collision-induced dissociation (CID), higher-energy collisional dissociation (HCD), and electron-transfer dissociation (ETD), thereby enabling the use of state-of-the-art de novo sequencing algorithms. These capitalize on complementary dissociation information and thus achieve unprecedented sequencing accuracy 35,36 . The use of different proteolytic enzymes on the venom aliquots unlocks the application of another very powerful paradigm, that of spectral networks 37,38 . These 'specnets' align spectra against one another, ultimately allowing the detection of unanticipated post-translational modifications. Moreover, they can assemble consensus mass spectra from overlapping peptides yielded by different proteolytic digests. A consensus spectrum thus obtained presents a better signal-to-noise ratio and allows for the de novo sequencing of amino-acid stretches far longer than those handled by the conventional approach. Once high-confidence de novo data are available, it becomes possible to employ tools, such as PepExplorer 39 or Meta-SPS 37 , that apply pattern recognition approaches to the mapping of de novo sequencing data against sequences from homologous organisms, thereby facilitating biological interpretation.
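To illustrate why digests with different proteases yield overlapping peptides, the following minimal Python sketch performs an in silico digestion. The cleavage rules are deliberately simplified assumptions (pepsin in particular is far less specific in practice), missed cleavages are ignored, and the example sequence is hypothetical rather than taken from the venom.

```python
import re

# Simplified cleavage rules, assumed here for illustration only.
# Each pattern matches the zero-width position right after a cleavage residue.
CLEAVAGE_RULES = {
    "trypsin": r"(?<=[KR])(?!P)",       # after K or R, not before P
    "chymotrypsin": r"(?<=[FWYL])(?!P)",  # after F, W, Y, or L, not before P
    "pepsin": r"(?<=[FL])",             # after F or L (a coarse approximation)
}


def digest(sequence, enzyme):
    """Split a sequence at every site matching the enzyme's rule (no missed cleavages)."""
    # re.split accepts zero-width matches in Python 3.7 and later.
    return [p for p in re.split(CLEAVAGE_RULES[enzyme], sequence) if p]


# Hypothetical sequence used only to show the behaviour.
EXAMPLE = "MKWLFRAGYKPLSR"

if __name__ == "__main__":
    for enzyme in CLEAVAGE_RULES:
        print(enzyme, digest(EXAMPLE, enzyme))
```

Because each enzyme cuts at different residues, a peptide from one digest typically spans a cleavage site of another, which is the overlap that spectral networks exploit.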
By themselves, the meta-contig assemblies provided by spectral networks are not enough for one to conclude whether a biomolecule obtained 100% coverage. To pave the way in this direction, top-down proteomic data in combination with MS3 (i.e., product ion(s) selected from an MS/MS spectrum further fragmented and producing another tandem mass spectrum) and ETD were also acquired for a partition of the venom molecules into two sets (<~10 kDa and >~10 kDa). The top-down strategy consists of injecting intact proteins into the mass spectrometer, thus doing away with the inference limitations of the peptide-centric approach 40 . This provides complementary information to that of the networks and helps establish what is required to obtain full coverage. We anticipate that these data will be fundamental in the development of next-generation algorithms capable of bridging the gap between bottom-up, middle-down, and top-down proteomics.
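As a small illustration of what sequence coverage means in this context, the sketch below computes the fraction of a protein's residues covered by a set of identified peptides. It assumes exact substring matches and a known reference sequence, neither of which holds for a species without a sequenced genome; the sequence and peptides are hypothetical.

```python
def sequence_coverage(protein, peptides):
    """Return the fraction of residues in `protein` covered by at least one peptide."""
    covered = [False] * len(protein)
    for peptide in peptides:
        start = protein.find(peptide)
        while start != -1:
            # Mark every residue spanned by this occurrence of the peptide.
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein)


if __name__ == "__main__":
    protein = "MKWLFRAGYKPLSR"  # hypothetical sequence
    tryptic_only = ["MK", "WLFR"]
    with_second_enzyme = tryptic_only + ["RAGY", "KPL", "SR"]
    print(sequence_coverage(protein, tryptic_only))
    print(sequence_coverage(protein, with_second_enzyme))
```

Even when such a calculation reports complete coverage, it says nothing about whether the covered stretches belong to one intact proteoform, which is the gap the top-down data are meant to address.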
Here, we present the first multi-protease, multi-dissociation, bottom-up-to-top-down proteomic dataset of the venom of L. intermedia, the 'urban' spider species commonly found in the city of Curitiba, Brazil 41 , along with an analysis using state-of-the-art tools. The approach stems from the motivation that multiple enzyme digestion increases protein coverage 42 , besides relying on different activation and acquisition methods.

Sample preparation
Adult L. intermedia specimens (both male and female) were collected in the wild in accordance with the Brazilian Federal System for Authorization and Information on Biodiversity (SISBIO-ICMBIO, license number 29801-1). Venom from 200 spiders was extracted through the electrostimulation method 43 and immediately diluted in 0.4 M ammonium bicarbonate/8 M urea buffer. Protein concentration was determined through the Coomassie blue method, using bovine serum albumin (BSA) as the standard 44 . First, the venom was separated into two fractions using an ultra-filter unit (MW cutoff 10 kDa) (Millipore), one fraction containing venom proteins above ~10 kDa (400 μg) and the other containing venom proteins and peptides below ~10 kDa (90 μg). All procedures described next were performed equally for each fraction, after further dividing it into four aliquots, each of which was reduced with dithiothreitol (DTT) at a final concentration of 25 mM for 3 h at room temperature. Afterwards, the samples were alkylated with iodoacetamide (IAA) at a final concentration of 80 mM for 15 min at room temperature in the dark. Each aliquot was digested with one of the following enzymes: trypsin (Trypsin Gold, Mass Spectrometry Grade, Promega Corporation, Madison, WI, USA, cat. No. V5280), chymotrypsin (Promega, cat. No. V1062), or pepsin (Promega, cat. No. V1959), at an enzyme-to-substrate ratio of 1:50 (E:S). We note that an additional aliquot was stored and not digested. Three aliquots were incubated individually with each enzyme for 18 h, at 25°C for chymotrypsin and 37°C for trypsin and pepsin. The other aliquot was incubated for only 4 h with trypsin at 37°C. Each digested fraction was divided into three aliquots and desalted with ultra-micro C-18 spin columns according to the manufacturer's instructions (Harvard Apparatus).
One of these three aliquots was stored for future use; another had its peptides desalted and directly submitted to reverse-phase chromatography coupled online with an Orbitrap XL mass spectrometer. The third aliquot of the desalted peptides was eluted with 70% acetonitrile (ACN) and 0.1% formic acid, dried in a speed vacuum concentrator, and resuspended in buffer C (i.e., 10 mM K2HPO4, 25% ACN, pH = 3.0). Afterwards, the sample was passed through a micro strong cation exchange (SCX) spin column according to the manufacturer's instructions (Harvard Apparatus). Briefly, the column was equilibrated with buffer C, centrifuged for 1 min at 100 × g, and the sample was eluted from the SCX spin column with increasing concentrations of KCl, i.e., 100, 170, 290, and 400 mM. Finally, each fraction was desalted once more with ultra-micro C-18 spin columns according to the manufacturer's instructions (Harvard Apparatus). All columns were then washed ten times with 0.1% formic acid and the peptides were eluted with buffer B (i.e., 70% acetonitrile, 0.1% formic acid) to proceed to the next step.
www.nature.com/sdata/ SCIENTIFIC DATA | 4:170090 | DOI: 10.1038/sdata.2017.90

Mass spectrometry analysis
Each fraction of peptides, including the non-fractionated as well as those from the SCX fractionation, was previously desalted and subjected to an LC-MS/MS analysis on a nano-LC 1D plus System (Eksigent, Dublin, CA), an ultra-high performance liquid chromatography (UHPLC) system coupled with an LTQ-Orbitrap XL ETD (Thermo, San Jose, CA) mass spectrometer, at the Mass Spectrometry Facility RPT02H of the Carlos Chagas Institute (Fiocruz, Brazil). In these analyses, the peptide mixtures were loaded onto a column (75 μm i.d., 15 cm long), packed in-house with a 3.2 μm ReproSil-Pur C18-AQ resin (Dr Maisch), at a flow of 500 nl/min and subsequently eluted at a flow of 250 nl/min from 5 to 40% ACN in 0.5% formic acid in a 120 min gradient. The mass spectrometer was set to data-dependent mode to automatically switch between MS and MS/MS acquisition. Full-scan MS spectra (m/z 350-1,800) were acquired in the Orbitrap analyzer with resolution R = 60,000 at m/z 400 (after accumulation to a target value of 1,000,000 in the linear trap) using survey mode. The three most intense ions were sequentially isolated, and each precursor was fragmented using CID, HCD, and ETD. Target ions previously selected for MS/MS were dynamically excluded for 60 s. The total cycle time was approximately 5 s. The general mass spectrometric conditions were: spray voltage, 2.4 kV; no sheath or auxiliary gas flow; ion transfer tube temperature, 100°C; collision gas pressure, 1.
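The data-dependent logic described above (top three precursors per survey scan, each fragmented by CID, HCD, and ETD, with 60 s dynamic exclusion) can be sketched as follows. This is an illustrative simplification and not the instrument's control software: the function name, the peak representation, and the exclusion tolerance `tol` are assumptions, since the exclusion mass window is not stated in the text.

```python
FRAGMENTATION_MODES = ("CID", "HCD", "ETD")


def select_precursors(survey_peaks, scan_time_s, exclusion,
                      top_n=3, exclusion_s=60.0, tol=0.01):
    """Pick the top_n most intense non-excluded peaks and schedule MS/MS events.

    survey_peaks: list of (mz, intensity) tuples from one full MS scan.
    exclusion: dict mapping excluded m/z to expiry time in seconds (updated in place).
    tol: assumed m/z tolerance for matching a peak against the exclusion list.
    """
    # Drop exclusion entries whose time window has elapsed.
    for mz in [m for m, expiry in exclusion.items() if expiry <= scan_time_s]:
        del exclusion[mz]

    candidates = [
        (mz, intensity) for mz, intensity in survey_peaks
        if not any(abs(mz - excluded) <= tol for excluded in exclusion)
    ]
    chosen = sorted(candidates, key=lambda peak: peak[1], reverse=True)[:top_n]

    events = []
    for mz, _ in chosen:
        exclusion[mz] = scan_time_s + exclusion_s
        # The same precursor is fragmented once per dissociation mode.
        events.extend((mz, mode) for mode in FRAGMENTATION_MODES)
    return events


if __name__ == "__main__":
    peaks = [(500.0, 10.0), (600.0, 50.0), (700.0, 30.0), (800.0, 20.0)]
    excluded = {}
    for t in (0.0, 5.0, 61.0):  # survey scans at three time points
        print(t, select_precursors(peaks, t, excluded))
```

Under this scheme each cycle produces up to nine MS/MS spectra, grouped as three CID/HCD/ETD triples, which is what makes the per-precursor spectrum merging in the bioinformatics analysis possible.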

Bioinformatics analysis
The de novo sequencing approach employed in this work utilized multiple MS/MS spectra from overlapping peptides generated by multiple proteases, with each precursor analyzed as a triple of CID, HCD, and ETD spectra. Each spectrum was then converted into a prefix residue mass (PRM) spectrum. In this conversion, MS/MS peak masses were converted into putative cumulative precursor fragment masses, with intensity scores determined from likelihood models specific to each fragmentation mode. Triples of PRM spectra from the same precursor were then merged into a single PRM spectrum per precursor by adding scores for matching peak masses. Spectral-network algorithms, implemented in the ProteoSAFe web platform that is freely accessible at http://proteomics.ucsd.edu/ProteoSAFe/, were then used to align merged PRM spectra from peptides with overlapping sequences. Moreover, A-Bruijn algorithms were used to integrate these alignments into assembled contigs.
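The two steps just described, conversion to PRM spectra and merging of spectrum triples, can be sketched in a few lines of Python. This is a heavily simplified illustration and not the ProteoSAFe implementation: it assumes singly charged b- and y-type fragments only, carries the input score through unchanged instead of applying fragmentation-specific likelihood models, and uses an assumed mass tolerance `tol`.

```python
PROTON = 1.007276  # mass of a proton, Da


def to_prm(peaks, parent_mass):
    """Convert (m/z, score) peaks into candidate prefix residue masses.

    Every peak is read both as a b-type ion and as a y-type ion, because the
    ion type is unknown beforehand. parent_mass is the neutral peptide mass.
    """
    prm = []
    for mz, score in peaks:
        prm.append((mz - PROTON, score))                # interpreted as a b ion
        prm.append((parent_mass - mz + PROTON, score))  # interpreted as a y ion
    return sorted(prm)


def merge_prm(spectra, tol=0.02):
    """Merge PRM spectra of one precursor, adding scores of matching masses."""
    pooled = sorted(peak for spectrum in spectra for peak in spectrum)
    merged = []
    for mass, score in pooled:
        if merged and mass - merged[-1][0] <= tol:
            last_mass, last_score = merged[-1]
            merged[-1] = (last_mass, last_score + score)
        else:
            merged.append((mass, score))
    return merged


if __name__ == "__main__":
    # Hypothetical PRM spectra standing in for a CID/HCD/ETD triple.
    cid = [(57.02, 1.0), (128.06, 2.0)]
    hcd = [(57.03, 0.5)]
    etd = [(128.06, 1.0), (200.0, 3.0)]
    print(merge_prm([cid, hcd, etd]))
```

Prefix masses supported by more than one dissociation mode accumulate a higher score, which is the intuition behind merging triples before alignment.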
Each contig was then used to construct a consensus contig spectrum, or meta-contig, capitalizing on the corroborating evidence from all of its assembled spectra to yield a high-quality consensus de novo sequence 36 . Subsequently, the Meta-SPS algorithm was used to align the meta-contigs against a FASTA sequence database 37 . This database contained all Loxosceles sequences from UniProt, all from the transcriptome of the L. intermedia venom gland 20 , and an internal database with common mass spectrometry contaminants and proteases.
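For orientation, the sketch below shows the simplest possible form of mapping a de novo sequence against a FASTA database: an exact substring search that treats the isobaric residues I and L as equivalent. It is only a stand-in for illustration; Meta-SPS performs homology-tolerant alignment of meta-contig spectra, which this sketch does not attempt. The function names and FASTA records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of header -> sequence."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif line and header is not None:
            records[header] += line
    return records


def normalize(sequence):
    """Collapse isoleucine into leucine, which mass alone cannot distinguish."""
    return sequence.replace("I", "L")


def find_matches(de_novo, database):
    """Return (header, position) for each database entry containing the query."""
    query = normalize(de_novo)
    hits = []
    for header, sequence in database.items():
        position = normalize(sequence).find(query)
        if position != -1:
            hits.append((header, position))
    return hits


if __name__ == "__main__":
    fasta = ">protA hypothetical\nMKWLFRAGYK\nPLSR\n>protB hypothetical\nAAAIKGG\n"
    database = parse_fasta(fasta)
    print(find_matches("WIFR", database))
```

An exact search of this kind fails as soon as the venom protein differs from the database entry by a single substitution, which is why tolerant alignment against homologous sequences is needed for an organism without a sequenced genome.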
A summary of this methodology is found in Fig. 1.

Data Records
Our bioinformatics analysis disclosed a list of 190 proteins (Table 1). As far as we know, this is the most comprehensive proteomic profiling of the L. intermedia venom. All mass spectrometry data are available from both the ProteomeXchange Consortium via the PRIDE 45 partner repository, with dataset identifier PXD005523 (Data Citation 1), and our servers (http://proteomics.fiocruz.br/pcarvalho/lintermedia/venom/). A full list of the proteins, meta-contigs, and homologous sequences is made available in Table 1. All Meta-SPS results for >~10 kDa and <~10 kDa, together with the parameter files used for running the software, are available as separate material (MetaSPS_Results.xlsx, Data Citation 2). The results are presented in six tabs, viz., for >~10 kDa grouped by contig, >~10 kDa grouped by spectrum, >~10 kDa parameter file, <~10 kDa grouped by contig, <~10 kDa grouped by spectrum, and <~10 kDa parameter file.

Technical Validation
The lack of any previous comprehensive proteomic analysis of the Loxosceles venom demonstrates that studying this venom in detail has been a challenge, one that stems from the organism being highly noncanonical and from the fact that protein sequences for it have remained scarce in databases. The present work circumvented these obstacles by using a combination of shotgun proteomic experiments and different tools to generate and analyze large proteomic datasets and de novo sequencing results.
Our results revealed 190 protein identifications, including all classes of toxins described in previous transcriptome analyses 19,20 (Table 2 (available online only)). Our approach identified both high- and low-abundance toxins of the L. intermedia venom, as well as homolog sequences from distinct Loxosceles species (astacin-like proteases, PLDs, peptides, TCTPs, hyaluronidases, allergens, serine proteases, serine protease inhibitors, and housekeeping proteins) (Table 2 (available online only)). These data reinforce the holocrine nature of the Loxosceles venom gland 23 and demonstrate that its venom is composed of toxins and housekeeping proteins originating from epithelial-cell content, such as the angiotensin converting enzyme, the 60S ribosomal protein, the Na-Pi co-transporter, and the myosin heavy chain (Table 2 (available online only)). Our results, therefore, validate the method used for analyzing the proteome of an organism with a non-sequenced genome.
Taken together, the identified toxins in the L. intermedia venom include representatives from all toxin groups, even if in low abundances (as in the case of, e.g., hyaluronidases and serine proteases). We also find it noteworthy that we obtained significant coverage of the three major families present in the venom, viz., PLDs, astacin-like metalloproteases, and ICK peptides. These families are of great importance for studies of the brown-spider envenomation features and of biotechnological and medical applications.
Many of the aligned contigs mapped to distinct PLD isoforms from a variety of Loxosceles species. In fact, these toxins are the most studied and well-characterized components of the Loxosceles venom 5,20,26,31,46-48 . PLDs are able to reproduce the deleterious effects observed in loxoscelism and represent a great target for drug discovery against brown-spider envenomation 2,5 .
As for the astacin-like metalloproteases identified, we note that astacins were first described as an animal-venom component in 2007 (ref. 28) and only later recognized as a family of toxins present in the Loxosceles venom 33 . These toxins present proteolytic activity on distinct extracellular matrix proteins and are related to the hemostatic effects in loxoscelism 43,49 .
ICK peptides, the major components of the L. intermedia venom-gland transcriptome (54.9% of the expressed sequence tags), were identified with correspondence to all four different ICK peptides described for L. intermedia (LiTx1, LiTx2, LiTx3, and LiTx4) 50,51 . These ICK peptides, also called knottins, are characterized by the neurotoxic properties they exhibit on ion channels and receptors expressed in the nervous systems of insects and mammals 52 . The high expression of LiTx transcripts, which correlates with the proteomic results found herein, is consistent with the venom's effects of paralyzing and killing both prey and predators 1,20,51 .
53 ), as well as contaminants and the proteases used during sample preparation (trypsin, chymotrypsin, and pepsin).