Complex precursor structures of cytolytic cupiennins identified in spider venom gland transcriptomes

Analysis of spider venom gland transcriptomes focuses on the identification of possible neurotoxins, proteins and enzymes. Here, the first comprehensive transcriptome analysis of cupiennins, small linear cationic peptides, also known as cytolytic or antimicrobial peptides, is reported from the venom gland transcriptome of Cupiennius salei by 454 and Illumina HiSeq 3000 sequencing. Four transcript families with complex precursor structures are responsible for the expression of 179 linear peptides. Within the transcript families, after an anionic propeptide, cationic linear peptides are separated by anionic linkers, which are transcript family specific. The C-terminus of the transcript families is characterized by a linear peptide or truncated linkers with unknown function. A newly identified posttranslational processing mechanism explains the presence of the two-chain CsTx-16 family in the venom. The high diversity of linear peptides in the venom of a spider and this unique synthesis process are at least genus specific, as verified with Cupiennius getazi.


Results
Transcriptome analysis of Cupiennius salei venom glands reveals a very high number of small, mostly Cys-free cationic peptides, which is unique among spiders 20 . About 24% of all transcripts of C. salei correspond to such linear peptides (LPs), which can be grouped into nine different cupiennin families (Cu 1-9), one CsTx-16 family, and four hypothetical peptide families (h_pep_1-4) (Fig. 1).

Transcript compositions.
At first glance, four main transcript families (A-D) encode the different LPs in more or less complex precursor structures enclosing more than ten different LPs, all separated by linkers and ending in two different C-terminal structure families. Minor differences, however, allow two variants to be distinguished for each of the A, C, and D families, and five variants for the B family.
Figure caption (fragment): [...] 20 . Below this, the percentage distribution (reads) of all identified LPs is presented. The distribution and quantity of related peptides within the different peptide families are also illustrated in pie charts, in identical colors, corresponding to their percentage presence in the transcriptome. CsTx-16 is counted as two-chain peptide.
Signal peptides and propeptides. In general, transcripts encoding LPs are composed of short signal peptides followed by different propeptides and small LPs, which are separated by short stretches of anionic amino acid residues. Such elements are defined as spacers or linkers and are characterized N-terminally by an inverted processing quadruplet motif (iPQM) and C-terminally by a processing quadruplet motif (PQM). Signal peptides are composed of 21 amino acid residues (transcripts A, B, C) or 20 amino acid residues (transcript D). The lengths of the anionic propeptides (theoretical pIs 3.85-4.33) vary between 25 and 42-49 amino acid residues, and more than one third of the residues are Glu and Asp. The propeptides exhibit α-helical structures (28-70%) and, depending on their length, three to four PQMs and two to three iPQMs. Additionally, the C-terminal parts of the propeptides are used one to four times as linkers between the peptides within the corresponding transcript (transcripts A, B, C). Interestingly, at the protein level, transcripts B and C1 exhibit identical signal peptides and differ from the signal peptide of transcript C2 in only one amino acid residue and from that of transcript A in three amino acid residues.
Propeptides of all four transcript families exhibit major differences in amino acid composition and length (S1A-E Fig).
Linkers. All LPs from C. salei are separated by linkers within the corresponding transcript. For data analysis, a linker is defined as a short amino acid sequence starting and ending with Arg and exhibiting both an iPQM and a PQM. However, posttranslational processing by proteases will always occur after Arg, and the remaining C-terminal Arg of the foregoing peptide and of the linker will be removed by a so far unknown Arg-C proteinase 17 . Linkers are mainly transcript family specific, even though the linker RTENEIDEEDER belongs to the C-terminal part of propeptides A2 as well as B2 and links different peptides: Cu 1a with Cu 4c (transcript A) as well as Cu 8 with Cu 3d, Cu 3d with Cu 4a, Cu 4e with Cu 4a, and Cu 4f with Cu 1a (transcript B). All linkers within transcript A (n = 18) and B (n = 10) are composed of 12 amino acid residues and exhibit at least one Glu in the iPQM (iPQM1 variant) and in the PQM. Characteristic features are molecular masses around 1529 Da and a mean pI of 4.08, which is caused by five to seven Glu/Asp residues per linker.
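The linker definition and the reported properties can be checked on the one linker sequence quoted in the text. The following is a minimal sketch, not the authors' analysis pipeline: the helper name `linker_summary` and the table of average residue masses (standard reference values, not taken from the paper) are assumptions introduced here for illustration.

```python
# Average residue masses (Da) for the residues occurring in the linker.
# Standard reference values; an assumption of this sketch, not paper data.
RESIDUE_MASS = {
    "R": 156.1875, "T": 101.1051, "E": 129.1155,
    "N": 114.1038, "I": 113.1594, "D": 115.0886,
}
WATER = 18.0153  # one water per linear peptide chain


def linker_summary(seq):
    """Return length, acidic-residue count, Arg boundaries and average mass."""
    return {
        "length": len(seq),
        "acidic": sum(seq.count(aa) for aa in "DE"),
        "arg_bounded": seq.startswith("R") and seq.endswith("R"),
        "mass": round(sum(RESIDUE_MASS[aa] for aa in seq) + WATER, 1),
    }


# Linker shared by the C-terminal parts of propeptides A2 and B2.
summary = linker_summary("RTENEIDEEDER")
print(summary)
```

For this linker the sketch gives 12 residues, seven Glu/Asp residues and an average mass of roughly 1534.5 Da, which is in line with the 12-residue length, the five to seven acidic residues and the masses around 1529 Da reported for the linkers of transcripts A and B.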
Nearly half of the linkers of transcripts C1 and C2 are composed of 11 amino acid residues, and their N-terminal iPQM possesses no Glu residue but a Ser residue (iPQM2 variant). Such linkers are identified only before and between CsTx-16 single peptide chains (n = 3). Their mean molecular mass is 1284 Da, and they exhibit a higher pI (4.44) owing to fewer Glu/Asp residues in their sequences. Remarkably, in transcript C2 one linker composed of 13 amino acid residues was identified and also confirmed by NGS. All linkers of transcript D peptides likewise exhibit 13 amino acid residues and show only few structural differences. Six main linkers connect the most diverse members of the LP family Cu 6. Comparable to some transcript C linkers, the N-terminal Glu is replaced by a Ser and a Thr residue.
In total, we identified a minimum of 43 different linker sequences by 454-seq, responsible for the arrangement of different LPs in different transcripts (Table 1). Besides these linkers, C-terminal truncated linkers (t-linkers) were identified in transcripts A and C, which exhibit iPQM1 and iPQM2 structures and no C-terminal PQM, and are followed by a stop signal, two Lys residues and again a stop signal (S1A-E Fig).
Transcript families. Four transcript families encode all LPs. Transcript A and B families appear similar in their composition, even though different members of the Cu families are encoded. In particular, the succession and the presence of different or identical peptides (e.g., Cu 2a, transcript A), separated by different or identical linkers, are variable. Additionally, several short cuts between different peptides have been identified in different transcripts (in detail: S1A-E Fig). In transcript A, members of four different Cu families were identified: Cu 1a, c, d-f, Cu 2a-f, Cu 4c, d, and Cu 5a, b, e. Likewise, in transcript B, six different Cu families were identified: Cu 1a, Cu 3d, Cu 4a, b, e, f, Cu 5b, c, d, Cu 7a, b, and Cu 8a. Interestingly, only Cu 1a and Cu 5b occur in both transcript families, but apart from this, different variants of the same Cu family exist (Fig. 3).
At first glance, the C-terminal transcript structures of the A and B families are not so clear, but overlapping linker/peptide/linker/stop codon analysis provides evidence that transcript A1 ends C-terminally after Cu 1a with a t-linker. The t-linker starts with an iPQM1 but lacks a PQM, and after 12 amino acid residues, comparable to the length of all linkers in the transcript A1 and B families, a stop signal appears, followed by two Lys residues and again a stop signal. The transcript B families, however, exhibit C-terminally after Cu 1a a specific linker before the terminal peptide Cu 7a, b, followed by a stop signal. This linker is characterized by an iPQM2-PQM structure and is also composed of 12 amino acid residues. This interpretation is supported by the fact that the Cu 1a and Cu 1f peptides of transcript A1 exhibit a silent mutation in the third codon position of Tyr 28 (TAC). This mutation is additionally present in Cu 1a and Cu 1f which are followed by a t-linker. In contrast, Cu 1a identified C-terminally in family B exhibits TAT as the codon for Tyr 28 (S1A and S1B Fig, S2A Fig).
Comparably, Cu 5b peptides are identical in both transcript families only at the protein level. Within transcript A, four different variants have been identified, and a fifth variant occurs in the transcript B family. Seven point mutations in the third codon position (transitions as well as transversions) account for this variation (S1A and S1B Fig, S2B Fig).
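The third-position differences described for Tyr 28 (TAC versus TAT) and for the Cu 5b variants can be classified mechanically. The sketch below is only an illustration under stated assumptions: it uses the standard genetic code, and the function name `classify_third_position` is hypothetical and does not come from the paper.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
PURINES = {"A", "G"}


def classify_third_position(codon_a, codon_b):
    """Describe a difference between two codons that is restricted to position 3."""
    if codon_a[:2] != codon_b[:2] or codon_a[2] == codon_b[2]:
        raise ValueError("codons must differ in the third position only")
    # Purine<->purine or pyrimidine<->pyrimidine is a transition.
    same_class = (codon_a[2] in PURINES) == (codon_b[2] in PURINES)
    return {
        "silent": CODON_TABLE[codon_a] == CODON_TABLE[codon_b],
        "type": "transition" if same_class else "transversion",
    }


# Tyr 28 codons reported for Cu 1a in transcript A1 (TAC) and family B (TAT).
print(classify_third_position("TAC", "TAT"))
```

Both codons encode Tyr, so the exchange is silent, and because C and T are both pyrimidines it counts as a transition.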
Transcript B and C1 families share an identical signal peptide at the protein level. However, at the nucleotide level only the signal peptides of the transcript B subfamilies are identical, and they differ from the signal peptide of C1 in three point mutations.
Transcript C1 and C2 families are unique insofar as they contain LPs that exhibit, as a specific feature, one Cys residue in the middle of the sequence (Fig. 2). Taking previous proteomic results into account 15 , always two succeeding peptides are connected by a specific linker (iPQM2-PQM structure, 11 amino acid residues), and posttranslationally processed to a family of two-chain peptides, CsTx-16, by the formation of one disulfide bridge. In the transcript C1 family, after Cu 3a the CsTx-16 chains a1, 2, 3 are followed by the CsTx-16 chains b1, 2, and 3. Likewise, the CsTx-16 chains c1, 2 are followed by the CsTx-16 chains d1, 2. A short variant is given in transcript C2, where after a short propeptide the CsTx-16 chains e and f appear, followed by a previously unidentified peptide family, Cu 9. The C-terminal structure of the transcript families C1 and C2 is comparable to the structure identified in transcript A. After the chains CsTx-16 d1, 2 and Cu 9a, b, t-linkers (iPQM2-PQM, 16 amino acid residues) terminate the transcripts with a stop signal, followed by one or two Lys residues and again a stop signal (Table 1, S1C and S1D Fig).
Taking the C-terminal structure of transcript B into account, the transcript may hypothetically also be terminated by a further peptide group and a stop codon. Such a structure was identified in a very low number of reads for transcripts C1 and C2; it is caused by an indel mutation (insertion of one adenine) in the t-linker after RSFDSN, resulting in a frame shift. As a consequence, the t-linker ends after 21 amino acid residues with a PQM, followed by the hypothetical peptides 1a, b, c, and a stop codon (S1C and S1D Fig, S3 Fig).
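How the insertion of a single adenine can remove a stop codon from the reading frame and extend the encoded sequence can be shown on a toy example. The sketch below is purely illustrative: the DNA string is a made-up sequence, not a cupiennin transcript, and `translate` is a hypothetical helper based on the standard genetic code.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate from the first base up to a stop codon or the last full codon."""
    peptide = []
    for pos in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[pos:pos + 3]]
        if residue == "*":
            break
        peptide.append(residue)
    return "".join(peptide)


# Hypothetical toy sequence (assumption), ending in the stop codon TAA.
original = "ATGGCTAAAGGTTAA"
# Insert one adenine after the ninth base, shifting the downstream frame.
shifted = original[:9] + "A" + original[9:]

print(translate(original))
print(translate(shifted))
```

In the toy sequence the original frame ends at the stop codon, whereas after the insertion the former stop is no longer read in frame and translation continues. This is the kind of effect described above for the t-linker of transcripts C1 and C2.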
The transcript D family shows no relationship to the transcript families described above in terms of comparable peptides or linkers. Here, six only slightly different and transcript family-specific linkers (iPQM2-PQM, 13 amino acid residues) connect 17 different peptides, all belonging to the Cu 6 family (Table 1). Cu 6a, c, h, and i are identified one to three times at different positions within the transcript. The transcript terminates with Cu 6f, followed by a twofold stop signal. Comparable to the C-terminal situation of transcript C, an indel mutation (here a deletion of one adenine) in Cu 6f, when compared to h_pep_2a, occurred. The corresponding frame shift created three further hypothetical peptides (h_pep_2ab, h_pep_3ab, and h_pep_4ab) with high read numbers. Interestingly, no further linker could be identified between them, but two dibasic ArgC protease cutting motifs "RK" and a PQM motif were identified (S1E Fig, S4 Fig).
Cupiennin families. Nine different peptide families were identified by 454-seq and NGS. With the exception of the Cu 5 and 9 families, all other families exhibit a C-terminal Gly residue, which is used for C-terminal amidation of the mature peptides as deduced from proteomic data. For most peptides, α-helical parts could be theoretically calculated. Furthermore, four hypothetical peptide families were detected through the analysis of transcripts C and D. With NGS, 134 LPs were identified, 39 of which are identical to peptides among the 84 identified by 454-seq. In total, 179 small cationic LPs (134 + 84 - 39) were identified in both transcriptomes (Table 2, S1A Table).
The Cu 1 family is the best investigated cytolytic peptide family of C. salei. All known peptides (Cu 1a, b, c, d) 15 were identified in the transcriptome by 454-seq, as well as further members of the Cu 1 family: Cu 1e and Cu 1f by 454-seq, and Cu 1g to Cu 1r by NGS. The peptides are composed of 36 amino acid residues, including the Gly for C-terminal amidation of the mature peptides. They have molecular masses between 3717 and 3931 Da and are highly cationic, with pIs between 9.78 and 10.54. The more hydrophobic N-terminus is connected by a sixfold repeat of four amino acids, always starting with Lys, to a more hydrophilic C-terminus. All identified Cu 1 variants show predicted α-helicities between 44 and 75%. The Cu 3 family (28 amino acid residues, 2969-3109 Da) differs from the Cu 1 family only in the C-terminal part, where the last two C-terminal repeats are missing. Comparable to the Cu 3 family, the Cu 4 family (28 amino acid residues, 2940-3184 Da) likewise lacks the last two repeats of Cu 1, and confirms the truncated variants obtained by proteomics studies. All peptides are characterized by pIs mainly above 10, a high net charge between +4 and +8, and predicted α-helical structures between 46.4 and 78.6% (S1A Table). An overview of the sequence logo 26 of the Cu 1, Cu 3, and Cu 4 families highlights the related peptide structures (Fig. 4).
The Cu 2 family (36 amino acid residues, 3719-3887 Da) differs from Cu 1 mainly in the C-terminal part; apart from that, the physical and chemical parameters are comparable to those of the Cu 1, 3, and 4 families. Even though the Cu 9 family (35 amino acid residues, 3775-3802 Da) differs in its N- and C-terminal parts from Cu 1, a sequence of 12 amino acid residues is nearly 100% identical at the nucleotide level, except for a point mutation in the third codon position of Thr (Fig. 4). The Cu 5 family is mainly present in transcript C, but also contains Cu 3d that primarily belongs to transcript B. In contrast to most other identified peptide families, the Cu 5 family exhibits no C-terminal Gly, implying no C-terminal amidation of the mature peptides. With 24 amino acid residues, molecular masses between 2585 and 2713 Da, and pIs nearly all above 10, these peptides are characterized by an alternating sequence of cationic amino acid residues and non-polar, hydrophobic amino acid residues (Fig. 5).
The Cu 6 family (24.3%) is the most diversified LP family (20-22 amino acid residues). It comprises 17 different peptides identified by 454-seq (Table 2, S1A Table), five of which have been elucidated at the proteomic level 15 , and a further 31 peptides identified by NGS, eight of which are identical to peptides identified by 454-seq. In addition, two related peptide variants (h_pep_2ab, 454-seq) were identified, giving 42 peptides in total (17 + 31 - 8 + 2). These peptides exhibit molecular masses between 2458 and 2826 Da and pIs between 9.52 and 11.70, and 13.6-35.0% of their amino acid residues are positively charged. Strikingly, the first third of the sequences is characterized by hydrophobic amino acid residues, which are mainly connected by a short α-helical part to the polar/charged C-terminus. Except for Cu 6f and h_pep_2ab, all other peptides are C-terminally amidated. An overview of the sequence conservation within the Cu 6 family and the relative frequency of each amino acid residue at a certain position is given as sequence logo 26 (Fig. 6A). Furthermore, the mutations behind this
The C-terminally identified Cu 7 family in transcript B is confirmed by earlier proteomic data 15 . With 18 amino acid residues these peptides are the shortest cationic LPs so far identified in spider venoms. About one third of all amino acid residues are Arg or Lys, and in the middle of the peptides short α-helical structures, mainly of cationic amino acid residues, are predicted (S1A Table).
A further short LP family is Cu 8 (transcript B), which is composed of 20 amino acid residues, about one third of which are Lys or Arg. In the middle of the peptide, a short α-helical structure is predicted. Interestingly, only four peptide variants were identified (Cu 8a, 454-seq; Cu 8b, c, d, NGS), albeit both peptides are the first peptides after the propeptide (Fig. 7A, S1A Table).
Identified only at the transcriptomic level, the Cu 9 family consists of two variants (Fig. 4). Strikingly, within the α-helical part of the peptides, a sequence of 12 amino acid residues (Val 12 to Gln 23 ) is identical to a part of the Cu 1 sequence (Val 13 to Gln 24 ), except for one silent mutation in Thr 15 .

Hypothetical peptide families and t-linkers. Usually, a transcript ends C-terminally with a stop codon after the encoded peptide, as shown for all neurotoxin transcripts identified in the venom gland transcriptome of C. salei 20 . Analysis of the C-terminus of the transcript A and B families revealed two possibilities. The first results in a C-terminal peptide, which is connected to the previous peptide by an unusual linker (RSLNFMDNEEQR) and followed by a single stop codon, e.g., Cu 7 (transcript B). The second exhibits a t-linker after the last peptide, lacking a C-terminal PQM motif, followed by a stop signal, two Lys residues and again a stop signal (transcript A). When we apply these two variants to all other identified transcript families, more hypothetical peptides can be identified for the transcript C families. Transcript C1 and C2 families exhibit, after an unusually long linker of 21 amino acid residues (Table 1), a comparable peptide, named hypothetical peptide h_pep_1a, b, c, followed by a stop signal (Fig. 7A). These peptides (2900 Da, 26 amino acid residues) have been identified only in a low read number (2.7% of both C-terminal endings) and probably play no important role (S1C and S1D Fig).
Transcript D exhibits also two C-terminal processing possibilities. In contrast to transcript A and C, the first possibility is characterized by Cu 6f, which is followed by two stop signals, and no t-linker. The second  (Table 3). Interestingly, the two-chain peptides CsTx-16a1b1 and CsTx-16c1d1 are about ten-fold less toxic than variants of the Cu 1 family. When comparing the insecticidal activity of the cupiennins with neurotoxins isolated from C. salei, only CsTx-1 is by a factor of 10 more active than Cu 1b. However, the determined insecticidal activity (flesh flies) of cyto-insectotoxin 1a (CIT 1a), one of the longest linear peptides so far described for spider venoms (Lachesana tarabaevi) 18 , is in the same range as CsTx-1.
Phylogenetic aspects. Is the high diversity of LPs identified in the venom gland transcriptome of C. salei species or genus specific? We analyzed the venom gland transcriptome of Cupiennius getazi, the sister species of C. salei. As expected, from every cupiennin family at least one LP identical at the peptide level was identified in both species. Besides 24 shared LPs, an additional 30 new variants of known cupiennins and four new peptides, Cu 10, 11, 12, and 13, were found (Table 2, S1B Table). Furthermore, the signal peptides starting the transcript families A, B, C, and D of C. salei are identical (transcripts B and C) or very similar (transcripts A and D) to those of the corresponding transcript families in C. getazi. Both variants of the striking C-terminal t-linkers identified in the C. salei transcriptome are present in C. getazi. Therefore, we assume that this unique importance of LPs in the venom and the corresponding processing mode is typical for the genus Cupiennius.
Though we could not find the peptides described here in the closely related families of lycosids, pisaurids, and oxyopids, the highest similarity could be detected in lycosids (unpublished results L. Kuhn-Nentwig).

Discussion
Over 315 million years, spider venom evolution was driven by two crucial factors: constant selective pressure on the efficiency of its components and optimization of their biosynthesis costs. The oldest strategy, identified in mygalomorph spiders, refers to the synthesis of hundreds of single, target-specific neurotoxins, with targets usually being ion channels. Recruitment of a single ancestral DDH and/or ICK gene into the venom gland of the Australian funnel web spiders, followed by multiple gene duplications, diversification, and selection of neurotoxins by adaptive evolution, but also intragene duplications, are the key for the production of such neurotoxins 29 . This implies high metabolic costs for the spider, as one peptide precursor results in only one specific mature neurotoxin, for one specific target. The more such neurotoxins are available in venom, the higher the chance to subdue a large diversity of prey types and to minimize the probability of resistance evolving against a specific neurotoxin, but also the higher the production costs. An alternative strategy becomes apparent in some modern araneomorph spider families, through the additional production of less specific small, cationic and insecticidal LPs, paralleled by a concurrent reduction of the number of ICK-containing neurotoxins 10,20 . Besides C. salei, very limited knowledge about the corresponding transcript structures is available for only two further spider families, zodariids and oxyopids. An analysis of the oxyopid transcript structure of oxyopinin 1a, b, and c (tr|A0A5J6SIH8, tr|A0A4D6Q2Y9, tr|A0A4D6Q7V4) shows simple precursors, and the transcript structure of oxyopinin 2 (tr|A0A5J6SEB1_OXYTA; submitted by Vassilevski, Kozlov, Grishin, 2018) highlights a binary precursor structure resulting in oxyopinin 2a and 2b 13 . As reported for the zodariid L. tarabaevi, latarcins are mainly synthesized as simple precursors and a minority as binary precursors, as one precursor results in one or two active peptides, or in the case of a complex precursor, in five to six LPs, all separated by linkers 10 . This situation is comparable to that in the long transcript families A-D in C. salei.
The optimization strategy as realized in C. salei implies longer and more complex transcripts, nearly no simple or binary precursors, and the consequent separation or linking of LPs by specific linkers. Mainly four different transcript families are responsible for the expression of nine LP families and four hypothetical LP families, resulting in 179 different mature LPs as identified by 454-seq and NGS. For C. salei, this is a nearly threefold higher output of active peptides when compared with its identified 81 transcripts of neurotoxins resulting in a total number of 54 neurotoxins 20 . Such a strategy reduces metabolic costs considerably and makes C. salei, according to current knowledge, the most successful modern spider in terms of diversification of venomous components, especially of LPs, in its venom.
Previous investigations of Cu 1a and Cu 1d highlight these LPs as "cytolytic all-rounders", because they are active on bacteria and on eukaryotic pathogens such as trypanosomes and plasmodia, the causative organisms of malaria. Additionally, diverse human blood cells, human leukemic cells and human tumor cells are destroyed by these peptides 14,30 . With such properties, cupiennins are comparable to versatile, membrane active LPs identified in lycosids 31,32 and zodariids 9 . The insecticidal activities of cupiennins differ by a factor of 100 when comparing Cu 1b and Cu 7b. This indicates that the targets might be different. Hence, C. salei always endeavors to optimize the venom activity by synergistic activities 20,22,24,33 in which the toxicity on insects is potentiated, e.g. by neurotoxin merging 25 .
As mentioned above, gene duplication, diversification and intragene duplication are only a part of the explanation for such an extraordinary number of identified LPs. Spider DNA is characterized, in contrast to other arthropods, by short exons and long introns 34 . It is tempting to speculate that alternative splicing of such genes, gene-based combinatorial peptide library strategies and the induction of a hypervariability-generating mechanism 35 , as visible for example in the Cu 6 family (Fig. 6), are the driving force behind this tremendous diversity of LPs (Fig. 1).
Table 3. Insecticidal activities of selected cupiennins, CsTx-16 variants, neurotoxins CsTx-1, CsTx-9, and CsTx-13, and cyto-insectotoxin CIT1a. Different concentrations of peptides were dissolved in 0.1 M ammonium acetate, pH 6.1, and 0.05 µl was injected into Drosophila flies. The lethal dose (LD 50, pmol/mg fly), where 50% of the flies die after 24 h, was calculated with PRISM Vers. 6.07. * Data are from 14 , ** from 25 , *** from 18 , using flesh flies.
There is no obvious relationship between the length of the mainly anionic propeptides and the following number of cationic LPs in C. salei. However, up to three different iPQMs and four PQM motifs were identified therein, which allows a fast degradation of such peptides. Comparably, some of the propeptides of latarcins also exhibit iPQM and/or PQM structures 9 . In contrast, the above mentioned transcripts of oxyopids exhibit the longest propeptides (52 and 54 amino acid residues), no iPQM and only one PQM structure, but also the longest LPs (oxyopinin 1a, b, c) composed of 48 amino acid residues 13 .
Linkers between different LP precursors are multifunctional. During the translation of long transcripts, they connect and simultaneously separate highly cationic peptides, based on their high content of negatively charged Glu and Asp residues. Their specific N-terminal iPQM and C-terminal PQM structures (Table 1) provide access to possibly specific proteases, which (1) release different LPs and (2) facilitate the specific formation of disulfide bonds between well-defined pairs of linear peptides, each containing one Cys, resulting in the CsTx-16 family as described here for the first time. Such a production mode allows a higher number of different combinations of the corresponding variants of the single chains of CsTx-16ab, cd, and ef. We can assume that the iPQM structure and the length of linkers may influence the kinetics of the proteolytic activity, because the linkers before and after the corresponding peptide pairs have to be cut before the formation of the disulfide bond, and the linker between the corresponding single chains after the disulfide formation.
The maturation of CsTx-16 peptides proposed here highlights a second biosynthesis pathway for two-chain peptides. So far, two-chain neurotoxins such as CsTx-8, 12, and 13 from C. salei 20,23 , or omega-aga-1A from Agelenopsis aperta 36 are the result of a specific proteolytic activity of a PQM-protease on a single peptide 17 . After the formation of disulfide bonds within these peptides, an iPQM and PQM structure is recognized by a specific protease, resulting in the described two-chain peptides.
Length and structure of linkers seem to be at least specific for a spider genus, if not for a family. From both Cupiennius species investigated here, linkers are identical, they are mainly composed of 13 amino acid residues, and in the context of CsTx-16, they consist of 11 amino acid residues. L. tarabaevi exhibits linkers composed of eight, nine or ten amino acid residues 10 and in Oxyopes takobius a linker of 25 amino acid residues connects oxyopinin 2a and 2b.
Strikingly, a high number of reads, especially in transcript A, exhibits C-terminally a t-linker, followed by a peculiar arrangement of two stop signals, separated by two Lys. The function of such t-linkers is still enigmatic. They show sequence similarities to a part of the EF-hand 2 low-affinity calcium binding site of the neuronal calcium sensor protein recoverin of vertebrates 37 , but also to a part of the DEAD/DEAH box helicase of bacteria (WP_146269254.1:758-769, Mucilaginibacter sp.). Possibly, this indicates venom compounds with yet unknown function.
In conclusion, we see considerable application potential in the expression strategy of C. salei concerning such long inactive peptide chains. They can be used as a blueprint for the concurrent recombinant expression of selected LPs in bacteria, which can be further processed in a second step with the help of the PQM protease 17 . Many LPs are active on a large number of negatively charged membrane systems and also on as yet unidentified targets 30 , present in all living organisms. These properties make them of interest for the pharmaceutical industry 5,9,38 . In addition, intracellular effects through a possible uptake of such peptides into different cells, in terms of efficient intracellular drug delivery 39,40 , should be investigated.

Material and methods
Spider maintenance and cDNA libraries of venom glands. C. salei (own spider stock since 1990) and C. getazi (originated from Costa Rica; A. Leetz) were laboratory bred, and adult spiders were used for the generation of transcriptomes. A first cDNA library of C. salei venom glands was generated by 454-sequencing (Skuldtech, Montpellier, France) 19 . The venom glands of 20 adult female spiders were dissected after electrical milking at different time intervals (24, 48, and 62 h and 8 and 14 days) and stored in RNAlater (Qiagen). The samples were sent to Skuldtech, and RNA was extracted from pooled venom gland material. The quality and quantity of RNA were tested by Nanodrop and an Agilent 2100 Bioanalyzer, and 15 µg were used for EST library construction. Sequencing runs yielded 460,000 ESTs, and de novo sequence assembly was performed using MIRA_2 (V2.9.25 with enhanced 454 support) with 98% homology. The assembly resulted in a total of 34,107 contigs and 202,877 single sequences.
Later, further cDNA libraries of C. salei and C. getazi venom glands were generated on an Illumina HiSeq3000 platform (University of Bern, Switzerland). Briefly, from 16 adult female C. salei spiders and 31 adult C. getazi spiders (18 male and 13 female) the venom glands were dissected after milking at different time intervals ( and stored in RNAlater. The extraction of total RNA was done by an in-house protocol combining phenol/chloroform extraction with the RNeasy mini kit (Qiagen). The quality and quantity of RNA were assessed by Nanodrop, the Qubit RNA BR assay kit (Qubit 2.0 fluorometer; Thermo Fisher Scientific) and by an advanced analytical fragment analyzer system (fragment analyzer RNA kit, DNF-471, Agilent). One µg of RNA was used for each cDNA library preparation with the Illumina TruSeq-stranded mRNA prep kit. For further sequencing double barcoding and selected fragments with lengths between 300 and 600 bp (Pippin HT system, Sage Science) were used. To avoid cross contamination between cDNA libraries of different spider venom gland transcriptomes, both libraries were multiplexed (25% per lane) at independent times and together with other non-arthropod libraries, mostly genomic libraries of vertebrates. Assemblage of the resulting reads was done using Trinity version 2.1.1 (C. getazi) and version 2.5.2 41
Transcriptome analysis of 454-seq data and assemblage of transcript A, B, C1, C2 and D families. All cupiennins and CsTx-16 identified so far at the proteomic level 15 were used to search with BlastP (E-threshold, 0.0001) against the above-mentioned 454 transcriptome (454-seq). First, the obtained contigs were analyzed and sorted in terms of identical peptides belonging to different cupiennin families. Second, a further sorting was done by taking into account specific linkers located N- or C-terminally to the peptides. A linker sequence is characterized N-terminally by an iPQM and C-terminally by a PQM motif.
With the obtained specific N-terminal-or C-terminal linker information of the peptides, peptide chains were assembled and elongated in both directions until N-terminally the signal peptide or C-terminally a stop codon was identified.
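The overlapping elongation described above can be sketched in a few lines. This is only an illustration of the idea under stated assumptions, not the authors' (largely manual) procedure: units are represented by hypothetical names, each fragment is a triple of consecutive units as it might be recovered from one contig, and only the C-terminal direction is shown, since the N-terminal direction towards the signal peptide works analogously.

```python
def elongate_c_terminal(seed, fragments, stop="STOP", max_units=100):
    """Extend a chain of units C-terminally using two-unit overlaps."""
    chain = list(seed)
    while chain[-1] != stop and len(chain) < max_units:
        tail = tuple(chain[-2:])
        # All units that follow the current two-unit tail in some fragment.
        candidates = {frag[2] for frag in fragments if tuple(frag[:2]) == tail}
        if len(candidates) != 1:
            break  # dead end or ambiguous branch: left for manual curation
        chain.append(candidates.pop())
    return chain


# Hypothetical unit names (assumption); the last triple ends in a stop codon.
fragments = [
    ("pep_X", "linker_1", "pep_Y"),
    ("linker_1", "pep_Y", "linker_2"),
    ("pep_Y", "linker_2", "STOP"),
]
chain = elongate_c_terminal(fragments[0], fragments)
print(chain)
```

The sketch stops at ambiguous branches instead of guessing, because the same peptide can be followed by different linkers at different positions of a transcript. The `max_units` guard is an added safety limit for repeated units.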
For the assemblage of the transcript families A, B, C1, C2, and D (see S1A-E Fig), only transcriptomic data from 454-seq with the following conditions were used: every peptide/linker unit had to be identified at least in two contigs, composed at least of five reads. Mainly completely identified peptide sequences, built at least of a linker/peptide/linker or a peptide/linker/peptide structure, were used for overlapping elongation of the protein sequence. Counting of the identified peptides at a certain position in the transcript families was done by read counts of the corresponding contigs. For the determination of the content of individual peptides, reads of only fully premature sequences including a C-terminal glycine for amidation, which is characteristic for several cupiennin families (S1 Table), were counted (Fig. 1, S1A-E Fig). The transcriptomic data analysis workflow, and especially the manual construction of the transcript families, is given in detail in (