Collagens are structural proteins that are abundant and ubiquitous in the connective tissue of all metazoan animals. Animal collagens comprise some 28 diverse molecules that are characterised by a rod-like structure where polyproline II-like polypeptide chains are super-coiled about a common axis to form a triple-helix1. Since the close packing between the chains is such that only glycine (Gly) can fit in the centre of the triple-helix, collagen-forming proteins contain characteristic tripeptide repeats (Gly-Xaa-Yaa)n, where Xaa and Yaa can be any residue. The structure is stabilized by a high content (~20%) of the imino acids proline (Pro) and hydroxyproline (Hyp).

Whilst it is commonly regarded that collagens function within animals, the existence of an extracorporeal animal collagen was postulated in the 1960s when K. M. Rudall described the X-ray diffraction pattern of some hymenopteran insect cocoons as collagen-like2,3. Rudall noted that “during the brilliant summer of 1959 there were plagues of gooseberry sawfly (Nematus ribesii)” and this enabled cocoons or silk fibres drawn from the salivary gland to be obtained in sufficient quantity for study2. The diffraction patterns suggested twisted cables of collagen molecules about 3 nm in diameter. Amino acid analyses found high Gly (33.6%) and Pro (10.0%) content consistent with collagens, as well as high Ala (12.2%) characteristic of silks. Of most interest was that Hyp, which is seen as a characteristic of animal collagens, was absent4.

Other described insect collagens are apparently homologous to molecules found in mammals. For example, a collagen from Drosophila melanogaster5 is similar in structure to the type IV collagen found from hydra to human6. Type IV collagens contain a central domain of about 1200 residues containing the characteristic (Gly-Xaa-Yaa)n collagen repeat with high levels of Pro and Hyp, but with many interruptions (up to 25) from inserts of up to 30 amino acids. The collagen domain is flanked by large non-collagenous domains. Cysteine residues in the non-collagenous domains crosslink the proteins to generate the network structure of the extracellular matrix. Insect collagens with similarities in composition and length (~280–290 nm) to the predominant interstitial type I and type II animal collagens have been isolated, for example, from locust7 and cockroach8. The mature interstitial collagen proteins contain just over 1000 residues, which primarily consist of uninterrupted Gly-Xaa-Yaa repeats containing around 12% Pro and 10% hydroxyproline7,8. Rather than networks, these proteins assemble into fibrils and fibres.

Here, we examine the cocoon silk of the willow sawfly, Nematus oligospilus, a hymenopteran species closely related to the sawfly species described by Rudall2,3. The native silk structure and composition were investigated using wide angle X-ray scattering and amino acid analysis. A combined proteomic-transcriptomic approach was used to identify the silk genes. The silk proteins were expressed in Escherichia coli and the recombinant molecules were demonstrated to be protease resistant and have circular dichroism spectra characteristic of collagen.


X-ray diffraction patterns of silk

Diffraction studies using wide-angle X-ray scattering (WAXS) on N. oligospilus cocoon silk showed a 0.286 nm meridional reflection peak (Fig. 1A) characteristic of the axial distance between residues along strands of the collagen triple helix9. This feature was similar to that observed by Rudall for N. ribesii silks2,3. Other observed peaks from the N. oligospilus cocoon silk were consistent with a structural repeat with 4.59 nm spacing in the axial direction (Fig. 1B) and therefore were attributed to various harmonics of long axial repeats of a collagen structure. Equatorial spots at 1.2 nm, attributed by Rudall to lateral spacing between triple helices, were not observed in the willow sawfly WAXS, and peaks at 0.326 and 0.382 nm were not observed in the gooseberry sawfly (Fig. 1), implying differences in the structural arrangement of the collagen molecules within the two silks. Previously, Rudall had attributed a meridional arc observed at 0.465 nm to protein in a cross-β structure, although no other characteristic peaks of cross-β structure were detected.
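The attribution of peaks to harmonics of a long axial repeat can be illustrated with a short calculation. The following Python sketch assumes that the n-th order reflection of a repeat D appears at a spacing of D/n; the function name is hypothetical, and whether the orders it returns match the assignments in Fig. 1B is not stated in the text.

```python
def nearest_harmonic(d_obs_nm, repeat_nm=4.59):
    """Return (order n, predicted spacing D/n in nm, relative deviation).

    Assumes the n-th harmonic of an axial repeat D lies at spacing D/n.
    """
    n = max(1, round(repeat_nm / d_obs_nm))
    d_pred = repeat_nm / n
    return n, d_pred, abs(d_pred - d_obs_nm) / d_obs_nm


# Spacings quoted in the text for the willow sawfly silk (nm)
for d in (0.286, 0.326, 0.382):
    n, d_pred, dev = nearest_harmonic(d)
    print(f"{d:.3f} nm: nearest order {n}, predicted {d_pred:.4f} nm, deviation {dev:.2%}")
```

A small deviation only shows that a spacing is compatible with the 4.59 nm repeat; it does not by itself establish the assignment.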

Figure 1

Wide angle X-ray scattering (WAXS) data from N. oligospilus cocoons.

(A): WAXS spectrum (average of five spots). The individual 1-D WAXS spectra are shown in Supplementary Figure 1. (B): Proposed assignments of WAXS peaks from the willow sawfly and comparison with peaks observed in gooseberry sawfly (from Rudall)2.

Hydroxyproline (Hyp) analysis of cocoon silk

Amino acid analysis of the willow sawfly silk did not detect any Hyp, consistent with Rudall's results from gooseberry sawfly cocoons in the 1960s2,3.

Native silk production

The sawfly silk proteins are produced at high concentrations in the labial gland, a dedicated silk gland. The silk gland comprises a convoluted tubular structure of about 12–15 mm in length when not stretched, onto which a series of nodular structures are attached (Fig. 2). Secretory cells within these nodules produce the silk proteins, which are then secreted into the gland lumen10. Haematoxylin and eosin staining of the gland shows the protein organized within the secretory cells into tactoids (Supplementary Fig. 2).

Figure 2

Images of silk and silk gland of sawflies.

(A): Silken cocoon with encased pupa. (B): Silk gland showing numerous nodules comprising large secretory cells attached to the lumen through delivery ducts. Scale bar 1 mm.

Mechanical and chemical properties of native silk

Mature silk fibres (2.2 ± 0.4 μm diameter) were obtained directly from larvae that had commenced spinning their cocoon. The mechanical properties of the silk fibres (Fig. 3) exceeded those of mammalian collagens, with breaking stresses of 322 ± 68 MPa and breaking strains of 34 ± 4%, compared to 100–120 MPa breaking stress and 13% strain at break reported for other animal collagens11. A further comparison of the mechanical properties of collagen and other biological fibres to various synthetic materials can be found in Gosline et al.12.

Figure 3

Mechanical properties of cocoon silk.

(A): example stress-strain curves of three N. oligospilus silken fibres. (B): a summary of all data for measured fibres.

The sawfly cocoons could readily be dissolved at room temperature in common protein denaturants such as 8 M urea or 6 M guanidine hydrochloride. Following denaturing polyacrylamide gel electrophoresis analysis, solutions of solubilized cocoons resolved as three discrete proteins predicted to be 53, 47 and 32 kDa (Fig. 4A, native silk). Under reducing conditions the protein pattern was not altered, indicating that disulfide bonds were not present between the proteins.

Figure 4

Sawfly silk proteins.

(A): SDS-PAGE of recombinant sawfly silk proteins alongside native silk proteins with protein marker ladder shown on left. (B): Circular dichroism spectra of recombinant sawfly cocoon collagens after pepsin treatment showing characteristic collagen maxima at about 220 nm, SfC A (− − − −), SfC B (– · · – · ·) and SfC C (– – –). The solid line (—————) shows a representative collagen spectrum for comparison. The spectra for SfC B (– · · – · ·) and SfC C (– – –) are overlapping.

Identification of silk protein sequences

The silk proteins were identified after matching the mass of tryptic peptides from the three proteins from the cocoon (Fig. 4A) to predicted tryptic peptide masses from proteins encoded by three cDNA populations identified in a silk gland cDNA library isolated from late final instar larvae (53 kDa: 9 peptides; 47 kDa: 9 peptides and 32 kDa: 3 peptides; Fig. 5). Over 70% of the sequences obtained from the sawfly silk gland cDNA library encoded the three silk proteins. The measured tryptic fragments did not match any other proteins in the Genbank database. The primary amino acid sequence of the proteins is shown in Figure 5 and described in the discussion.

Figure 5

Architecture and amino acid sequences of the willow sawfly silk proteins.

Tryptic peptides identified by mass spectrometry from proteins in cocoon silk are underlined. The first position of each Gly-Xaa-Yaa repeat is shown in red.

Expression and analysis of silk proteins

The ability of the sawfly silk proteins to fold into collagens was confirmed after recombinant expression of the proteins. The small size and absence of Hyp in the silk proteins were conducive to their recombinant expression in standard E. coli fermentation systems (Fig. 4A, rSfC A-C). The recombinant proteins were expressed from two different vectors (pET and pCold) and in neither case did they migrate according to their predicted molecular weight or alongside their native equivalents. Whilst it is common for collagen proteins to migrate inconsistently with protein markers, the inconsistency with the native proteins suggests either the proteins are not fully unfolding in the guanidine hydrochloride solubilisation treatment or that post-translational modifications are occurring in the native system. The recombinant molecules were digested with pepsin at 20°C, a technique commonly used to isolate and purify collagen molecules. Consistent with the collagen structure, a substantial part of each of the recombinantly-produced sawfly silk proteins was resistant to pepsin digestion, whereas the entire molecule was protease-sensitive at higher temperatures. The circular dichroism spectra of the pepsin-resistant fragments confirmed the proteins were folded into collagen triple-helices (Fig. 4B), with each spectrum showing characteristic collagen positive ellipticity around 220 nm13.


The present study examines the silk (Fig. 2) of the willow sawfly, Nematus oligospilus, a hymenopteran species restricted to willows in temperate areas of Australia. As with all silks, the cocoon is produced from a concentrated protein solution that the insect accumulates in a dedicated silk gland. In the willow sawfly, the silk gland is derived from the larval labial gland, a common adaptation in insects10. In contrast to homologous glands where the silk proteins are produced in the anterior region and then accumulate in a posterior lumen, the lumen of the sawfly silk gland is ringed with secretory units along its entire length (Fig. 2) and the silk proteins are organized into higher level structures within the secretory units (Supplementary Fig. 1). The cocoon is made of silk-like micro-scale fibres (Fig. 2) and the X-ray diffraction pattern from the fibres is characteristic of a collagen triple helix (Fig. 1). However, the fibres dissolve readily in protein denaturants, a property not common to either collagens or silks (Fig. 4A). Despite their comparatively low chemical stability, the mechanical properties of the silk fibres exceed those of mammalian collagens (Fig. 3), with breaking stresses and breaking strains around three times higher than those reported for other animal collagens11. Collectively, these findings suggest the sawfly produces unusual silk using novel collagens.

The solubilized cocoon contained three silk proteins (Fig. 4A) whose primary sequence was identified from cDNA isolated from a silk gland cDNA library (Fig. 5). Analysis of the protein sequences conclusively identified them as collagen-forming proteins: each contained a central sequence block of 60–62 contiguous Gly-Xaa-Yaa repeats that comprised the majority of the protein sequence. Interestingly, the length of the collagen domains (180 to 186 residues) corresponds to an axial length of 515 to 532 Å, similar to the axial period of 550 Å observed in negatively stained tactoids from gooseberry sawfly by Rudall (1967). The collagen structure in the silk proteins was further confirmed after the three proteins were individually expressed in E. coli (Fig. 4A) and purified recombinant molecules were demonstrated to be protease-resistant with circular dichroism spectra consistent with the collagen structure (Fig. 4B). Comparison of the primary sequence, size and molecular organization (Fig. 6) of the sawfly silk proteins to animal collagen proteins available in the Genbank database found the silk collagens were only similar to each other and not to other described collagens, suggesting that they evolved independently of other collagens and thereby constitute a new class of animal collagens. We termed the sawfly silk sequences SfC A–C (Sawfly Collagen A, B or C).
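The axial lengths quoted for the collagen domains follow from the residue count and the axial rise per residue. As a minimal sketch in Python, assuming the rise is the 2.86 Å (0.286 nm) taken from the meridional reflection reported in the Results (the function and constant names are illustrative):

```python
RISE_PER_RESIDUE_A = 2.86  # assumed axial rise per residue in Å (0.286 nm reflection)


def helix_length_angstrom(n_residues, rise=RISE_PER_RESIDUE_A):
    """Axial length in Å of a triple-helical domain of n_residues per chain."""
    return n_residues * rise


for n_triplets in (60, 62):
    residues = 3 * n_triplets  # each Gly-Xaa-Yaa repeat contributes 3 residues
    length = helix_length_angstrom(residues)
    print(f"{n_triplets} triplets = {residues} residues, about {length:.0f} Å")
```

With 60 and 62 triplets this gives approximately 515 and 532 Å, the range stated above.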

Figure 6

Comparison of the molecular architecture of the collagen-forming silk proteins SfC A–C to a selection of other animal collagens.

These include an interstitial collagen α1[I]; a network forming collagen, α1[IV]; a beaded filament-forming collagen, α1[VI]; and 2 FACIT collagens of different sizes, α1[IX] and α1[XXI]. Lines indicate comparative protein length and blocks indicate regions of collagen forming sequence within the chains. The published sequences for the collagens from the insect Drosophila melanogaster are highly homologous to the type IV sequences.

The vast majority (98%) of the triplets in the sawfly silk proteins are commonly found in other animal collagens, which only contain a few of the 400 possible combinations at any significant level14. Only 2% of the triplets, Gly-Glu-Phe and Gly-Lys-Ile in SfC A, Gly-Ala-Met in SfC B and Gly-Asn-Gln in SfC C, are considered rare14. However, the distribution and composition of imino acids in the sawfly silks are highly unusual for animal collagens, in that there are no hydroxyprolines, a post-translational modification of proline considered characteristic of animal collagens. In other animal collagens, the X-position in the primary sequence Gly-Xaa-Yaa triplet is commonly proline and the Y-position is frequently hydroxyproline. The overall high level of imino acids serves to correctly position the protein backbone for collagen folding15 and Hyp in the Y-position is associated with thermal stability16. A lack of hydroxyproline, however, is not unique. Recently, a new group of collagen-like proteins that also lack hydroxyproline has been described from bacteria17. A range of these proteins has been characterised as recombinant products18,19,20. This work has shown that these collagens were stable at around mammalian body temperature, 35°C–38°C, despite the lack of hydroxyproline, with intra-molecular ion pair formation being partly responsible for this stability21,22.

In addition to the absence of Hyp, the proline distribution in the sawfly silk proteins is biased, occurring predominantly in the X-position (53%, 39% and 38% for SfC A–C, respectively) and rarely in the Y-position (6%, 4% and 8%, respectively). In other animal collagens, proline occupies around 28% of the X-position and Hyp occupies about 38% of the Y-position23. However, the total proportion of imino acids in the sawfly silk proteins is similar to that found in other animal collagens, comprising around 16% of the residues in the collagen-forming domains, consistent with the role of these residues in assisting the folding of the protein backbone into the collagen structure.
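Positional percentages of this kind can be obtained by splitting a collagen domain into Gly-Xaa-Yaa triplets and counting the residue of interest in each position. The Python sketch below shows one way to do this; the function name is hypothetical and the example sequence is a made-up toy string, not one of the SfC sequences.

```python
def positional_fraction(collagen_domain, residue="P"):
    """Fraction of Gly-Xaa-Yaa triplets carrying `residue` in the X and Y positions.

    `collagen_domain` is a one-letter amino acid string made of uninterrupted
    Gly-Xaa-Yaa repeats; returns (fraction_in_X, fraction_in_Y).
    """
    if not collagen_domain or len(collagen_domain) % 3:
        raise ValueError("domain length must be a non-zero multiple of 3")
    triplets = [collagen_domain[i:i + 3] for i in range(0, len(collagen_domain), 3)]
    if any(t[0] != "G" for t in triplets):
        raise ValueError("every triplet must start with Gly")
    in_x = sum(t[1] == residue for t in triplets) / len(triplets)
    in_y = sum(t[2] == residue for t in triplets) / len(triplets)
    return in_x, in_y


# Hypothetical five-triplet domain: GPA, GPK, GER, GPR, GAP
toy_domain = "GPAGPKGERGPRGAP"
print(positional_fraction(toy_domain))        # proline in X and Y
print(positional_fraction(toy_domain, "R"))   # arginine in X and Y
```

The same function can be called with `residue="R"` to count arginine in the Yaa position.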

Mechanisms to increase thermal stability in the sawfly silks, in the absence of Hyp in the Yaa position, include the presence of particular triplets that enhance collagen stability. Most notably, SfC A–C have 7, 4 and 6 arginine residues in the Yaa position, respectively. Arginine in the Y-position confers triple-helix stability similar to Hyp24. However, the ecology of the animal and biology of silk production make the requirement for thermal stability in the collagen molecules less stringent than the requirement in mammals. Sawflies are poikilothermic and therefore only require the collagen to be stable at environmental temperatures, rather than mammalian body temperature. Furthermore, the biophysical environment of the silk proteins, both during silk fabrication and post-fabrication, serves to enhance the stability of the collagen molecules. Unlike other collagens, the silk proteins are maintained in a highly concentrated silk protein solution in the silk gland. The thermal stability of collagen rises as water is removed, such as when aggregates form25,26, and therefore concentration and aggregation of the collagen molecules in the silk gland probably serve to enhance the molecules' thermal stability. After silk fibre fabrication, the collagen is dehydrated and the molecules will be stable at any normal environmental temperature.

The collagen domains of the sawfly silk proteins are flanked by variable length glycine-rich, non-collagenous domains of 49, 103 and 41 (N-termini) and 21, 46 and 84 (C-termini) residues for the SfC A, B and C chains, respectively. Repetitive motifs occurred in the C-terminal domain of the SfC B and SfC C proteins: eight repeats of the tetrapeptide Gly/Tyr-Asp-Asn-Lys in SfC B and 14 repeats of the pentapeptide Gly-Tyr-Asp-Asn-Lys in SfC C. With the exception of the C-terminal domain of SfC C, which showed 76% similarity over 83 residues to a Gly-Tyr-Asp-Asn-Lys repeat in an uncharacterized protein from the fungus Arthroderma otae, none of the terminal domains showed sequence similarities to known proteins. Generally, animal collagens contain N- and C-terminal peptides that are essential for folding of the triple-helix and correct fibrillogenesis but are then proteolytically removed and are not present in the mature material27. Several peptides corresponding to the N- and C-termini of the sawfly silk proteins were identified by liquid chromatography-tandem mass spectrometry of proteins solubilized from cocoons (Fig. 5), indicating that most of these regions, if not all, are present in cocoons. These regions may contribute to the mechanical properties of the fibres.

The data presented in this paper have demonstrated the presence of the collagen triple-helix structural motif in an insect silk, confirming the initial proposal by Rudall in the 1960s2. Collagen is an established biomedical material and many tons of animal-derived collagen are used annually in pharmaceutical and medical applications. However, there have been concerns over batch-to-batch variability and the transmission of diseases in the materials, leading to a preference for development of a recombinant material that can be produced under controlled conditions28. Replication of other animal collagens in recombinant systems requires co-expression of enzymes capable of converting a number of the proline residues in the protein to Hyp29,30 and, to date, such systems have not allowed large-scale production of the material31. The discovery of relatively small collagen proteins that fold into collagen triple-helices but do not contain hydroxyproline, that can be fabricated into materials from concentrated protein solutions and that form strong fibres without covalent cross-linking provides opportunities for collagen materials to be prepared as recombinant products in E. coli. The collagen silk proteins therefore constitute a new class of animal collagens ideally suited to the development of innovative collagen-based biomedical materials.


Collection and culture of insects

Willow sawfly larvae (Nematus oligospilus) were collected from willow trees (Salix sp.) around Lake Burley Griffin (Canberra, Australia) during the summers of 2007–2012. Larvae were maintained in the laboratory on a diet of fresh willow leaves.

Amino acid analysis

After spinning the cocoons, the sawflies were removed from the cocoons before metamorphosis was complete and cocoon parts not in contact with leaves or other substrate were selected for analysis. Six cocoons were washed in distilled water for 30 min three times and dried. The hydroxyproline composition of the washed silk was determined in duplicate after 24 h gas phase hydrolysis at 110°C using the Waters AccQTag Ultra chemistry at the Australian Proteome Analysis Facility Ltd (Macquarie University, Sydney). The limit of sensitivity in this analysis was estimated as 0.1 pmol total, corresponding to about 1 in 5000 residues.

Wide angle X-ray scattering (WAXS)

Diffraction patterns were collected at the SAXS/WAXS beam-line of the Australian Synchrotron. The beam-line was operated at a beam energy of 18 keV (wavelength of 0.6888 Å) with a sample to detector distance of 550 mm, yielding a q-range of approximately 0.08–2.6 Å−1. Use of a pixel counting detector (Pilatus-1M, Dectris, Baden, Switzerland) and evacuated flight tube achieved a reasonable signal to noise ratio from the weakly scattering sample. Data reduction and azimuthal integration, to produce 1D profiles, were achieved using scatterBrain (Australian Synchrotron).
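The relationships between beam energy, wavelength, scattering vector q and real-space spacing d used in this kind of measurement are standard (λ = hc/E and d = 2π/q). The short Python sketch below applies them to the values quoted here; the function names are illustrative and the code is not part of the scatterBrain reduction.

```python
import math

HC_KEV_ANGSTROM = 12.398  # h*c in keV·Å (rounded)


def wavelength_angstrom(energy_kev):
    """Photon wavelength in Å for a beam energy in keV."""
    return HC_KEV_ANGSTROM / energy_kev


def q_from_d(d_angstrom):
    """Scattering vector magnitude q (Å^-1) for a real-space spacing d (Å)."""
    return 2 * math.pi / d_angstrom


def d_from_q(q_inv_angstrom):
    """Real-space spacing d (Å) for a scattering vector magnitude q (Å^-1)."""
    return 2 * math.pi / q_inv_angstrom


print(f"wavelength at 18 keV: {wavelength_angstrom(18):.4f} Å")
print(f"q for the 2.86 Å reflection: {q_from_d(2.86):.3f} Å^-1")
print(f"q for the 45.9 Å repeat: {q_from_d(45.9):.3f} Å^-1")
print(f"d-range covered: {d_from_q(2.6):.2f} to {d_from_q(0.08):.1f} Å")
```

Both the 0.286 nm reflection and the 4.59 nm repeat fall inside the stated q-range of approximately 0.08–2.6 Å−1.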

Protein solubilization and gel electrophoresis

Sawfly cocoons were incubated in 6 M guanidine HCl with or without 5% v/v 2-mercaptoethanol for 30 min at 95°C, after which no solid material remained. The guanidine HCl was removed after proteins were precipitated by the addition of nine volumes of cold 100% ethanol, incubated at −20°C for 1 hr, micro-centrifuged at maximum speed (15000 g) for 15 min at 4°C, the pellet washed with 90% cold ethanol, the pellet dried and the proteins resuspended in SDS-PAGE running buffer as required. Solubilized protein solution was run on a Nu-PAGE 4–12% Bis-Tris protein gel (Life Technologies) and stained with Coomassie Blue R-250.

Mechanical testing

When disturbed, final instar larvae in the process of spinning would rapidly move away from stimuli, leaving a single fibre that could be collected for mechanical testing. Fibres were mounted across a 2 mm gap on paper frames, fixed at either end with epoxy glue and examined on an optical microscope to determine their exact gauge length and diameter (and to examine and discard any samples with defects before mechanical testing). Tensile measurements were carried out on an Instron Tensile Tester model 4501 at a rate of 2.5 mm.min−1. Tests were conducted in air at 21°C and 65% relative humidity. Data from fibres that broke at the mounting points were excluded.
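Breaking stress and strain are typically derived from the measured force, fibre diameter, extension and gauge length. The Python sketch below shows the conventional engineering definitions, assuming a circular fibre cross-section; the function names and the example force and extension are hypothetical and are not measurements from this study.

```python
import math


def engineering_stress_mpa(force_mn, diameter_um):
    """Engineering stress in MPa from force (mN) and fibre diameter (µm).

    Assumes a circular cross-section. 1 mN per µm² equals 1000 MPa.
    """
    area_um2 = math.pi * (diameter_um / 2) ** 2
    return force_mn / area_um2 * 1000.0


def strain_percent(extension_mm, gauge_mm=2.0):
    """Engineering strain in percent for a given extension and gauge length."""
    return 100.0 * extension_mm / gauge_mm


# Hypothetical fibre: 2.2 µm diameter, breaking at 1.2 mN after 0.68 mm extension
print(f"stress: {engineering_stress_mpa(1.2, 2.2):.0f} MPa")
print(f"strain: {strain_percent(0.68):.0f} %")

# Nominal strain rate for a 2 mm gauge pulled at 2.5 mm per minute
print(f"strain rate: {strain_percent(2.5):.0f} % per minute")
```

For a fibre of the mean diameter reported (2.2 μm), a breaking stress of about 320 MPa corresponds to a force on the order of 1 mN, which indicates the force resolution such tests require.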

Silk protein gene discovery

The DNA sequences corresponding to the silk protein genes were obtained from cDNA isolated from a cDNA library constructed from silk glands of four N. oligospilus larvae according to methods described previously32. Briefly, 29.1 μg total RNA was isolated using the RNAqueous4PCR kit (Life Technologies), from which mRNA was isolated using the Micro-FastTrack™ 2.0 mRNA Isolation kit (Life Technologies). This mRNA was used to construct a cDNA library using the CloneMiner™ cDNA kit (Life Technologies). The cDNA library comprised approximately 2.9 × 10^7 colony forming units, with an average insert size of 1.1 kb. From this library, 50 randomly chosen clones were sent for sequence analysis. Three genes represented by three distinct groups of sequences containing repeating (Gly-Xaa-Yaa)n sequences were found in 20 of the clones and these were termed SfC A, B and C (deposited in Genbank as KF534808, KF534809 and KF534807). Other sequences contained vector only sequences with the exception of five, which were found as single occurrences, with possible identities determined from database searches as peptidase, esterase, actin, RNA helicase and translation elongation factor. SfC A and SfC B were each represented by two sequences differing by 8 and 6 single nucleotide polymorphisms, respectively, suggesting the presence of two separate alleles. No variations were seen within the SfC C group of sequences.

Mass spectrometry

The protein bands on SDS-PAGE gels were cut out with a razor and analyzed as described previously32. The proteins were digested with trypsin and the resultant peptides were analysed by liquid chromatography-tandem mass spectrometry on an Agilent LC/MSD Trap XCT spectrometer. Agilent SpectrumMill software was used to compare the peptides to both the sequences obtained from a cDNA library of the sawfly silk gland and to the entire NCBI database of protein sequences.
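The principle of comparing measured peptides with those predicted from candidate sequences can be sketched as an in silico tryptic digest followed by a mass comparison. The Python example below is a simplified illustration only: it assumes cleavage C-terminal to Lys or Arg except before Pro, no missed cleavages and unmodified residues, and it is not the scoring performed by Spectrum Mill. All function names and the tolerance are hypothetical.

```python
import re

# Monoisotopic residue masses (Da) of the 20 standard amino acids
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the sum of residue masses


def tryptic_peptides(protein):
    """Cleave after Lys/Arg unless followed by Pro (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Monoisotopic mass (Da) of an unmodified, uncharged peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def match_masses(observed, protein, tol_da=0.5):
    """Predicted tryptic peptides whose mass is within tol_da of an observed mass."""
    predicted = {p: peptide_mass(p) for p in tryptic_peptides(protein)}
    return sorted(p for p, m in predicted.items()
                  if any(abs(m - o) <= tol_da for o in observed))


# Toy sequence, not an SfC protein
print(tryptic_peptides("GAKPGRGDNK"))
print(match_masses([203.1], "GKAAR"))
```

Tandem mass spectrometry additionally uses fragment ion spectra, so real identifications rest on more evidence than the intact peptide mass used in this sketch.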

Protein expression and purification

Full length cDNA clones from each of the N. oligospilus collagen types (SfC A–C) were PCR amplified using oligonucleotides designed against the 5′ and 3′ cDNA sequences and with appropriate restriction enzyme sites for transfer into the pET14b vector (Novagen) behind the His tag. The amplicons were purified, digested with the restriction enzymes, cloned into pET14b vectors (Novagen) and the sequences verified by DNA sequencing of the inserts. Constructs with the correct sequence were used for expression in E. coli Rosetta 2 (DE3) cells (Novagen) using Overnight Express media (Novagen) at 20°C for 48 hrs. Alternatively, sequences were modified at either end using the QuikChange Site-directed Mutagenesis Kit (Agilent Technologies) according to the manufacturer's instructions, to allow transfer into the pDONR 222 vector (Invitrogen). In order to increase expression of SfC C, the V-domain from the S. pyogenes Scl2 gene, which is a registration and triple-helix promoting sequence, was inserted at the N-terminus of the construct. Sequences were then inserted into pCold expression vectors (Takara Bio Inc., Japan), then transformed into competent E. coli BL21 cells for expression as required. Transformed cells were grown on auto-induction media (Novagen) at 20°C or 2YT media (16 g.L−1 tryptone, 5 g.L−1 yeast extract, 5 g.L−1 NaCl) containing 0.1 mg/ml ampicillin at 37°C in shaker flasks for 7 h, cooled to 25°C, induced with 1 mM IPTG, grown for a further 10 h and then a further 16 h at 15°C. The cells were harvested by centrifugation (3000 g for 30 min).

Cells were lysed using Bugbuster and expression levels were assessed using SDS-PAGE. For circular dichroism, cells were lysed by sonication in 40 mM sodium phosphate buffer pH 8.0, the cell lysate was clarified by centrifugation (20,000 g for 40 min) and the clear supernatant retained. The expressed silk proteins were purified by absorbing the clarified lysates on an IMAC HyperCel™ column (Pall), with elution by stepwise increments up to 500 mM imidazole, adjusted to pH 8.0 with HCl. Cross-flow filtration was used to lower the salt content and to concentrate the protein solution. Final purification was by gel permeation chromatography on a HiPrep Sephacryl™ S-200 column (GE Healthcare). The individual triple-helical collagen segments were prepared by digestion of the proteins with 0.1 mg/ml pepsin in 50 mM acetic acid. Purity of all products was assessed by SDS-PAGE.

Circular dichroism spectroscopy

Circular dichroism spectra were collected for pepsin treated silk protein samples in 50 mM acetic acid using 1 mm path length cells in a JASCO J-815 instrument.