The transcriptome of Darwin’s bark spider silk glands predicts proteins contributing to dragline silk toughness

Darwin’s bark spider (Caerostris darwini) produces giant orb webs from dragline silk that can be twice as tough as other silks, making it the toughest biological material. This extreme toughness comes from increased extensibility relative to other draglines. We show C. darwini dragline-producing major ampullate (MA) glands highly express a novel silk gene transcript (MaSp4) encoding a protein that diverges markedly from closely related proteins and contains abundant proline, known to confer silk extensibility, in a unique GPGPQ amino acid motif. This suggests C. darwini evolved distinct proteins that may have increased its dragline’s toughness, enabling giant webs. Caerostris darwini’s MA spinning ducts also appear unusually long, potentially facilitating alignment of silk proteins into extremely tough fibers. Thus, a suite of novel traits from the level of genes to spinning physiology to silk biomechanics are associated with the unique ecology of Darwin’s bark spider, presenting innovative designs for engineering biomaterials.

transcripts (complementary DNAs (cDNAs)) generated with the isoform sequencing (Iso-Seq) method to produce 10,666 consensus sequences. The Iso-Seq method of SMRT sequencing with Pacific Biosciences instruments generates continuous reads covering the complete length of individual cDNA molecules, overcoming the problems of assembling long, repetitive spidroins encountered with shorter reads from Illumina and Sanger sequencing (see Methods). To obtain additional expression information, a C. darwini MA gland transcriptome was also constructed from two Illumina RNA-sequencing (RNA-Seq) libraries, producing 206,838 unique sequences. In both assemblies we surveyed spidroin diversity by clustering translated sequences containing C-terminal domains with ≥95% identity. This yielded 14 sequence groups, seven with best BLAST hits to MaSp sequences. The remainder had top BLAST hits to spidroins associated with other silk types: PySp (piriform/cementing silk spidroin), MiSp (minor ampullate silk spidroin), TuSp (tubuliform/egg-case silk spidroin), Flag (flagelliform/capture spiral silk spidroin), AcSp (aciniform/wrapping silk spidroin), and AgSp (aggregate/glue spidroin; Supplementary Data 1).
We examined the longest spidroin in each C-terminal cluster and found three were most similar to MaSp1, having GGX and Aₙ motifs in repetitive sequence, whereas one was most similar to MaSp2 with combined GPGX, GGX, and Aₙ motifs (Fig. 1). However, three newly described spidroins had C-termini with top BLAST hits to MaSp1 or MaSp2, but lacked poly-alanine (Aₙ) motifs. One of these, MaSp5, is mostly composed of GGX motifs. By contrast, MaSp4a and MaSp4b are strikingly unique spidroins enriched with, to our knowledge, novel GPGPQ motifs that occupy 44-52% of the repetitive region (Fig. 1c). MaSp4 is markedly different from the eight MaSps from the genome of another orb-weaver, Trichonephila clavipes 7 (formerly Nephila clavipes 18): GPGPQ appears only once in Trichonephila MaSp-g, and iterations of this motif are not seen in other species' spidroins. Our longest MaSp4a includes nine repeats, seven of which are 63 amino acids long and contain four to six GPGPQ motifs, one GPGG motif, and one VSVVSTTVS motif (Fig. 1c). We did not identify C. darwini spidroins similar to MaSp3 19, which was recently described as having MaSp terminal domains but repetitive sequence lacking polyalanine and GPGX motifs. This absence of MaSp3 may be because our data do not survey all spidroins at the genomic level (which is how several MaSp3 sequences were identified 19), or this paralog may not be present in C. darwini given the apparent frequency of gene duplications and losses in the spidroin family 20.
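The share of a repetitive region occupied by a motif such as GPGPQ can be illustrated with a short Python sketch. This is not the authors' procedure: the function name `motif_coverage` and the toy sequence are hypothetical, and coverage is assumed here to mean motif residues divided by total residues, counting non-overlapping matches.

```python
import re


def motif_coverage(sequence: str, motif: str = "GPGPQ") -> tuple[int, float]:
    """Return (number of non-overlapping motif matches, fraction of residues covered)."""
    if not sequence:
        return 0, 0.0
    # re.findall scans left to right and returns non-overlapping matches.
    count = len(re.findall(re.escape(motif), sequence))
    coverage = count * len(motif) / len(sequence)
    return count, coverage


# Hypothetical toy repeat built from motifs named in the text; not a real MaSp4 sequence.
toy_repeat = "GPGPQ" + "GPGPQ" + "GPGG" + "VSVVSTTVS" + "GPGPQ"
count, coverage = motif_coverage(toy_repeat)
print(f"{count} motifs covering {coverage:.1%} of the sequence")
```

Applied to a translated repetitive region, the same function would give the kind of percentage quoted above, provided the region boundaries are defined the same way.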
Composition and structure of C. darwini MaSps and dragline. Typical orb-weaver dragline contains mainly MaSp1 and MaSp2, yielding a high content of glycine (34.7-42.2%), alanine (17.6-27.5%), and proline (1.7-15.7%) 21,22 (Supplementary Data 2). Both proteins are dominated by glycine and alanine, but proline is almost exclusively found in the repetitive region of MaSp2 (9.1-16.4% vs. 0-0.5% in MaSp1, Supplementary Data 3a, b and 4a, b). Amino acid compositions of C. darwini MaSp1a-c and MaSp2 are similar to those in other species (Supplementary Data 3a, b and 4a, b). By contrast, the repetitive region of C. darwini MaSp4 contains 31.4-32.0% proline and is deficient in alanine (2.2-2.5%; Fig. 2a). Higher proline is linked with greater silk extensibility 21,22, and MaSp4's ~32% proline content substantially exceeds the 10.9-16.3% proline in the Flag spidroin from elastic capture silk (Fig. 2a; Supplementary Data 4b). We note that the C. darwini spidroins presented here are based on partial transcripts, as are most typically obtained from spidroin cDNAs. Nevertheless, amino acid compositions inferred from translations of partial spidroins are expected to be similar to full-length proteins given that spidroins are largely composed of highly repetitive sequences.
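Percent amino acid composition of a translated sequence, as compared above, reduces to residue counts divided by sequence length. The following Python sketch shows that arithmetic under stated assumptions: `percent_composition` is a hypothetical helper, and the input is a made-up fragment rather than any sequence from this study (compositions reported here were computed with ProtParam, see Methods).

```python
from collections import Counter


def percent_composition(sequence: str) -> dict[str, float]:
    """Return the percentage of each residue (one-letter code) in a protein sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {residue: 100.0 * n / total for residue, n in counts.items()}


# Hypothetical fragment for illustration only.
toy_fragment = "GPGPQGPGGA"
composition = percent_composition(toy_fragment)
for residue, pct in sorted(composition.items(), key=lambda item: -item[1]):
    print(f"{residue}: {pct:.1f}%")
```

Because spidroin repetitive regions are highly iterated, the composition of a partial repeat array should approximate that of the full-length protein, which is the rationale given above for using partial transcripts.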
Instead, we propose the hypothesis that proline's high abundance and arrangement in GPGPQ motifs in MaSp4 may increase dragline extensibility either by forming novel structural domains embedded among other MaSp proteins or by packing in more β-turns per protein monomer. Specifically, as proline is the critical residue in forming a β-turn, the additional proline per GPGPQ motif, in comparison to MaSp2 GPGX motifs (where X is rarely P), may increase the number of β-turns per motif or may produce distinct secondary structures altogether, given the steric constraints imposed by proline. These possibilities might lead to increased dragline extensibility through the addition of more β-turns within the amorphous (non-crystalline) regions or through the decreased alignment of molecules along the fiber. Testing these hypotheses would require detailed analyses of recombinantly expressed C. darwini MaSp4, along with MaSp1 and MaSp2, to understand how combinations of these proteins interact at the biophysical level to affect fiber mechanics.
Proline-rich spidroin closely related to typical MaSps. To investigate C. darwini dragline evolution, we reconstructed ). The assemblies also included seven spidroin N-termini from MaSp4a, MaSp4b, MaSp2, PySp, MaSp1 variants and MaSp5 (Supplementary Data 1). Their relationships similarly showed MaSp4a and MaSp4b within the MaSp N-terminal clade, but closest to Argiope and C. darwini MaSp2 (Supplementary Fig. 1). MaSp1 and MaSp2 do not form reciprocally monophyletic clades, previously attributed to intergenic concerted evolution and selection to homogenize co-expressed termini 24. Nevertheless, our results imply the derivation of MaSp4 from a MaSp2 gene, consistent with the GPG-rich nature of both. Consequently, while MaSp4 retains terminal domains highly similar to typical dragline proteins, its repetitive structural sequence has substantially increased in proline, suggesting its adaptive evolution to support C. darwini's giant webs.
We hypothesize that the lengthened spinning duct of C. darwini's MA gland may facilitate alignment of spidroins to maintain dragline tensile strength (potentially by maintaining β-sheet formation) as MaSp4's GPGPQ-containing motifs introduce increased extensibility. The longer duct may enable this by allowing the dragline to form over a longer period. X-ray diffraction by Madurga et al. 41 showed that C. darwini dragline has 14% crystallinity (β-sheet structures), which was higher than the 7% crystallinity of Argiope aurantia dragline reported in that study, but is within ranges reported from Trichonephila clavipes dragline (10-28% crystallinity) from other studies 42,43. Accordingly, the lengthened MA duct of C. darwini may maintain dragline crystallinity, and hence strength, within ranges observed from other orb weavers, despite increased extensibility. This hypothesis should be tested in the future by investigating biochemical and physical processes along C. darwini's spinning duct.
Evolutionary implications and biomimetic applications. Thus, in addition to MaSp2 and MaSp1, C. darwini MA glands highly express MaSp4a transcripts, which encode a silk protein dominated by novel GPGPQ motifs. If these motifs form β-turns similar to GPGX motifs, as suggested by Garnier analysis, this would introduce more secondary structures resembling the nanosprings of flagelliform silk, by which C. darwini dragline could achieve greater toughness through increased extensibility. That the GPGPQ motifs of MaSp4a and MaSp4b appear restricted to C. darwini suggests a recent origin of these proteins within the genus from MaSp2 genes, consistent with selection for tough and extensible silk to support enormous orb webs. The lengthened spinning duct of C. darwini's MA gland may also contribute to assembly of especially tough dragline. Hence, a suite of traits from genes to physiology likely coevolved with the unique web architecture and ecology of C. darwini.
We anticipate these findings will be leveraged to produce silk-based materials mimicking the extraordinary toughness of C. darwini dragline. Such work could express C. darwini dragline spidroins in varying proportions, or engineer chimeric spidroins for biomaterials with enhanced functional properties 1. An important open question is how spinning duct length shapes material properties. Accordingly, this study reinforces the importance of evolutionary comparative studies for discovering biotechnology opportunities.

Methods
Sequencing of MA gland expression libraries. Major ampullate (MA) glands were dissected from C. darwini females reared by MK and MG in the laboratory, stemming from females collected in Andasibe-Mantadia National Park (around 18.94760°S, 48.41972°E at 960 m elev.), Toamasina Province, eastern Madagascar in 2012; additional dissections were made from specimens collected at the same locality in December 2017 (permit numbers 042N_EA04/MG12, 090/12/MEF/SG/DGF/DCB.SAP/SCB, 315N_EA12/MG17, and 280/17/MEEF/SG/DGF/DSAP/SCB, issued by Secretariat General, Direction des forets, Direction de la conservation de la biodiversite et du systeme des aires protegees). Major ampullate glands from seven C. darwini females were imaged with a dissecting microscope using Zeiss 2.3, along with MA glands from four female Argiope aurantia and three female Trichonephila clavipes. Duct and ampullate length were measured with ImageJ 1.50i (https://imagej.nih.gov/ij/). RNA was extracted from MA glands (one individual per extraction) by homogenization in TriZol, cleanup using Qiagen's RNeasy kit, and removal of DNA. Using one C. darwini MA gland RNA extraction (cd46), cDNA was synthesized at the UMass Medical School's Deep Sequencing Core (UMMS-DSC) using the Iso-Seq protocol 17 (Pacific Bioscience Inc., Menlo Park, CA, USA). cDNA was fractionated into two size distributions. The larger fraction >1.2 kb was used to construct a SMRTBell™ library, which was sequenced on three SMRTCells™ on a PacBio RS II instrument with 120-min movies. RNA from the MA silk glands of two individuals (cd46 and cd47) was submitted to the UMMS-DSC, where cDNA was synthesized separately for each individual using the creator SMARTer method (Takara Bio USA), and fragmented to 650 bp prior to Illumina RNA-Seq library construction. The two MA gland RNA-Seq libraries were sequenced on three separate MiSeq instrument runs, sequencing 300 bp paired-end reads. Illumina adapters and SMART oligos (Supplementary Data 9) used in cDNA synthesis were trimmed from reads using CUTADAPT 1.14 44, which was also used for quality trimming.
Assembly of transcriptomes. Data from the SMRT sequencing cells were processed with Pacific Biosciences' Iso-Seq1 pipeline (part of the SMRT Analysis 2.3 p5 pipeline), where the RS_IsoSeq Classify script was used to identify non-chimeric full-length transcripts (i.e., sequences containing 5′-3′ primers and poly A tails), which were used as input to the Cluster script to collapse highly similar sequences into non-redundant consensus isoforms using the ICE algorithm, followed by use of the QUIVER algorithm to "polish" isoforms with highly similar but non-full length transcripts corresponding to each full-length isoform 17,45. Given that this pipeline separates partial cDNAs of the same transcript only differing in length (because they are interpreted as full length), we further clustered tBLASTn hits to spidroin terminal domain queries using CD-HIT 4.6.4 46 at ≥95% nucleotide identity across their full length, selecting the longest sequence per cluster for analyses. A separate de novo transcriptome from all Illumina data was generated from 14.3 million reads using Trinity 2.0.6 47. Review by NCBI's TSA submission pipeline identified that 0.13% of Trinity contigs represented likely contaminant sequences, which were removed from the assembly. BUSCO 2.0 (ref. 48) was used to evaluate the transcriptome by assessing the presence and length of conserved single-copy orthologs from arthropod species. Assembled sequences were subject to BLASTx searches against NCBI's nr database using Diamond 0.8.23 49, retaining hits with e-scores ≤e − 0.5. Translations were produced based on the frame of significant BLAST hits, or the longest open reading frame in the absence of a BLAST hit.
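The step of selecting the longest sequence per cluster can be sketched in Python as follows. This only illustrates the post-clustering selection and does not reproduce CD-HIT itself: the function `longest_per_cluster`, the sequence identifiers, and the cluster labels are all hypothetical, and cluster membership is assumed to be already available as a mapping.

```python
def longest_per_cluster(
    sequences: dict[str, str], clusters: dict[str, str]
) -> dict[str, str]:
    """Map each cluster ID to the ID of its longest member sequence.

    `sequences` maps sequence ID -> sequence; `clusters` maps sequence ID -> cluster ID.
    When two members tie in length, the one encountered first is kept.
    """
    representative: dict[str, str] = {}
    for seq_id, cluster_id in clusters.items():
        current = representative.get(cluster_id)
        if current is None or len(sequences[seq_id]) > len(sequences[current]):
            representative[cluster_id] = seq_id
    return representative


# Hypothetical identifiers and toy sequences for illustration.
toy_sequences = {"iso1": "ATGGCC", "iso2": "ATGGCCGGA", "iso3": "ATGTTT"}
toy_clusters = {"iso1": "clusterA", "iso2": "clusterA", "iso3": "clusterB"}
representatives = longest_per_cluster(toy_sequences, toy_clusters)
print(representatives)
```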
Spidroin characterization. Spidroin sequences in transcriptome assemblies were identified using tBLASTn with known spidroin N- and C-terminal domain protein sequences as queries. getORF of EMBOSS 6.6.0 (http://emboss.sourceforge.net/apps/cvs/emboss/apps/getorf.html) was used to translate spidroin transcripts, and the longest translation in the frame of the BLAST hit was identified. CD-HIT 46 was used to cluster translated spidroins into groups with full-length, identical terminal domains; manual inspection of all spidroin BLAST hits identified additional sequences containing complete termini, all of which were clustered into groups sharing ≥95% amino acid identity across the terminal domain. For each spidroin sequence cluster we characterized the repetitive structure of the longest sequence based on previously defined spidroin motifs 9,13,50. Larger iterated ("ensemble") repeats composed of combinations of these motifs were defined by aligning highly similar consecutive sequence within proteins using MUSCLE v. 3.8.31 51, and computing a consensus reporting the modal residue for each position.
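A consensus that reports the modal residue at each alignment position can be sketched in Python as below. The function `modal_consensus` and the toy alignment are hypothetical; the treatment of gaps (columns whose most common character is a gap are dropped) and of ties (the residue seen first among tied residues is reported) are assumptions of this sketch, not details taken from the analysis above.

```python
from collections import Counter


def modal_consensus(aligned: list[str], gap: str = "-") -> str:
    """Build a consensus from equal-length aligned sequences using the modal residue per column."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned):
        # most_common(1) returns the most frequent character; ties keep first-seen order.
        residue, _ = Counter(column).most_common(1)[0]
        if residue != gap:
            consensus.append(residue)
    return "".join(consensus)


# Hypothetical aligned repeats for illustration only.
toy_alignment = ["GPGPQGPGG", "GPGPQGPGA", "GPGSQGPGG"]
consensus = modal_consensus(toy_alignment)
print(consensus)
```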
Sequences were designated as MaSp proteins if terminal domains were most closely related to previously defined MaSp termini. The MaSp1 or MaSp2 designation was based on the presence of amino acid motif combinations in the repetitive region previously defined as characteristic for those proteins 9,50. As the recently described MaSp3 19 was not identified among C. darwini transcripts, MaSp sequences newly described here were named MaSp4a, MaSp4b, and MaSp5, where MaSp4a and MaSp4b may represent alleles of the same protein or closely related protein paralogs. Protein secondary structure (e.g., percent helices, sheets, and/or turns) was bioinformatically predicted with the garnier EMBOSS plugin in Geneious 9.1.8 52.
Expression analyses. Trimmed Illumina reads from the two MA gland RNA-Seq libraries were used to estimate abundance of all Trinity assembled transcripts in TPM using Salmon v0.8.2 in quasi-mapping mode 53 (Supplementary Data 6). Determining spidroin expression is challenging because of fragmented transcripts and incorrect mapping of repetitive regions, especially given that Trinity-assembled spidroin transcripts typically contain differing lengths of repetitive sequence 20,54. Thus, in addition to estimating TPM for all assembled Trinity-derived transcripts, we also reduced sequences in the Illumina assembly containing identical spidroin C-termini to a single representative, trimmed to 500 bp including the nonrepetitive termini. TPM was re-estimated for all transcripts in this revised assembly using Salmon v0.8.2 53, aggregating TPM for spidroin C-termini if they shared ≥95% identity at the amino acid level (Supplementary Data 7).
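The TPM unit and the aggregation across near-identical C-termini can be made concrete with a short Python sketch. It uses the standard definition, TPM_i = 10^6 × (c_i / l_i) / Σ_j (c_j / l_j), with read counts c and transcript lengths l; Salmon estimates these quantities with effective lengths and a probabilistic model, so this is only an arithmetic illustration. The function names, transcript identifiers, and numbers are hypothetical.

```python
def tpm(counts: dict[str, float], lengths: dict[str, float]) -> dict[str, float]:
    """Convert per-transcript read counts to transcripts per million (TPM)."""
    rates = {t: counts[t] / lengths[t] for t in counts}  # reads per base
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in counts}
    return {t: 1e6 * rate / total for t, rate in rates.items()}


def aggregate_tpm(values: dict[str, float], groups: dict[str, str]) -> dict[str, float]:
    """Sum TPM over transcripts assigned to the same group; ungrouped transcripts keep their own ID."""
    totals: dict[str, float] = {}
    for transcript, value in values.items():
        group = groups.get(transcript, transcript)
        totals[group] = totals.get(group, 0.0) + value
    return totals


# Hypothetical counts, lengths (bp), and cluster labels for illustration.
toy_counts = {"t1": 100, "t2": 300, "t3": 100}
toy_lengths = {"t1": 500, "t2": 500, "t3": 1000}
toy_groups = {"t1": "MaSp4a", "t2": "MaSp4a"}

toy_tpm = tpm(toy_counts, toy_lengths)
toy_aggregated = aggregate_tpm(toy_tpm, toy_groups)
print(toy_aggregated)
```

Because TPM values sum to one million within a library, summing them over transcripts that share a C-terminus gives the combined share of expression attributable to that spidroin group.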
Spidroin phylogenetic analyses. Spidroin terminal domain sequences (see Supplementary Data 8 for Accession numbers) were used in phylogenetic analyses 13 along with C. darwini sequences sampling C-termini representing gland-associated spidroins (TuSp1, MiSp, Flag, AcSp1, PySp1, AgSp) having linked N-terminal domains from different araneoid species, and including a greater sampling of MaSp C-termini from the family Araneidae, in which Caerostris is classified. N-terminal phylogenetic analyses included sequences linked to C-termini used in the aforementioned C-terminal phylogenetic analysis. Sequences were aligned with MUSCLE 3.8.31 51. Bayesian phylogenetic trees were generated from amino acid alignments using MrBayes v. 3.2.6 55. Markov Chain Monte Carlo sampling with MrBayes v. 3.2.6 was run with default priors, but implementing a mixed amino acid model plus gamma distribution for 5 × 10^6 generations, using three heated chains and one cold chain. Consensus Bayesian trees were computed from post burn-in trees (discarding the first 25%), and rooted using a mygalomorph spidroin (B.c. fibroin 1).
Silk fiber and gland protein analyses. Dragline fibers were collected from forcibly silked C. darwini females. Three samples of spun dragline from three individuals and one pair of C. darwini MA glands from a single individual were sent to the UC Davis Molecular Structure Facility, and hydrolyzed with 6 N HCl for 24 h at 110 °C. This was followed by ion-exchange chromatography using an L-8800 Hitachi analyzer coupled to a post-column ninhydrin reaction system to separate and detect amino acids. One dragline sample was also run on an L-8900 Hitachi analyzer using a lithium citrate buffer to detect hydroxyproline. Results were used to compute percent molarity of amino acids in samples. Amino acid compositions of spidroin sequences were determined with ProtParam (http://web.expasy.org/protparam).
Statistics and reproducibility. Measurement averages for silk glands are presented as mean ± standard deviation, with the number of independent biological replicates (different individuals) reported in the main text. Statistical analyses comparing differences of means were conducted with two-tailed unpaired t tests. Transcript expression measurements using RNA-Seq data were repeated with MA glands from two individual adult females.
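The summary statistics named above can be illustrated with Python's standard library. This sketch assumes the equal-variance (Student's) form of the unpaired t test, which the text does not specify, and it stops at the t statistic and degrees of freedom; a two-tailed p-value would come from the t distribution (for example via `scipy.stats.ttest_ind`). The function name and the measurements are hypothetical, not data from this study.

```python
import math
import statistics


def unpaired_t(sample_a: list[float], sample_b: list[float]) -> tuple[float, int]:
    """Return (t statistic, degrees of freedom) for Student's unpaired t test with pooled variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    df = n_a + n_b - 2
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, df


# Hypothetical duct-length measurements (arbitrary units) for two groups.
group_a = [1.0, 2.0, 3.0]
group_b = [4.0, 5.0, 6.0]

print(f"group A: {statistics.mean(group_a):.2f} ± {statistics.stdev(group_a):.2f}")
print(f"group B: {statistics.mean(group_b):.2f} ± {statistics.stdev(group_b):.2f}")
t_stat, dof = unpaired_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof}")
```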


Fig. 1 Unusual proline-rich proteins (MaSp4a and MaSp4b) among Caerostris darwini major ampullate spidroin (MaSp) proteins. a Schematic of spider silk proteins (spidroins) composed of tandem-repeated amino acid sequences (ensemble repeats) flanked by non-repetitive amino (N-) and carboxy (C-) terminal domains. b Caerostris darwini MaSp proteins with each box presenting consensus repeat and repeat numbers determined from the longest sequence of each type (full sequences in Supplementary Data 10); poly-alanine (Aₙ) motifs in red, GPGX (X = G, S, A, or Q) in blue, and GPGPQ motifs in purple. c MaSp4a protein aligning consecutive ensemble repeats with C-terminal domain italicized, highlighting GPGPQ and GPGX as above

Fig. 2 Amino acid composition and expression of Caerostris darwini major ampullate (MA) spidroin (MaSp) proteins consistent with MA gland and dragline fiber composition. a Amino acid composition of C. darwini MaSp and Flag proteins (percent five most abundant residues) based on translation of longest transcript for each (Supplementary Data 1, 3). b Percent amino acid composition of dragline (major ampullate) fibers (n = 3) against major ampullate gland composition (n = 1), five most abundant residues color coded as in part a (Glx = glutamine + glutamate). c Expression of spidroin transcripts in C. darwini major ampullate silk glands from two gland-specific RNA-sequencing (RNA-Seq) libraries. Expression measured in transcripts per million (TPM) in replicate individuals (cd46 and cd47), values listed in Supplementary Data 7. d Predicted glycine, alanine, and proline composition from expression data closely matched MA gland and dragline composition (symbols represent individual data points)

Fig. 4 Darwin's bark spider (Caerostris darwini) spins the largest orb webs using silk glands with unusually long spinning ducts. a Orb web of C. darwini illustrating major web elements: web frameline and radial lines composed of dragline (major ampullate (MA)) silk, and capture spiral composed of flagelliform silk fibers coated with aggregate silk glue. b Caerostris darwini MA silk gland showing region of silk protein secretion (tail), silk solution storage sac, and fiber synthesis site (S-shaped spinning duct), indicating two loops joining three duct "limbs." Another C. darwini individual (c) and MA gland of Argiope aurantia (d), with arrows pointing to second loop of spinning duct for comparison. Scale bars in b, d = 2 mm