The Nephila clavipes genome highlights the diversity of spider silk genes and their complex expression

Spider silks are the toughest known biological materials, yet are lightweight and virtually invisible to the human immune system, and they thus have revolutionary potential for medicine and industry. Spider silks are largely composed of spidroins, a unique family of structural proteins. To investigate spidroin genes systematically, we constructed the first genome of an orb-weaving spider: the golden orb-weaver (Nephila clavipes), which builds large webs using an extensive repertoire of silks with diverse physical properties. We cataloged 28 Nephila spidroins, representing all known orb-weaver spidroin types, and identified 394 repeated coding motif variants and higher-order repetitive cassette structures unique to specific spidroins. Characterization of spidroin expression in distinct silk gland types indicates that glands can express multiple spidroin types. We find evidence of an alternatively spliced spidroin, a spidroin expressed only in venom glands, evolutionary mechanisms for spidroin diversification, and non-spidroin genes with expression patterns that suggest roles in silk production.

More than 380 million years of evolution have produced >46,000 extant spider species, exhibiting an incredible diversity of silks used for prey capture and reproduction [1][2][3] . Spider silks can be stronger than steel and tougher than Kevlar, yet are much lighter weight than these manmade materials 4 . Silks vary in extensibility 5 , are temperature resilient 6 , can enable electrical conduction 7 , and can inhibit bacterial growth while being nearly invisible to the human immune system 8 . Thus, novel materials derived from spider silks offer tremendous potential for medical and industrial innovation. To take advantage of their desirable properties, we must learn more about spider silk genetic structure, functional diversity, and production.
A female orb-weaving spider can have up to seven morphologically differentiated types of silk glands, each believed to extrude a distinct class of silk with biophysical characteristics resulting from the expression of a unique combination of silk genes in that gland 9,10 . The silk classes of a typical 'gluey silk' orb-weaver (Araneoidea) female include (i) major ampullate silk, which exhibits great tensile strength and is employed in draglines, bridgelines, and web radii 11,12 ; (ii) minor ampullate silk, used for inelastic temporary spirals during web building 11,12 ; (iii) cement-like piriform silk that bonds fibers together and to other substrates 13,14 ; (iv) strong, yet flexible aciniform silk used for prey wrapping and egg case insulation 15 ; (v) tubuliform and cylindriform silk that constitutes the tough outer layer of egg cases 16,17 ; (vi) flagelliform silk that exhibits unparalleled extensibility and is used in the capture spiral 18,19 ; and (vii) the viscous and sticky aggregate silk that aids in prey capture [20][21][22][23][24] . Many spider species produce just a subset of these silk classes, and some produce yet other silk types, including cribellate silk 25 . Each species possesses an assortment of specialized gland types that are thought to produce distinct classes of silks to fit specific needs 9,26,27 .
Spider silks are composed primarily of spidroin proteins (where a 'spidroin' is a spider fibroin [28][29][30][31] ) that, by convention, have been named and classified according to the specific silk gland in which they were first discovered. Spidroin proteins have conserved N-and C-terminal domains that flank long runs of repeated motifs [32][33][34] , the composition and number of which confer specific physical properties to silks 27 . Yet, despite decades of research on orb-weaver silks, there is incomplete knowledge of all the spidroins within an orb-weaver species.
Adding to the sampling of sequences obtained from targeted investigations, the assembly of the velvet spider (Stegodyphus mimosarum) genome yielded 19 spidroins, the largest collection from any single species 27 . Owing to the challenges of assembling arrays of repeats, several of the S. mimosarum spidroin sequences are incomplete, without the sequences encoding N-and C-terminal domains anchored on a single scaffold [10][11][12] . Furthermore, this cribellate-sheetweb-building spider lacks the flagelliform and aggregate silks found in orb webs, limiting the diversity of spidroin sequences cataloged from a single The Nephila clavipes genome highlights the diversity of spider silk genes and their complex expression spider species. In contrast, female golden orb-weaver spiders (N. clavipes) use silks from all seven of the araneoid silk gland types 35 (Supplementary Fig. 1a-c). The first spidroins were characterized from N. clavipes [28][29][30][31] , and this species has continued to be useful for investigating silk genes, their diversity and structure, and their evolutionary history as a gene family 29 . Surprisingly, the full genome of this extensively studied species, the "ubiquitous workhorse of spider research" (ref. 2), has not been reported.
We present the first annotated genome of an orb-weaving spider, cataloging 28 N. clavipes spidroins, including 8 that were previously unreported. Characterization of the repetitive sequences found in these genes has yielded numerous novel motifs and new variants of previously reported motifs 9,10 . Many of these motifs occur in iterated groups, and we catalog as many as 506 unique 'cassettes' that feature two to four contiguous motifs and that are themselves organized into larger repetitive units (~200 amino acid residues) known as ensemble repeats 30,36 . The N. clavipes genome provides evidence for evolutionary mechanisms like tandem duplication that may underlie spidroin diversification, and our data support estimates that rapid silk evolution accompanied the emergence of the orb web ~213 million years ago 37 .
We used the results of our genome-wide approach to profile transcripts from all loci in tissues isolated from N. clavipes females. Using quantitative expression analysis of the 28 N. clavipes spidroins in this spider's morphologically distinct silk glands, we have examined the idea that spiders have evolved multiple types of silk glands that produce unique combinations of silk proteins, usually with one or two spidroins dominating [38][39][40][41] (Supplementary Fig. 1b). Our complete expression profile of N. clavipes spidroin transcripts across the set of silk glands reveals the fuller extent of this phenomenon. We demonstrate that a novel N. clavipes spidroin is expressed exclusively in venom glands rather than silk glands, a radical change in gene regulation. We detect alternative splicing of a spidroin transcript, a mechanism conjectured for spidroins 31 but not, to our knowledge, previously shown. We also identify non-spidroin genes that are highly expressed in silk glands, suggesting these genes as candidates for further study of spider silk production.

An annotated genome for N. clavipes
We sequenced genomic DNA isolated from field-collected N. clavipes females and used a combination of strategies to de novo assemble 2.44 Gb of genome ( Table 1, Supplementary Tables 1-4, and Supplementary Note). The predicted size of the entire N. clavipes genome is 3.45 Gb, with 55% estimated as repetitive sequence. Our annotated meta-assembly consists of 180,236 scaffolds (N50 scaffold size, 62.9 kb; N50 contig size, 8.1 kb), with 98.5× coverage from remapping of over 2.48 billion 100-bp unique reads (48.9× from unique pairs; Supplementary Table 5).
To determine gene locations within the N. clavipes genome, we sequenced RNA from 16 different tissue isolates (for example, whole body, brain, and individual silk and venom glands) collected from four female individuals and then de novo assembled the transcriptome for each isolate using strand-specific 100-bp paired-end reads (Supplementary Note). We also assembled a transcriptome representing the union of all isolates (1.53 billion reads; Table 1 and  Supplementary Tables 6 and 7). To quantify the completeness of the protein-coding genome, we searched our draft assemblies for homology to >2,000 curated arachnid sequences 42 , and we estimate that our draft genome is 94% complete and our all-isolate transcriptome assembly is 99% complete (Supplementary Tables 5 and 7).
In total, >32 million features are annotated on the N. clavipes draft genome (Supplementary Table 8), which was achieved using (i) our transcriptome from all isolates, (ii) results from two gene prediction algorithms, (iii) libraries of transposable elements and repeated motifs, and (iv) coding sequences from related species 18,19 (Supplementary Table 8). Using gene modeling, we conservatively predict 14,025 genes present in the N. clavipes genome; 2,023 gene models transcribe >1 alternative spliceoform, resulting in 3,937 additional transcripts for a total of 17,962 mRNAs in the final gene set (Supplementary Table 9).

A first-generation araneoid spidroin catalog
To identify N. clavipes spidroin genes, we searched the assembled genome, transcriptomes, and annotated gene models for sequences similar to published spidroins (Supplementary Table 10), finding 28 candidates. We used long-range PCR followed by single-molecule real-time sequencing to reconstruct and validate each assembled locus at >100× coverage (Supplementary Table 11 and Supplementary Note). For 27 of the 28 spidroins, the sequences encoding the N-and C-terminal domains were connected on a single scaffold (Fig. 1). We obtained 20 complete full-length spidroin sequences, and, while gaps persist in the remaining spidroins, substantial portions of their repeated motif structures are described (Fig. 1). We note three partial spidroin-like sequences that could not be assembled, suggesting that there are additional N. clavipes spidroins yet to be characterized (Supplementary Table 12).
To assign correspondence between our N. clavipes spidroins and those previously described, we performed alignments of conserved N-and C-terminal sequences and internal motifs reported to be specific to a particular spidroin class [43][44][45] . Most of our N. clavipes ) eluded assignment to the known classes on the basis of these alignments, suggesting the existence of additional spidroin classes or that class boundaries are less defined by sequence than previously assumed ( Fig. 1 and Supplementary Figs. 2 and 3).
To catalog repetitive elements found within N. clavipes spidroins, we performed computational motif discovery and labeling 43 , followed by searches for larger repetitive structures (Supplementary Note). We observed 394 unique motif variants, ranging from 4 to 34 amino acids in length (Fig. 2a). In addition to previously reported motifs, hundreds of the N. clavipes motif variants were completely novel and others were new variants of previously documented motifs ( Fig. 2a and Supplementary Table 13). Arrays of repeated motifs spanned 50-96% (median 81%) of the internal coding lengths of the 20 complete spidroins (Fig. 2b). The overall diversity and complexity of these repetitive structures were greater than expected, considering previous reports 29 .
To better understand their diversity, we organized the unique motif variants into 49 motif groups on the basis of homology comparisons. One motif group consisted of GXGGX-containing motif variants, including the well-known motif variant GPGGY 18,29 (Fig. 2a). Polyalanine motif variants 29 , four to ten residues in length, were grouped with novel polyalanine-like motif variants that also contained other residues (Supplementary Table 13). Meanwhile, one of the most frequently occurring motif groups was the novel DTXSYXTGEY group. Confined to two aggregate and one unclassified spidroin, three variants of this motif cumulatively occurred 554 times. Other frequently occurring novel motif groups included (i) GPGTTPGTI, (ii) multi TTX, (iii) multi GL, (iv) multi SQ/XQQ, and (v) non-alanine homopolymer runs. MaSp-g contained 73 different unique motif variants from 20 motif groups, the largest number observed in N. clavipes (Fig. 2b). AgSp-d contained the longest array (n = 546) of repeated motif occurrences (Supplementary Fig. 7). MaSp-f, MaSp-g, and MiSp-c displayed the greatest diversity, with motifs representing 20 of the 49 motif groups ( Fig. 2b and Supplementary Table 13).
In the N. clavipes catalog of spidroin sequences, 46 of the 49 motif groups (260 of the 394 motif variants; 66%) were found in multiple spidroin genes (Figs. 2b and 3a). Strikingly, 204 of the 260 (78%) shared motifs were found in multiple silk classes. Having motifs in common appears to be a prevalent feature of spidroins ( Fig. 3a-e), with MaSp-g containing the largest number of shared motifs (n = 63; Fig. 3a). We noted enrichment of shared novel motifs among the aggregate and unclassified spidroin classes (Fig. 3d). FLAG-a and the new flagelliform, FLAG-b, shared repeated motifs with all spidroins, but, curiously, both displayed less sharing with the aggregate spidroin AgSp-d (Fig. 3f,g). The DTXSYXTGEY motif was found predominately in AgSp-d, suggesting that this motif may confer some of the distinctive properties of this putatively sticky, non-fibrous spidroin.
We also observed second-order repetitive organization of repeated motifs 43,47 . We defined a cassette as the tandem occurrence of unique motif variants repeated two to four times across spidroin sequences, and these cassettes were often organized into larger ensembles 32,48 . Cassettes were present in all N. clavipes spidroins (Fig. 4a,  Supplementary Fig. 8, and Supplementary Table 14). We identified 506 different cassette types and 1,440 occurrences (Fig. 4a), spanning 25-95% of the motif arrays in the 20 full-length spidroins ( Fig. 4b and Supplementary Fig. 8). Our catalog of cassettes included documented combinations of motifs such as XGGXGGX + polyalanine, GPG + polyalanine, and GPG + GXGGX 9,49   We examined the extent of cassette sharing across spidroins. Of the 506 distinct cassettes, 480 (95%) were private to individual genes, in striking contrast to the extensive sharing of motifs ( Fig. 4a,b). Cassette sharing existed mainly in the major and minor ampullate classes, but these genes still contained substantial numbers of private cassettes. Ten spidroins (MaSp-a, MaSp-d, MaSp-f, TuSp, AcSp, Sp-14910-B, FLAG-b, Sp-8175, AgSp-a, and AgSp-b) contained only private cassettes (Supplementary Fig. 8). These observations support the idea that shared motifs assembled into distinct private cassettes may differentiate spidroin gene functionality, conferring the different physical properties of silks 30,36 .

Expression profiling of individual spidroins across multiple tissues
Previous experimental findings have suggested that spidroin expression is not exclusively gland specific [50][51][52][53][54] . Our RNA sequencing studies also suggested broader patterns of spidroin transcript expression across silk glands (Supplementary Fig. 9 and Supplementary Note). To better understand the regulation of spidroin genes and the mechanisms of silk gland specialization, we directly interrogated the degree of spidroin expression bias by using qPCR to measure the RNA transcript levels of the 28 N. clavipes spidroins in morphologically classified silk glands and control tissues collected from three adult females ( Supplementary Fig. 1b and Supplementary Note). We prepared three cleanly separated isolates of all morphologically distinct gland types except for the aciniform and piriform glands, which because of their proximal anatomical locations could not be cleanly separated and were therefore treated as a combined sample ("other silk glands"; Supplementary Note). In every silk gland assayed in our experiments, spidroin transcripts belonging to more than one silk class were detected (Supplementary Figs. 10 and 11). As expected from the bulk of previous studies, we found examples of spidroin genes from each class that were highly expressed in their corresponding morphologically distinct silk gland (for example, MiSp-c in minor ampullate gland; Fig. 5a). Our data also identified several cases in which spidroin transcripts were expressed abundantly in glands that did not correspond with the gene name (for example, MaSp-h in tubuliform gland; Fig. 5a). Some spidroins appeared to be expressed in all of the silk glands assayed (for example, AgSp-d; Fig. 5a). However, it is important to note that spidroin gene names have historically been conferred on the basis of the gland from which the gene was first cloned and do not assume exclusivity in expression. While we detected transcripts for the tubuliform spidroin TuSp in several silk glands, the expression of this gene was not highest in the tubuliform gland (Supplementary Figs. 10 and 11), possibly because the adult females from which the tissue samples were collected were not in the process of preparing to spin egg casings or might have recently done so 32,48 . Several genes showed strong expression in a particular gland type, providing clues regarding their function (for example, Sp-5803 in flagelliform gland; Fig. 5a). When viewed as a profile across silk glands, we found that many of our unknown spidroins showed patterns of expression similar to those of members of known silk classes, providing hypotheses for their functionality (for example, the profile of Sp-8175 expression strongly correlated with the expression profile of AgSp spidroins, and the profile for Sp-74867 correlated with that of MaSp spidroins; Fig. 5b).
It is generally assumed that spidroin expression is confined to silk glands, but this was not the case for novel FLAG-b (Fig. 5c). In fact, the highest abundance of FLAG-b transcripts was detected in venom glands, a finding consistent with results from our RNA sequencing studies (Supplementary Fig. 9). PR-1 (a known venom toxin 55 ) and FLAG-a (the established flagelliform spidroin) transcripts were enriched in the expected tissues and served as controls (Fig. 5c). Normalized FLAG-b transcript levels were ~1,000-to 5,000fold higher in venom gland than they were in silk glands (Wilcoxon rank-sum test, P = 0.00075).

Extreme spidroin diversity and evolutionary origins
The multiexon structure of some spidroins has led to conjecture that alternative splicing may increase transcript diversity 9,49 , although evidence   A r t i c l e s for spidroin spliceoforms has not previously been confirmed experimentally 30,36 . We detected split reads that are evidence of alternative splicing of MaSp-f transcripts into two spliceoforms: the major full-length isoform and a minor isoform lacking most of the second exon (Fig. 5d). Given that the second exon of the full-length isoform encodes the putative C-terminal domain, which has been shown in other MaSp spidroins to act as a switch between storage and assembly forms of silk proteins and as a facilitator of protein organization in silk formation 56 , this raises the possibility that the second, truncated MaSp-f isoform transitions or organizes in a different manner than the full-length isoform. To identify non-spidroin candidate genes potentially involved in silk production, we cataloged transcripts that were (i) highly expressed and/or (ii) uniquely expressed in N. clavipes silk glands, resulting in a list of 649 candidates from our RNA sequencing data ( Supplementary Fig. 12 and Supplementary Table 15); 183 of these genes exhibited homology to documented silk gland-specific transcripts [50][51][52][53][54] . The candidates included catalytic enzymes such as kinases, proteases, dehydrogenases, acetyltransferases, and synthases, many of which are active in eukaryotic secretory systems 57 . We expect this catalog to include genes encoding proteins involved in the conversion of liquid silk dope to solid silk thread 52,53 , such as enzymes that maintain the pH gradient along the gland body to spigots on the spinnerets 50,54 . Candidates that might generate ions for the pH gradient include three carbonic anhydrase orthologs (Ca10, Ca13, and Ca14), four thyroid peroxidase (Tpo) paralogs, and five chorion peroxidase (Pxt) paralogs (Supplementary Table 15).
We found evidence for at least two different evolutionary mechanisms that might contribute to the diversification of spidroin loci. First, we saw evidence of new spidroin genes originating from  -14910-B). We note here that the profiles of expression across silk glands between the pair-mates of the two pairs of tandem spidroins were strongly correlated (Fig. 5b). Second, we noted a ~2-fold higher level of polymorphism in sequences encoding the N-and Cterminal domains of spidroins (mean θ W = 0.002) in comparison to levels observed across the coding genome (mean θ W = 0.0009; Wilcoxon rank-sum test, P = 5.4 × 10 −6 ) (Supplementary Fig. 13a-c). This is a signature consistent with long-term balancing selection occurring at spidroin loci.

DISCUSSION
To translate the remarkable properties of spider silks into innovative medical and industrial applications, it is necessary to expand knowledge of the diversity of the underlying gene structures, the relationships between structure and function, and final product synthesis. By generating the first annotated genome of an orb-weaving spider, we have taken an important step toward these ends. Our efforts-made possible by previous studies that documented spidroin sequences from several spider species 18,27,43 -have enabled us to catalog an extensive collection of 28 spidroins representing the full spectrum of araneoid silk classes, identifying 8 previously unreported spidroins and providing a wealth of new repetitive elements. Complemented by silk gland-specific expression profiling, this collection of data greatly expands understanding of spidroin gene diversity, structure, and expression across morphologically distinct silk glands. Every silk gland that we profiled was found to express broad combinations of spidroin transcripts representing multiple classes. Together with analogous non-gland-specific expression patterns seen for individual spidroins in other spider species 27,53 , these data argue for complex, gland-specific models of spidroin expression and silk production. Future studies that measure the levels of spidroin proteins in silk glands would be useful to validate the correspondence of transcript abundance to translational output and will provide an additional dimension to understanding of silk production.
The generation of full-length sequences and assemblies has shed light on the evolutionary mechanisms acting on the N. clavipes spidroin gene family. Evidence of tandem gene duplication and high levels of polymorphism suggests that spidroins are naturally selected to maintain diversity. The presence of long arrays of tandem repeats in spidroins, reminiscent of those that facilitate rapid variation on microbial cell surfaces 61 and morphological plasticity in mammals 62 , also suggests a gene family undergoing continuous evolution. In particular, N. clavipes spidroin gene phylogenies ( Fig. 1 and Supplementary  Figs. 2 and 3) provide evidence for the shared origin and nearly simultaneous expansion of the entire flagelliform and aggregate spidroin classes, supporting expectations that they are tightly linked to one another functionally and to the origin of the viscid orb web 32 .
Highlighting the diversity that has evolved within the spidroin family, perhaps the most remarkable example reported here is the novel flagelliform gene FLAG-b. This gene's sequence similarity and phylogenetic affinity to canonical flagelliform FLAG-a 18 (Supplementary  Figs. 2 and 3) suggest that it arose as a second genomic copy from a duplication event, yet FLAG-b transcripts are highly abundant in venom glands. This discovery is reminiscent of a puzzling detection of peptides from two spidroin-like proteins (S.m. Sp1 and S.m. Sp2b) in the venom gland of the velvet spider 27 . Our expression data add clarity to this issue, suggesting that FLAG-b has evolved functions beyond silk-related applications in N. clavipes, and refute the idea that spidroin expression is restricted to silk glands. As such, this flagelliform may represent a new kind of venom gland-expressed spidroin (VeSp). This observation suggests promising avenues for future research on links between spider silk and venom, both composed of complicated proteins whose production is a functional synapomorphy for the order Araneae. In this context, the spitting spider (Scytodes sp.) is particularly interesting, as it is the only spider to exude from its chelicerae fibrous 'venom' that is used to immobilize prey by gluing it to a substrate [63][64][65] . However, a recent   transcriptomic and proteomic investigation did not find evidence of spidroins in the venom of Scytodes thoracica 66 , suggesting a different evolutionary route for this functional convergence. Proteomic studies could demonstrate whether FLAG-b is indeed translated into a protein in N. clavipes venom glands. If so, this raises the intriguing possibility that FLAG-b functions as a buffer, chaperone, adhesive, or preservative for the smaller bioactive compounds found in N. clavipes venom and thus may provide a novel use of spidroins in human medical applications. Systematic characterization of the diversity, number, and structure of the repetitive elements found in N. clavipes spidroins yields the key observations that 66% of repeated motif variants are shared both within and across spidroin classes, whereas 95% of cassette variants are private to individual spidroins. Together, these observations suggest that the assembly of shared motifs into distinct cassettes, often organized into larger repeat ensembles 29 , may underlie the range of unique biophysical characteristics observed for the various spider silk classes-an idea also supported by recent results from transgenic spidroin expression in silkworm 33,67 . Given the relationships already observed between spidroin motif sequences and structural properties of silk proteins 34 , the extensive catalog of novel motifs and combinations reported here provides many candidates for transgenic studies. Our catalog can provide deeper understanding of the interplay between silk genes, silk protein structure, and the biomechanical properties of silks and will underlie future efforts to capture the extraordinary properties of spider silks in manmade materials.

METhODS
Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper.

ONLINE METhODS
For expanded methodological details, please refer to the separate Supplementary Note, Supplementary Figures 1-13 De novo genome assembly. Raw FASTQ read files were evaluated using FastQC (v0.11.2) and then trimmed using Trimmomatic (v0.32) 68 to remove adaptor read-through, low-quality bases, and ambiguous base calls. All jumping matepair DNA libraries were processed using the program FastUniq (v1.1) 69 to remove duplicate read pairs. The N. clavipes genome was assembled de novo using a meta-assembly approach. Two draft assemblies were constructed in parallel using AllPaths-LG vR49967 (ref. 70) and SOAPdenovo2 (v2.04) 71 and were then merged using Metassembler (v1.1) 72 . Genomic quality metrics were calculated for all N. clavipes assemblies using scripts from the Assemblathon 2 competition 73 , available at https://github.com/ucdavis-bioinformatics/assemblathon2-analysis/. To assess the genome's functional 'completeness' , the Benchmarking Universal Single-Copy Orthologs (BUSCO) gene mapping method 74 was also applied to all N. clavipes assemblies to identify conserved proteincoding genetic loci. All single-copy gene sequences from Ixodes scapularis (deer tick) were extracted from the BUSCO Arthropod gene set, a 95% refinement cutoff was applied, and 2,058 I. scapularis loci were used to query the completeness of all intermediate and final N. clavipes assemblies (Supplementary Table 5), as well as the all-isolate transcriptome assembly described below (Supplementary Table 7).
See the Supplementary Note for additional details.
De novo transcriptome assembly. After quality control and filtering of reads, all RNA libraries were de novo assembled together as a primary all-isolate transcriptome using Trinity (rel_2.25.13) 75,76 (Supplementary Table 5).
Meanwhile, 16 tissue-specific transcriptomes were individually de novo assembled using Trinity (Supplementary Tables 1, 5, and 6). All transcripts were aligned back to the genome using the splice-aware mRNA/EST aligner GMAP (rel_10.22.14) 77 , and reads from each RNA library were aligned to the genome using RNA-STAR (2.4.2a) 78 (Supplementary Table 6). See the Supplementary Note for additional details.
Genome annotation. Genomic features were defined on the final N. clavipes meta-assembly using four successive rounds of the annotation pipeline Maker2 (ref. 38). Repetitive regions were identified using RepeatRunner (supplied with Maker2), RepeatMasker v4.0.5 with RMblast, and RepBase repeat libraries 79 and subsequently masked for downstream gene modeling. Tandem repeats were identified using Tandem Repeats Finder (v4.07b) 80 . Gene models were based on multiple types of evidence: alternate species protein sequence alignments, alternate species EST/mRNA/cDNA sequence alignments, de novo-assembled transcripts from N. clavipes RNA-seq experiments, and ab initio gene predictions. Protein and EST/mRNA sequences were collected from online databases (Supplementary Table 8). Exon boundaries were marked using Exonerate (v2.2.0) 81 , and tRNAs were identified by tRNAscan-SE 82 . Feature boundaries were further polished by Maker2, directing successive rounds of trained predictions from SNAP (rel_11.29.13) 39 and Augustus (v3.0.2) 40 (WebAugustus 41 ). In total, >32 million genomic features and 403,888 putative genes were modeled on the final N. clavipes annotated meta-assembly (Supplementary Table 9). Putative gene model identities were established by the reciprocal alignment of model protein sequences to the UniProtKB/SwissProt 83 protein database (v6.3.15) using BLASTP 84 and Maker2 accessory scripts.
Five tiers/sets of gene models with increasing stringency were defined on the basis of agreement among coding feature annotations, conserved protein domains, eukaryotic gene structure, and similarities with curated gene databases (Supplementary Table 9). The 'standard' gene set (54,186 genes, 58,132 mRNAs) contained only gene models that possessed known protein domains from the InterPro Pfam database 85 and was produced using BLASTP 84 Fig. 2). The intraspecific N. clavipes spidroin phylogeny was similarly constructed using Geneious, Clustal, and BLOSUM62 (ref. 86), and unrooted spidroin trees were built with PhyML and BLOSUM62 with bootstrap values based on 1,000 replicates (Supplementary Fig. 3 Spidroin gene repeat motif identification and analyses. All N. clavipes spidroins were translated into amino acid residues and then subjected to repeat motif identification using MEME and motif painting with MAST (v4.10) 46 .
Repetitive motifs were manually curated to remove low-quality hits and motifs occurring in N-and C-terminal domains (using a hard cutoff of 100 residues) and were cataloged as unique 'motif variants' ranging from 4 to 87 residues in length. Motif variants were then organized into 'motif groups' on the basis of residue content and sequence, following the rules listed in Supplementary  Table 13. Motifs that could not be informatively grouped were designated as 'unassigned' . The full catalog of motif variants was input in secondary rounds of motif searching with the custom pipeline Spider_pipeline.py, available at https://github.com/danich1/Spider-Pipeline (Figs. 2a,b and 3a-d, and Supplementary Fig. 7). Next, the pipeline was used to search for higherorder repetitive structures denoted as 'cassette variants' , defined as two to four adjacent motif occurrences that were enriched across spidroins. Cassette variants were organized into 'cassette groups' (Supplementary Table 14) and curated to remove cassette types that exhibited inter-motif gaps of >20 residues or that occurred in N-and C-terminal domains (Fig. 4a,b and Supplementary Table 8).
See the Supplementary Note for additional details.
Amino acid content analyses. To provide a background level of residue content for the N. clavipes coding genome, the proportion of each of the 20 different amino acids was computed for each of the 17,989 translated mRNA sequences from the gold gene set. The same was done for the 28 N. clavipes spidroin sequences (Supplementary Fig. 6). To test for significant enrichment of amino acid types, the distribution of each residue's proportions of non-spidroins was compared to those of spidroins using two-tailed unequal-variance Wilcoxon rank-sum tests. See the Supplementary Note for additional details.
Polymorphism levels across the N. clavipes genome. To quantify polymorphism in the N. clavipes genome, all fragment sequencing reads from a single individual (Nep-004) were remapped to the genome using BWA-MEM 90 and variant calling of SNPs and small indels was performed using SAMtools 91,92 .
Variants were hard-filtered to include only SNPs meeting minimum quality (20) and depth (20) thresholds and then subdivided into 14 categories (genome, noncoding, genes (gold set), CDS, mRNAs, 3′ UTRs, 5′ UTRs, exons, introns, gold N termini, gold C termini, spidroin genes, spidroin N termini, spidroin C termini) using VCFtools 93 . SNPs were counted for each category, and polymorphic levels were assessed on the basis of heterozygosity, number of segregating sites, SNP rate, and Watterson's estimator of theta (θ W ) 94 . The distribution of θ W values of non-spidroins was compared to that of spidroins using the Wilcoxon rank-sum test. See the Supplementary Note for additional details.
Expression and alternative splicing analyses. To compare the relative abundance of N. clavipes gene models across the 16 different tissue isolates, reads from each RNA library were aligned to the final N. clavipes annotated metaassembly using STAR (v2.4.2a). Next, the PORT v0.7.3 expression pipeline (https://github.com/itmat/Normalization) was applied to normalize and quantify the RNA-seq data between the libraries.
To identify putative alternatively spliced transcripts that existed among the gold genes and spidroins, the PORT pipeline was run in exon/intron mode to quantify reads mapping to genomic features at the splice-junction level. Normalized counts of the split-RNA reads mapping across each junction were summarized at each locus of putative alternative splicing. Proportions of split reads mapping to different alternative junctions were then calculated for each tissue type (Fig. 5d).
See the Supplementary Note for additional details.
qPCR analysis. To test the relative expression of spidroin loci in discrete anatomical subsections, including specific silk gland types, qPCR analysis was performed with RNA transcripts isolated from three additional mature female N. clavipes individuals from Citrus County, Florida, USA. From the abdomen, silk glands were identified by relative position and morphology and then individually collected by severing their ducts near the spinnerets. From the cephalothorax, legs were collected, venom glands were collected after separation of the chelicerae from the cephalothorax, and the remaining cephalothorax tissue was retained as the 'head' sample. In total, each specimen was microdissected into 9 tissue subsections-venom glands, head (with no venom glands), legs, major ampullate silk gland (MA), minor ampullate silk gland (MI), flagelliform silk gland (FL), aggregate silk gland (AG), tubuliform silk glands, and 'other silk glands' (OTHER: piriform and aciniform glands, attached to the spinneret), yielding 27 experimental samples in total (Supplementary Table 1). RNA was extracted using TRIzol (Ambion, Life Technologies) and RNeasy Mini kit spin columns (Qiagen), and additional cleanup was performed using the RNA Clean & Concentrator-5 kit with DNase I treatment (Zymo Research). Small aliquots (~5 µl) were used for quality control and quantification. cDNA was produced from each RNA sample with a High-Capacity cDNA Reverse Transcription kit (Life Technologies) and run alongside multiple and 'no reverse transcriptase' (NRT) negative controls. Primers were designed to target 30 loci (all 28 spidroins, 1 venom locus (CRiSP/Allergen/PR-1) 55 , and 1 housekeeping gene (RPL13a) 95 ), as well as 22 genomic scaffold controls for all single-exon spidroin genes (Supplementary Table 11). qPCR reactions were set up in triplicate using standard SYBR Green PCR Master Mix (Life Technologies) and run on a ViiA 7 Real-Time PCR machine. Relative transcript abundance of targets in silk and venom gland samples was normalized to that of leg tissue samples and calculated using the 2 −∆∆C T method 96 (Fig. 5a-c and Supplementary Figs. 10 and 11). Co-expression scores were calculated using Pearson correlation of relative expression values ( 2 −∆∆C T ) for each pair of genes and plotted using single-linkage hierarchical clustering (Fig. 5b). See the Supplementary Note for additional details and the Supplementary Data for the complete listing of relative expression values (2 −∆∆C T ) for all replicates.
Identification of non-spidroin silk gland-specific transcripts. The normalized expression data set of 14,025 'gold' genes was filtered to identify putative SST loci that could be categorized as (i) 'HighInSilk' , with >1,000 absolute normalized mapped RNA reads in ≥1 silk gland and <200 reads in non-silk tissues; (ii) 'ExclusiveToSilk' , with >100 absolute normalized mapped RNA reads in ≥1 silk gland and zero reads in non-silk tissues; (iii) 'GlandEnriched' , with >400 absolute normalized mapped reads in only a single silk gland and <350 in all non-silk tissues; and (iv) 'Literature' , corresponding to the BLASTP 84 homologs (e value: ≤ 1 × 10 −6 ) of 282 unique SSTs from studies of spider silk gland transcriptomes and proteomes 50-54 plus peroxidase or anhydrase gene family members hypothesized to be involved in silk production 50,54 (Supplementary Fig. 12 and Supplementary Table 15).
See the Supplementary Note for additional details.