The bagworm genome reveals a unique fibroin gene that provides high tensile strength

Arthropod silk is known as a versatile tool, and its variability makes it an attractive biomaterial. Eumeta variegata is a bagworm moth (Lepidoptera, Psychidae) that uses silk throughout all life stages. Notably, the bagworm-specific uses of silk include larval development in a bag coated with silk and plant materials and the use of silk attachments to hang pupae. An understanding at the molecular level of bagworm silk, which enables such unique purposes, is an opportunity to expand the possibilities for artificial biomaterial design. However, very little is known about the bagworm fibroin gene and the mechanical properties of bagworm silk. Here, we report the bagworm genome, including a silk fibroin gene. The genome is approximately 700 Mbp in size, and the newly found fibroin gene has a unique repetitive motif. Furthermore, a mechanical property test demonstrates a phylogenetic relationship between the unique motif and tensile strength of bagworm silk.

S ilk is a natural protein fiber and a versatile biomaterial widely observed in arthropods. The ability to produce silk has independently evolved in different clades in arthropods, mainly including the classes Insecta and Arachnida 1,2 . Silk is broadly used throughout species-specific situations, from reproduction, such as cocoon and egg sac formation or mating rituals to foraging, such as prey capturing, jumping, ballooning, or for shelter 1,3,4 . Such outstanding various properties and broad versatility are representative of the diversity in evolutionary and ecological adaptations of these clades of organisms.
Bagworm (Lepidoptera: Psychidae) is one of the insects that uses silken thread throughout its life stages from larvae to adult. Although the silkworm (Lepidoptera: Bombycidae) and the saturniid (Lepidoptera: Saturniidae) are closely related species, the bagworm has a slight difference in the use of silk. The bagworm family (Psychidae) includes over 1,000 species, and all of their larval development is conducted within a self-enclosing bag 5 . This bag is different from a cocoon made only of silk because it is combined with plant materials, and it also works as armor against invertebrate predators 6,7 . Remarkably, although the males are fully winged in this family, the females are categorized as a winged, vestigial, or vermiform type 8 . Since the adult females in over half of species are apterous, they remain in their protective bag, and the copulation is also performed on or within the bag 5 . Furthermore, the silken thread is used for dispersal by ballooning in the freshly hatched larvae or as an attachment with a high pulloff force 9 to substrate for suspending the bag on the leaf to avoid predation during pupation 6 . These mechanical and functional properties unique to the bagworms make them an ideal case for comparative analysis, especially in light of their phylogenetic neighbors silkworms and saturniids. However, little study about the bagworm silk fibroin gene or its mechanical properties has been performed, even within the superfamily Tineoidea. It is extremely difficult to characterize novel fibroin gene sequences, because the fibroin genes are generally very long (exceeding 10,000 bp) and are almost entirely comprised of repetitive sequences, and PCR amplification or even reverse transcription of such sequence structure often results in chimeric artefacts. Hence, the finding of such gene required a high quality genome assembled with a long read sequencing of the unamplified single molecule DNA.
Therefore, here we present the first draft genome of Eumeta variegata (Lepidoptera, Psychidae) assembled purely from nanopore sequencing reads. E. variegata is the largest and one of the most common bagworms living in Japan. This is also the first report about the bagworm silk fibroin gene. On the basis of the first draft genome, we demonstrate the relationship between the bagworm fibroin gene and silk properties with a comparative analysis of the genomic phylogeny, fibroin gene architecture, and silk mechanical properties among closely related moths.

Results
Draft genome of the bagworm moth (Eumeta variegata). We report the draft genome of the bagworm moth constructed from the genomic DNA of E. variegate larvae. The draft genome was assembled using nanopore reads only and was subsequently polished using Illumina reads. Nanopore technology was used to sequence approximately 6.56 Gbp in 1.72 million long reads with an N50 length of 15 kbp (Supplementary Table 1 Table 2). The genome assembly quality was assessed by BUSCO v. 2, and the complete BUSCO score was 94.09%, with low redundancy rate 1.8% (Table 1). Since the gDNA-and cDNA-seq mapping rates were 90.0% for Illumina gDNA reads properly mapped in pair and 88.2% for nanopore gDNA reads, and 83.8% cDNA-seq reads (Table 1), we consider the genome assembly to be comprehensive.
The gene content within the E. variegata genome was analyzed by cDNA sequencing. The cDNA was prepared from mRNA samples obtained from two whole bagworm larvae (Supplementary Table 3). A total of 65.16 million 150 bp paired-end reads were sequenced. Based on a gene model constructed using cDNA sequencing data, 104,026 open reading frames (ORFs) without isoforms were predicted, and up to 36,444 proteincoding genes were estimated according to the expression profile and functional annotation ( Table 2). The quality assessment of gene prediction was performed by BUSCO v. 2, and the test showed 90.62% (complete + partial) using an Arthropoda gene set (Table 2).
Transcriptome annotation. Predicted gene sets were annotated according to NCBI non-redundant (nr) database and Pfam-A using BLAST and HMMER searches. As a result, 10,431 genes were annotated with an E-value cutoff of 1.0 e −5 . Among these annotated genes, the predominant positive matches were from Arthropoda, particularly family Noctuidae (18.55%), family  Papilionidae (15.62%), family Bombycidae (9.38%), family Nymphalidae (8.70%), and family Pyralidae (7.43%), and they are closely related families belonging to the order Lepidoptera ( Fig. 1a, Supplementary Data 1). We performed further analysis of these genes using Gene Ontology (GO). In this E. variegata genome, the GO slim terms in biological process, cellular component, and molecular function were associated with 11,569 genes, 40,936 genes, and 25,944 genes, respectively (Fig. 1b). Among biological process, transmembrane transport (11.30%) was the most abundant. Regarding cellular component, nucleus (21.10%) and cytoplasm (18.40%) were the most abundant. In the detailed terms of molecular function, DNA binding (22.48%) was the most abundant ( Fig. 1a, b). Furthermore, predicted genes were annotated by KEGG Orthology (KO), and the biological activity was analyzed using Kyoto Encyclopedia of Genes and Genomes (KEGG) Automatic Annotation Server (KAAS) v. 2.1 with default parameters. As a result, 12,874 E. variegata genes were assigned to 398 different KEGG pathways ( Table 3).
The fibroin gene in bagworms. A part of the silk fibroin gene was found by a similarity search with previously reported silkworm silk genes (Samia ricini: AB971865). In general, the structure of the silk fibroin gene has a very repetitive domain 10 . The bagworm silk fibroin gene was no exception, and 8.2 kbp Nterminal and 10.2 kbp C-terminal regions including highly consistent repetitive motifs were obtained by computational and manual assembly (Fig. 2a). Furthermore, the direct-DNA sequencing of genomic DNA suggested that the bagworm fibroin gene is extremely long, over 18.69 kbp in length.
Although this structural feature of the fibroin gene is shared and N-terminal regions are conserved among other lepidopterans, Cterminal regions and repetitive motif varied (Fig. 2b). A polyalanine [A n ] motif in the silkworm (family: Bombycidae) or an alternating glycine-alanine [(GA) n ] motif in the saturniid (family: Saturniidae) form a β-sheet structure and are well known as the typical repetitive motif. Interestingly, the bagworm fibroin gene combines both motifs in their repetitive domain (Figs. 2 and 3) 11 .
Comparison of fibroin genes on the phylogenetic tree. To compare the bagworm fibroin gene phylogenetically, we constructed a phylogenetic tree of the order Lepidoptera (Fig. 3). This phylogenetic tree included 45 species from 33 families and was constructed based on 465 ortholog gene set 12 . The cricket (Gryllus texensis) in the order Orthoptera was used as the out group. According to a previous study, the bagworm (family: Psychidae) belongs to the early-originating clade Ditrysia and is located far from the clade Macroheterocera, which includes Bombyx mori (family: Bombycidae) or Antheraea yamamai (family: Saturniidae) 13 . The same phylogenetic relationship was confirmed using the constructed phylogenetic tree. Heatmap panels of amino acid frequency in the repetitive domain are drawn next to the species harboring a known fibroin gene ( Fig. 3; Supplementary Data 2 and 3). As described above, the bagworm fibroin gene has [A n ] and [(GA) n ] motifs in the same repetitive domain. Such motif features were also shown in the heatmap. Additionally, bagworm fibroin has a GGYG motif in their repetitive domain, and tyrosine content was 2%. The tyrosine is widely observed in silkworm or spider silk, and the a b  amino acid functions in modulating intermolecular selfassembly or in preserving the silk 14,15 .
Mechanical properties of silks. The repetitive motif in bagworm fibroin contained a characteristic motif combining the properties of both the silkworm and the saturniid. Therefore, we next compared the mechanical properties of the silks from these organisms. Bagworm silks used for the mechanical property test were directly collected from bagworm larvae. In this study, the following parameters were measured as the mechanical properties: tensile strength (MPa), extensibility (mm/mm), Young's modulus (GPa), and toughness (MJ/m 3 ). These physical properties of silk were characterized by mechanical testing and are summarized in Fig. 4a Table 4; Supplementary Data 4). Each mechanical property score of silkworm and saturniid was used from a previous study 11 . Upon comparison, the bagworm silk showed a significantly higher tensile strength than the Japanese silkworm and all saturniids. On the other hand, the extensibility of the bagworm silk was significantly less than that of the A. yamamai. The scanning electron microscope image and the wide-angle X-ray scattering (WAXS) pattern of bagworm silk fibers are represented in Fig. 4b, c. The degree of crystallinity was 13.4%. The WAXS result indicates that beta-sheet structure is predominant in bagworm silk, which is similar to silkworm and spider silks 11,16 . Based on the intensity and spot sizes, the orientation and amount of beta-sheet crystals did not differ significantly.

Discussion
In order to obtain the bagworm silk gene sequence, we first prepared the high quality draft genome information. 36,444 genes were computationally predicted from our E. variegata genome, based on similarity or conservative measure of gene expression (TPM > 0.1) as evidence. In comparison with two closely related insect genomes (B. mori has 22,510 genes and Plutella xylostella has 21,674 genes), the predicted gene number is higher. The BLAST search demonstrated that approximately 18,000 genes were conserved among them, and the bagworm genome presumably contains a large number of possible prediction artifacts that did not hit other genomes ( Supplementary Fig. 1). Genes without hits in other genomes tended to have significantly lower expression levels (average being TPM <1). Therefore, the gene  number deviation from other insects may be caused by gene prediction artifacts, but we intend to present a gene set with minimal false negatives, so that the community can curate upon this data. Several partial sequences of the silk fibroin genes have been previously reported in family Pyralidae, family Yponomeutidae, family Papilionidae, or family Geometridae [17][18][19][20][21][22] . On the other hand, the complete sequences have been found only in family Bombycoidea, which includes silkworms and saturniids 23,24 . The complete sequencing of a silk fibroin gene is a great challenge because fibroin genes are generally very large (approximately 10 kbp) and are almost entirely comprised of repetitive sequences between the N/C-terminus domains. Due to such repetitiveness, PCR amplification is not very suitable because it tends to be partial, incomplete or results in chimeric artifacts, but neither can a target capturing approach be applied to a novel gene finding. On the other hand, a well-assembled genome based on long unamplified reads does not rely on isolation or cloning processes. Our bagworm silk fibroin gene was constructed by nanopore read scaffolding, with each domain assembled using short accurate reads based on the draft genome. Although the obtained bagworm fibroin gene shows the typical fibroin structure, the 19 kbp gene size is the largest of the known lepidopteran silk fibroins. The spider silk fibroin (spidroin) gene is a well-investigated fibroin gene in Arthropoda, but among the various spidroins, there are few such large genes 25,26 . Moreover, the repetitive domain size is 500 bp on average, and it is longer than the read length of a general massively parallel sequencer. Analysis of such long repetitive gene was only possible with the nanopore sequencing-based assembly.
These findings on the bagworm silk fibroin gene represent the first report in the ancestral clade Ditrysia and enable the comparative analysis of silk fibroin genes in the order Lepidoptera. The bagworm fibroin gene is characterized not only by its extreme length but also by its unique combination of repetitive motifs (poly-alanine [(A) n ] and alternating glycine-alanine [(GA) n ]), sharing properties of both of the repetitive motifs from the silkworm and saturniid fibroin genes. This mixed motif correlates with the significantly higher tensile strength among moth silks (Fig. 4). The poly-alanine and alternating glycinealanine motifs constitute a β-sheet structure, and the crystal structure makes silk that is strong, rigid and tensile in exchange for its elasticity 27 . In fact, the extensibility of bagworm silk was significantly lower than A. yamamai silk in the absence of alternating glycine-alanine [(GA) n ] (Fig. 4). Since a high pull-off force is also reported in the tea bagworm (Eumeta minuscula) 9 , the biomechanics are considered to be common to the bagworm.  Furthermore, the poly-alanine and alternating glycine-alanine motifs forming the β-sheet structure are commonly utilized in at least two of the spider silk genes (spidroins), namely MaSp genes used in dragline silk and MiSp genes used in web auxiliary spiral threads, well characterized in Araneus ventricosus and Nephila clavipes [27][28][29][30] , but also widely conserved throughout spiders. However, the tensile strength of bagworm silk is remarkably lower than the MaSp and MiSp silks (approximately over 1 GPa) 27 . The MaSp constitutes the frame silk in the web 31 , and the MiSp is used for the auxiliary spiral that stabilizes the web 28 Fig. 4 a The box plots represent the mechanical properties for each Lepidoptera family (Psychidae, Bombycidae, and Saturniidae). For the family Bombycidae sample, three B. mori silks were prepared from a different origin (yellow). For the saturniids samples, S. ricini, A. yamamai, A. pernyi, and A. assama were selected (orange). In comparison with E. variegata silk (blue), the tensile strength was relatively higher than for other silks. Significantly high = *p < 0.05, **p < 0.01, low = †p < 0.05, † †p < 0.01; t test. b Scanning electron micrograph and c 2D WAXS pattern image of bagworm silk. Scale bar in b is 10 µm tensile strength of the bagworm silk threads used only to hang themselves, the spidroin silk is also used for foraging activity, and thus a much stronger property is required for spidroin silk. Therefore, in order to reinforce the silk, the spider may use not only the characteristic motif but also other elements. Meanwhile, even within Lepidoptera, whose members similarly use silk for protective cocoons, different mechanical properties have evolved in silkworms and saturniids as described above. Our genomic characterization of bagworms demonstrated the presence of a fibrohexamerin (P25), which is an additional small polypeptide that constitutes the core of lepidopteran silk 19 . Since P25 is a gene commonly present in Lepidoptera, except for the ghost moth (family Hepialidae) and family Saturniidae 32,33 , this finding suggests that the bagworm silk formation is phylogenetically closer to the silkworm than the saturniid 10,19,34 . However, Simpson's diversity of amino acids in the fibroin repetitive region was low in both bagworms and silkworms (Fig. 3), and it seemed that a purifying selection pressure was equally applied. Nevertheless, the reason why silks with different mechanical properties have appeared may be due to the presence of an amorphous region. The silkworm silk fibroin heavy chain is mainly composed of β-sheet crystallites and amorphous domains, and amorphous blocks account for approximately 10% of the repetitive domain with an average size of 1,237 bp. The bagworm silk fibroin does not have such a large amorphous block (Fig. 2). The hydrophilic amorphous block not only prevents drying of the silk during spinning but also plays a role of absorbing moisture in the environment 35 . In general, it is known that water acts as a plasticizer and has an impact on the mechanical property of silk fibroin; it reduces the Young's modulus 36 and brittleness and increases the flexibility and plasticity. Therefore, since the acquisition of amorphous block in silkworm to utilize the water realized characteristic mechanical property of silk not relying solely on the crystallites domain, various selection pressures have been applied within lepidopteran.

Methods
Bagworm sampling. Eumeta variegata samples were collected from Kanagawa and Chiba Prefecture, Japan (March, 2016). These specimens were initially identified based on morphological characteristics and further confirmed by the transcriptome assembly of cytochrome c oxidase subunit 1 (COI) based on bagworm moth COI sequences 8 . These samples were used for RNA, cDNA, and gDNA sequencing and for measurement of the silk's mechanical properties (Supplementary Table 3).
Genomic DNA extraction. Genomic DNA (gDNA) was extracted from the juvenile E. variegata whole body using Genomic-tip 20/G (QIAGEN) following the manufacturer's protocol. To keep the high molecular weight (HMW) quality, every step was performed as gently as possible. gDNA was extracted from about half of the frozen body. The specimen was homogenized using BioMasher II (Funakoshi) and mixed with 2 ml of Buffer G2 (QIAGEN) including 200 µg/ml RNase A. After the addition of 50 µL Proteinase K (20 mg/mL), the lysate was incubated at 50°C for up to 12 h on a shaker (300 r.p.m.). The lysate was centrifuged at 5,000×g for 5 min at 4°C to pellet the debris, and the aqueous phase was loaded onto a preequilibrated QIAGEN Genomic-tip 20/G (QIAGEN) by gravity flow. The QIAGEN Genomic-tip 20/G (QIAGEN) was then washed three times, and the DNA was eluted with a high-salt buffer (Buffer QF) (QIAGEN). The eluted DNA was desalted and concentrated by isopropanol precipitation and resuspended in 10 mM Tris-HCl (pH 8.0). Extracted gDNA was quantified using a Qubit Broad Range (BR) dsDNA assay (Life Technologies) and qualified using TapeStation 2200 with genomic DNA Screen Tape (Agilent Technologies).
Total RNA extraction. Total RNA was extracted using a transcriptome protocol for field samples, as described previously 37 . Flash-frozen bagworm specimens were immersed into 1 mL TRIzol Reagent (invitrogen) and homogenized with a metal cone using the Multi-Beads Shocker (Yasui Kikai). Extracted RNA was further purified using an RNeasy Plus Mini Kit (Qiagen) automated with QIACube (Qiagen). The quantity of purified total RNA was measured with a NanoDrop 2000 (Thermo Scientific) and a Qubit Broad Range (BR) RNA assay (Life Technologies), and the integrity was estimated by electrophoresis using TapeStation 2200 with RNA Screen Tape (Agilent Technologies).
Library preparation of cDNA and gDNA sequencing. The library for cDNA sequencing with an Illumina sequencer was constructed using a standard protocol from the NEBNext Ultra RNA Library Prep Kit for Illumina (New England Bio-Labs). Approximately 100 µg of total RNA was used for mRNA isolation by NEBNext Oligo d(T) 25 beads (skipping the second bead wash step). The first and second strands of cDNA were synthesized using ProtoScript II Reverse Transcriptase and NEBNext Second Strand Synthesis Enzyme Mix. Synthesized doublestranded cDNA was end-repaired using NEBNext End Prep Enzyme Mix and ligated with a NEBNext Adaptor for Illumina. After the USER enzyme treatment, cDNA was amplified by PCR with the following conditions (20 μL cDNA, 2.5 μL Index Primer, 2.5 μL Universal PCR Primer, 25 μL NEBNext Q5 Hot Start HiFi PCR Master Mix 2× ; 98°C for 30 s and 12 cycles each of 98°C for 10 s, 65°C for 75 s and 65°C for 5 min). The library for direct DNA sequencing with the MinION sequencer was constructed using purified HMW gDNA. The library was completed following 1D library protocol SQK-LSK108 (Oxford Nanopore Technologies). The DNA-Seq library for MiSeq was prepared using a Hyper Plus Kit (KAPA Biosystems).
Sequencing. cDNA sequencing was performed with a NextSeq 500 instrument (Illumina, Inc.) using a 150 bp paired-end read with a NextSeq 500 High Output Kit (300 cycles). Sequenced reads were assessed with FastQC (v0.10.1: http://www. bioinformatics.bbsrc.ac.uk/projects/fastqc/). Short-read sequencing of purified genomic DNA was conducted with a MiSeq instrument (Illumina, Inc.) using a 300 bp paired-end read with a MiSeq Reagent Kit v3 (600 cycles), and long read sequencing was performed by a GridION with v9.4.1 SpotON MinION flow cell. The data sets obtained from this study were deposited and are available at the DNA Data Bank of Japan (DDBJ: http://www.ddbj.nig.ac.jp/) Sequence Read Archive with Accession number PRJDB7096.
De novo genome assembly and error correction. The de novo assembly of the bagworm genome was performed using Canu 1.7 38 with the nanopore DNA-Seq reads. Assembled contigs were polished using the MiSeq reads and NextSeq cDNA reads with Pilon 39 once and twice, respectively. The assembly comprehensiveness and quality were estimated by the read mapping rate and BUSCO score 40 using gVolante 41 . The gDNA-seq reads obtained by nanopore or Illumina sequencer were mapped to the genome with BWA MEM (Burrows-Wheeler Alignment v0.7.12-r1039) or minimap2 version 2.15 (-k 14) 42 , and cDNA-seq read was mapped to the genome with HISAT2 version 2.1.0 43 and the mapping rate was calculated with SAMtools (v 1.3).
Gene prediction and annotation. The gene model generated by the cDNA-seq reads were mapping with HISAT2 version 2.1.0 43 and BRAKER2 version 2.1.0 44 was used for gene prediction. The annotation of predicted genes was obtained by a BLAST search using amino acid sequences against Swiss-Prot and by a HMMER version 3.1b2 45 search against Pfam-A 46 . The number of protein-coding genes was estimated using the intersection or union of transcript abundance and the functional annotations of Swiss-Prot 47 and Pfam. The tRNA and rRNA genes were also predicted with tRNAscan-SE version 1.3.1 48 and Barrnap (https://github.com/ tseemann/barrnap). Swiss-Prot search results were used for the associated GO term annotation (Gene Ontology database: http://www.geneontology.org) to map to a generic GO slim. The overview of gene pathway networks were calculated by the Kyoto Encyclopedia of Genes and Genomes (KEGG) 49 Automatic Annotation Server (KAAS) v 2.1 50 . Homologous genes were identified by bidirectional best hits (BBH) using BLAST search with 1.0e-20 threshold.
Mechanical properties of bagworm silk. Each mechanical property score from silkworms and saturniids was used from a previous study 11 . Using the same method, at least five individual tensile deformation tests were performed for each bagworm silk. The mechanical properties of the bagworm silk fibers were characterized by using an EZ-LX universal tester (Shimadzu, Kyoto, Japan) with a 1 N load cell at 25°C and 48% relative humidity. The strain rate was 10 mm/min (0.033 s-1). For each bagworm silk sample, the cross-sectional area of an adjacent section of the fiber was measured to calculate tensile strength, breaking strain, and Young's modulus. Toughness values were derived from the area under the stressstrain curves.
Wide-angle X-ray scattering (WAXS) measurement. To characterize the crystalline state of the samples, synchrotron WAXS were measured at the BL45XU beamline of SPring-8, Harima, Japan, according to a previous report 54 . The X-ray energy was 12.4 keV at a wavelength of 0.1 nm. Sample-to-detector distance for the WAXS measurements was approximately 257 mm. Exposure time for each diffraction pattern was 10 s. The resultant data were converted into one-dimensional radial integration profiles using the software Fit2D 55 . The resultant data were corrected by subtracting the background scattering. The degree of crystallinity was evaluated from the area of the crystal peaks divided by the total area of the crystal peaks and the amorphous halo by fitting the Gaussian function using Igor Pro 6.3.
Computational analysis and statistics. All computational data curation, treatment, and basic analyses were performed using Perl custom scripts with the Glanguage Genome Analysis Environment v. 1.9.1 56 . Statistical analyses were implemented using R (v 3.2.1). Sequence logo was constructed by WebLogo 3 57 . Protein sequences of B. mori (ASM15162v1) and P. xylostella (DBM_FJ_V1) were obtained from NCBI.
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The DNA sequence reads are available at the DNA DataBank of Japan with accession numbers DRR138623-DRR138626. The final DNA sequence assembly is available at the DDBJ with accession numbers BGZK01000001-BGZK01012720.