How an insect evolves to become a successful herbivore is of profound biological and practical importance. Herbivores are often adapted to feed on a specific group of evolutionarily and biochemically related host plants1, but the genetic and molecular bases for adaptation to plant defense compounds remain poorly understood2. We report the first whole-genome sequence of a basal lepidopteran species, Plutella xylostella, which contains 18,071 protein-coding and 1,412 unique genes with an expansion of gene families associated with perception and the detoxification of plant defense compounds. A recent expansion of retrotransposons near detoxification-related genes and a wider system used in the metabolism of plant defense compounds are shown to also be involved in the development of insecticide resistance. This work shows the genetic and molecular bases for the evolutionary success of this worldwide herbivore and offers wider insights into insect adaptation to plant feeding, as well as opening avenues for more sustainable pest management.
The global pest P. xylostella (Lepidoptera: Yponomeutidae) is thought to have coevolved with the crucifer plant family3 (Supplementary Fig. 1) and has become the most destructive pest of economically important food crops, including rapeseed, cauliflower and cabbage4. Recently, the total cost of damage and management worldwide was estimated at $4–5 billion per annum5,6. This insect is the first species to have evolved resistance to dichlorodiphenyltrichloroethane (DDT) in the 1950s7 and to Bacillus thuringiensis (Bt) toxins in the 1990s8 and has developed resistance to all classes of insecticide, making it increasingly difficult to control9,10. P. xylostella provides an exceptional system for understanding the genetic and molecular bases of how insect herbivores cope with the broad range of plant defenses and chemicals encountered in the environment (Supplementary Fig. 2).
We used a P. xylostella strain (Fuzhou-S) collected from a field in Fuzhou in southeastern China (26.08 °N, 119.28 °E) for sequencing (Supplementary Fig. 1). Whole-genome shotgun–based Illumina sequencing of single individuals (Supplementary Table 1), even after ten generations of laboratory inbreeding, resulted in a poor initial assembly (N50 = 2.4 kb), owing to high levels of heterozygosity (Supplementary Figs. 3 and 4 and Supplementary Table 2). Subsequently, we sequenced 100,800 fosmid clones (comprising ∼10× the genome length) to a depth of 200× (Supplementary Fig. 5 and Supplementary Tables 3–5), assembling the resulting sequence data into 1,819 scaffolds, with an N50 of 737 kb, spanning ∼394 Mb of the genome sequence (version 1; Supplementary Fig. 6 and Supplementary Table 6). The assembly covered 85.5% of a set of protein-coding ESTs (Supplementary Tables 7 and 8) generated by transcriptome sequencing11. Alignment of a subject scaffold against a 126-kb BAC (GenBank GU058050) from an alternative strain (Geneva 88) showed extensive structural variations between haplotypes. However, the coding sequence of the nicotinic acetylcholine receptor α6 gene (spanning >75 kb)12 on the BAC and the subject scaffold was relatively conserved (Supplementary Fig. 7). Whole-genome shotgun reads from three libraries (500 bp, 5 kb and 10 kb) were mapped to the BAC and corresponding scaffold, covering 86.7% and 98.1% of sites, respectively (Supplementary Fig. 7), indicating high polymorphism levels between the alleles. Genome-wide exploration of variation identified abundant SNPs, insertions and/or deletions (indels), structural variations and complex segmental duplication patterns within the sequenced population of the Fuzhou-S strain (Fig. 1, Supplementary Figs. 8 and 9, Supplementary Tables 9–13 and Supplementary Note). 
Thus, we generated a genome of ∼343 Mb (version 2) for annotation and analysis by masking ∼50 Mb of possible allelic redundancy in the version 1 assembly (Supplementary Fig. 10, Supplementary Table 14 and Supplementary Note).
The P. xylostella genome is predicted to contain 18,071 protein-coding genes (Supplementary Fig. 11 and Supplementary Tables 15–18) and 781 non-coding RNAs (Supplementary Table 19), with 33.97% of the genome made up of repetitive sequences (Supplementary Fig. 12, Supplementary Table 20 and Supplementary Note). Compared with the genomes of other sequenced insect species, the P. xylostella genome possesses a relatively larger set of genes and a moderate number of gene families (Supplementary Table 21), suggesting the expansion of certain gene families. In addition to 1,683 Lepidoptera-specific genes (Supplementary Table 22 and Supplementary Note), we found 1,412 P. xylostella–specific genes (Supplementary Fig. 13), exceeding in number the 463 Bombyx mori–specific genes13 and the 1,184 Danaus plexippus–specific genes14 (Fig. 2). The P. xylostella–specific genes were largely involved in biological pathways essential for environmental information processing, chromosomal replication and/or repair, transcriptional regulation and carbohydrate and protein metabolism (Supplementary Fig. 14 and Supplementary Table 23). These findings suggest that P. xylostella has an intrinsic capacity to swiftly respond to environmental stress and genetic damage.
Phylogenetic analysis indicated that the estimated divergence time of insect orders was approximately 265–332 million years ago (Fig. 2). This is around the time of the divergence of mono- and dicotyledonous plants (∼304 million years ago)15, consistent with the coevolution and concurrent diversification of insect herbivores and their host plants. It can be predicted that P. xylostella became a cruciferous specialist when Cruciferae diverged from Caricaceae (∼54–90 million years ago)16. This estimated time provides additional evidence to support our estimation of the divergence time (∼124 million years ago) of P. xylostella from two other Lepidoptera, B. mori and D. plexippus (Fig. 2). The genome-based phylogeny showed that P. xylostella is a basal lepidopteran species (Fig. 2), and this idea is well supported by its modal karyotype of n = 31 (refs. 17,18) and the molecular phylogeny of Lepidoptera19,20, indicating the importance of P. xylostella in the history of lepidopteran evolution.
On the basis of P. xylostella transcriptome data11, we identified 354 preferentially expressed genes in larvae (Supplementary Fig. 15), and a set of these genes is involved in sulfate metabolism, some of which were validated using quantitative RT-PCR for gene expression analysis (Supplementary Figs. 16–18, Supplementary Table 24 and Supplementary Note). Glucosinolate sulfatases (GSSs) enable P. xylostella to feed on a broad range of cruciferous plants by catalyzing the conversion of glucosinolate defense compounds into desulfoglucosinolates, thus preventing the formation of toxic hydrolysis products3 (Supplementary Fig. 2). In order to function, all sulfatases require post-translational modification by sulfatase-modifying factor 1 (encoded by SUMF1)21, and higher sulfatase activity depends on greater amounts of both sulfatase and SUMF1 transcripts22. We found that high expression of P. xylostella SUMF1 in third-instar larvae was coupled with significantly higher expression of the two GSS genes relative to other members of the P. xylostella sulfatase gene family (Fig. 3). We propose that the coevolution of SUMF1 and GSS genes was key in P. xylostella becoming such a successful herbivore of cruciferous plants (Supplementary Fig. 2). Furthermore, a new gene, predicted to be a sodium-independent sulfate anion transporter, was highly expressed in all larval stages and in the midgut (Fig. 4) and is likely associated with the excretion of toxic sulfates23.
In comparisons with the larval midgut proteome of the polyphagous lepidopteran Helicoverpa armigera24, we found similar digestive enzymes encoded by P. xylostella larval preferentially expressed genes that were expressed predominantly in the midgut (Supplementary Fig. 19 and Supplementary Table 25). The abundant larval midgut-specific serine proteinase genes in the P. xylostella genome may circumvent the action of insecticidal plant protease inhibitors through differential expression in response to different plant hosts25 (Supplementary Fig. 20). Among the P. xylostella larval preferentially expressed genes, we identified a set of genes, including GOX (encoding glucose oxidase), related to the host range of herbivores26 and involved in the perception of chemical signals from host plants and defense against secondary plant compounds (Fig. 4, Supplementary Table 25 and Supplementary Note), suggesting the presence of a complex chemoreception network and multiple detoxification mechanisms.
We identified five chemoreception gene families related to larval feeding preferences and adult searching for host plants: odorant receptors (ORs), odorant-binding proteins (OBPs), gustatory receptors (GRs), ionotropic receptors (IRs) and chemosensory proteins (CSPs) (Supplementary Fig. 21, Supplementary Table 26 and Supplementary Note). Notable among these genes is an expansion of ORs but not GRs, as reported in the B. mori genome27. Species-specific expansion of CSPs in moths is less than that observed in butterflies18. Lifecycle- and tissue-specific expression of ORs identified 30 variable, 23 constitutive and 9 adult-specific expression patterns (Supplementary Fig. 22), indicating that P. xylostella possesses a high potential for adaptation to chemical cues from host plants (Supplementary Fig. 2).
Detoxification pathways used by insect herbivores against plant defense compounds may be co-opted for insecticide tolerance28 or resistance (Supplementary Fig. 2). We found that P. xylostella possessed an overall larger set of insecticide resistance–related genes than B. mori, which is monophagous and has had little exposure to insecticide over 5,000 years of domestication13 (Supplementary Table 27). We identified in the P. xylostella genome apparent gene duplications of most ATP-binding cassette (ABC) transporter families and three classes of major metabolic enzymes, the P450 monooxygenases (P450s), glutathione S-transferases (GSTs) and carboxylesterases (COEs) (Supplementary Fig. 23 and Supplementary Table 26). These genes are known to have important roles in xenobiotic detoxification in insects29,30 (Supplementary Note). Among the four gene families, the ABC transporter gene family in P. xylostella is much more expanded than the corresponding family in B. mori (Fig. 5a). Larval transcriptomes were sequenced from the Fuzhou-S strain that was genotyped and from two substrains selected for resistance to chlorpyrifos or fipronil11. ABC transporter genes were upregulated more frequently than GSTs, COEs or P450s in insecticide-resistant larvae (Supplementary Fig. 24), highlighting the potential role of ABC transporters in detoxification.
We then investigated the genomic variations and transposable elements in genes and their 2-kb upstream regions in these four families, some of which were validated using Sanger sequencing (Supplementary Tables 28–31 and Supplementary Note). On average, transposable elements (∼20 per gene) were abundant, followed in frequency by structural variations (∼16), SNPs (∼6) and indels (<1), near these gene families (Supplementary Fig. 25). The coding sequences of COEs were rich in SNPs (Supplementary Fig. 25a), which can be critical in determining COE substrate specificity and catalytic activity under xenobiotic stresses31. Principal-component analysis indicated that intronic regions consistently harbored all types of variations, whereas structural variations and transposable elements frequently occurred in coding sequences, which may largely affect gene functions (Fig. 5b). Transposable elements were abundant within or near the P450s involved in induced xenobiotic detoxification in insects, whereas those related to constitutive developmental metabolism were free of transposable element insertions32. Our findings show that numerous transposable elements accompany the gene families involved in metabolic detoxification sensitive to external stresses (Supplementary Table 32). These associations seem to be a consistent trend in Lepidoptera (Supplementary Fig. 25b). The transposable element orders of long terminal repeat (LTR) and long interspersed nuclear element (LINE) were predominant in P. xylostella and B. mori, respectively, and the proportional composition of various transposable element orders tended to be similar in different gene families for each of the species (Fig. 5c). A recent expansion of the LTR retrotransposons (>90%) in the P. xylostella genome has occurred over the past 2 million years, occurring much later than the expansion of B. mori LTRs (Fig. 5d) and possibly reflecting the timing of extensive adaptive evolutionary events in P. xylostella33. 
The polymorphism within the P. xylostella genome might support adaptation to host plant defenses and insecticides by providing a repertoire of alternative alleles or cis-regulatory elements29 and genetic variations34 for gene expression.
In this project, we developed a new approach for non-model insect genome sequencing using next-generation sequencing technology and de novo assembly of the highly polymorphic genome. Analyses identify complex patterns of heterozygosity, the expansion of gene families associated with perception and the detoxification of plant defense compounds and the recent expansion of retrotransposons near detoxification genes. These adaptations reflect the diversity and ubiquity of toxins in its host plants and underlie the capacity of P. xylostella to rapidly develop insecticide resistance. This study provides insights into the genetic plasticity of P. xylostella that underlies its success as a worldwide herbivore. The genomic resources described here will facilitate future studies on the adaptation and evolution of other arthropods and support the incorporation of molecular information into the development of strategies for more sustainable agriculture.
FTP site, ftp://ftp.genomics.org.cn/pub/Plutellaxylostella/; LASTZ, http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00/README.lastz-1.02.00a.html; Infonet Biovision, http://www.infonet-biovision.org/; North American Moth Photographers Group, http://mothphotographersgroup.msstate.edu/MainMenu.shtml; Interactive Agricultural Ecological Atlas of Russia and Neighboring Countries, http://www.agroatlas.ru/; the diamondback moth (DBM) genome database, http://iae.fafu.edu.cn/DBM.
Strain for sequencing.
The Fuzhou-S strain of the diamondback moth (DBM), P. xylostella, was reared on radish seedlings without exposure to insecticides for 5 years, spanning at least 100 generations. An inbred line was developed by successive single-pair sibling matings. Male pupae were used for genome sequencing.
Whole-genome shotgun sequencing and assembly.
Individual DNA from the inbred F1, F4 and F10 insects was used for construction of paired-end libraries (Supplementary Table 1). Sequencing was performed using the Illumina Genome Analyzer IIx or HiSeq 2000 platform. Short reads were assembled using SOAPdenovo35.
Fosmid-to-fosmid sequencing and assembly.
DNA was extracted from a pool of ∼1,000 male pupae using a CTAB-based method. A fosmid library with insert sizes ranging from 35 to 40 kb was constructed. We sequenced 100,800 single colonies to achieve 10× coverage of the genome. For each colony, two paired-end libraries with 250-bp and 500-bp fragments were constructed and sequenced. On average, each library was sequenced >200× with a total of 114 lanes and an output of 855 Gb. Vector or contaminating DNA and poor reads with >10% unknown nucleotides or >40 bases with quality value of ≤5 were filtered out36.
We developed custom software (Rabbit) for assembling sequences with large overlaps (>2 kb). Rabbit contains three modules: Relation Finder, Overlapper and Redundancy Remover.
We used the Poisson-based K-mer model to determine repeat sequences, segmental duplications or divergent haplotypes. Each K-mer was defined as either a 'repeat' or 'unique' K-mer, depending on whether its occurrence frequency was greater or less than twice the average frequency, respectively (Supplementary Fig. 10), using the Poisson model

$$P(Y = y) = \frac{\lambda^{y} e^{-\lambda}}{y!}$$

where λ is the expected frequency for K-mers, y is the given frequency of a particular K-mer and P is the occurrence probability of a given K-mer frequency. Therefore, the probability of a unique K-mer being greater than twice the expected frequency is given by the following equation.

$$P(Y > 2\lambda) = 1 - \sum_{y=0}^{2\lambda} \frac{\lambda^{y} e^{-\lambda}}{y!}$$
Few unique K-mers can occur with a frequency larger than twice the expected value, especially when the expected frequency is ≥20 (Supplementary Table 14). Rabbit is capable of connecting these unique regions and removing redundancy. We chose K = 17 bp36,37 and trimmed repeat sequence ends (Supplementary Fig. 4).
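The behavior of the Poisson tail probability can be checked numerically. The following is a minimal sketch, not the Rabbit implementation: the function names are hypothetical, and it assumes only that K-mer frequencies follow a Poisson distribution with mean λ and that the 'repeat' threshold is twice that mean.

```python
import math


def poisson_pmf(y: int, lam: float) -> float:
    """P(Y = y) for a Poisson variable with mean lam (lam > 0)."""
    # Work in log space so that large y does not overflow the factorial.
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))


def prob_exceeds_twice_expected(lam: float) -> float:
    """P(Y > 2 * lam): chance that a unique K-mer is called a 'repeat'."""
    threshold = math.floor(2 * lam)
    cumulative = sum(poisson_pmf(y, lam) for y in range(threshold + 1))
    # Guard against tiny negative values caused by floating-point rounding.
    return max(0.0, 1.0 - cumulative)


if __name__ == "__main__":
    # Illustrative expected frequencies (not values from the paper).
    for lam in (5, 10, 20, 40):
        print(f"lambda = {lam:>2}: P(Y > 2*lambda) = "
              f"{prob_exceeds_twice_expected(lam):.3e}")
```

The tail probability shrinks quickly as the expected frequency grows, which is consistent with the statement that few unique K-mers exceed twice the expected value when the expected frequency is ≥20.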
We used SSPACE38 to build scaffolds and SOAP-GapCloser35 to fill gaps with 131.2× whole-genome shotgun short reads (Supplementary Table 1). This resulted in an assembly of 394 Mb (version 1), slightly larger than the estimated haploid genome size (339.4 Mb)17. We extracted all similar sequences with LAST39, retained one copy of the sequences containing >40% unique K-mers and masked the others with 'n' to generate a revised genome of ∼343 Mb (version 2).
Digital gene expression (DGE).
Quantitative RNA-seq was conducted for newly laid eggs, fourth-instar larvae, the midguts of fourth-instar larvae, pupae (>2 d), virgin male and female adults, and the heads of fourth-instar larvae and male or female adults. Paired-end libraries (insert size of 200 bp) were sequenced with read length of 49 bp. The RPKM40 values were calculated for DGE profiling.
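For reference, RPKM (reads per kilobase of transcript per million mapped reads) normalizes a read count by gene length and sequencing depth. The sketch below uses the standard definition; the function name and the example numbers are illustrative and are not taken from the study.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = reads * 1e9 / (gene length in bp * total mapped reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Hypothetical gene: 500 reads on a 2-kb gene in a 10-million-read library.
    print(rpkm(500, 2_000, 10_000_000))
```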
Larval preferentially expressed gene analysis.
On the basis of the DBM genome and the transcriptomes for newly laid eggs, third-instar larvae, pupae and virgin adults, we analyzed differential gene expression across the four developmental stages using the statistical approach described previously11. The larval preferentially expressed genes were defined as genes that were highly expressed in the larval stage compared to the other three developmental stages, with an RPKM ratio of ≥8-fold (upregulated) and a false discovery rate (FDR) of ≤0.001.
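The selection rule can be sketched as a simple filter. This is an illustration under stated assumptions, not the authors' pipeline: it assumes the fold-change and FDR thresholds are applied to each pairwise comparison of larvae against the other three stages, that FDR values have already been computed elsewhere, and that a gene with zero expression in the comparison stage passes the fold-change test if it is expressed in larvae. The function and dictionary keys are hypothetical.

```python
from typing import Dict

OTHER_STAGES = ("egg", "pupa", "adult")


def is_larval_preferential(
    rpkm_by_stage: Dict[str, float],
    fdr_vs_stage: Dict[str, float],
    min_fold: float = 8.0,
    max_fdr: float = 0.001,
) -> bool:
    """True if larval RPKM is >= min_fold times every other stage at FDR <= max_fdr."""
    larva = rpkm_by_stage["larva"]
    for stage in OTHER_STAGES:
        other = rpkm_by_stage[stage]
        if other > 0:
            fold = larva / other
        else:
            # Assumption: no expression in the other stage counts as an
            # unbounded fold change, provided the gene is expressed in larvae.
            fold = float("inf") if larva > 0 else 0.0
        if fold < min_fold or fdr_vs_stage[stage] > max_fdr:
            return False
    return True


if __name__ == "__main__":
    # Made-up expression values for one gene.
    expr = {"larva": 80.0, "egg": 5.0, "pupa": 10.0, "adult": 2.0}
    fdr = {"egg": 1e-5, "pupa": 1e-5, "adult": 1e-5}
    print(is_larval_preferential(expr, fdr))
```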
We used Augustus (v 2.5.5)41, Genscan42 and SNAP43 for de novo gene prediction, compared the candidate genes to the transposable element protein database using BLASTP (E-value cutoff of 1 × 10⁻⁵) and removed genes that showed over 50% similarity to the transposable elements. The predicted proteomes of D. melanogaster, B. mori, Anopheles gambiae and Tribolium castaneum were aligned with the DBM genome using TBLASTN (E-value cutoff of 1 × 10⁻⁵). High-scoring segment pairs (HSPs) were grouped using Solar (v. 0.9.6)36. We extracted target gene fragments and extended them by 500 bp at both ends. GeneWise (v. 2.2.0)44 was used for the alignment of fragments to a protein set. We clustered the predicted genes with an overlap cutoff of >50 bp. The results of de novo and homolog-based predictions were incorporated into a gene set using GLEAN45.
Integration of transcriptome data with the GLEAN set.
Transcriptome reads11 were mapped onto the genome using TopHat46. We then used Cufflinks47 (with default parameters) to assemble transcripts and integrated the transcripts with the GLEAN set by filtering out redundancy and the genes with ≥10% uncertain bases and coding region lengths of ≤150 bp.
The integrated gene set was translated into amino-acid sequences, which were used to search the InterPro database48 by Iprscan (v 4.7)49. We used BLAST to search the metabolic pathway database50 (release58) in KEGG and homologs in the SwissProt and TrEMBL databases in UniProt51 (release 2011-01).
Annotation of repetitive sequences.
We used RepeatProteinMask and RepeatMasker (version 3.2.9) from Repbase (version 16.03)52 to search for transposable elements. We constructed a de novo repeat library using RepeatScout (v 1.0.5)53, Piler (v 1.0)54 and LTR_FINDER (v 1.0.5)55 and annotated the transposable element regions with RepeatMasker. Simple tandem repeats were annotated using TRF (v 4.04)56.
We used the shortest length standards for each transposable element order from Repbase (v 16.03)52 to filter the integrated results. To estimate the expansion time of LTRs in the P. xylostella and B. mori genomes, we investigated the LTRs using LTR_STRUC57. Both 5′ and 3′ LTR regions of the LTR retrotransposons were extracted and aligned to each other using MUSCLE58. Distmat from EMBOSS59 was used to calculate the times since the divergence of the 5′ and 3′ LTRs.
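The two LTRs of a retrotransposon are identical at the time of insertion, so their divergence can be converted to an approximate insertion time. The sketch below shows one common way to do this and is not confirmed as the procedure used here: it assumes a Jukes-Cantor correction of the pairwise distance and the relation T = K / (2r), and the substitution rate r in the example is a purely hypothetical placeholder because no rate is stated in this section.

```python
import math


def p_distance(ltr5: str, ltr3: str) -> float:
    """Proportion of mismatches between two aligned LTRs, ignoring gaps and Ns."""
    if len(ltr5) != len(ltr3):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (a, b)
        for a, b in zip(ltr5.upper(), ltr3.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable aligned positions")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance K = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def insertion_time_years(k: float, rate_per_site_per_year: float) -> float:
    """T = K / (2r): both LTR copies accumulate substitutions independently."""
    return k / (2.0 * rate_per_site_per_year)


if __name__ == "__main__":
    # Toy aligned LTR pair (made-up sequences), one mismatch in ten sites.
    k = jukes_cantor(p_distance("ACGTACGTAC", "ACGTACGTAT"))
    # HYPOTHETICAL rate, chosen only to make the example runnable.
    print(insertion_time_years(k, rate_per_site_per_year=1e-8))
```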
Annotation of non-coding RNA.
We used tRNAscan-SE (v 1.23)60 to search for tRNA-coding sequences. Invertebrate rRNA from the database61 was used to predict DBM rRNA sequences. Rfam62 (v 9.1) was used in conjunction with INFERNAL63 to predict small nuclear RNAs (snRNAs) and microRNAs (miRNAs).
Gene family construction.
The predicted proteomes in the DBM genome and those from the genomes of 11 insect species13,14,64,65,66,67,68,69,70,71 and 1 Arachnida outgroup species72 were compared using BLAST (E-value cutoff of 1 × 10⁻⁷). The fragmental alignments of HSPs were joined using Solar36. Clustering was performed to generate gene families using hcluster_sg73. The species-specific genes are those for which we could not find orthologs in the predicted gene repertoires of the compared genomes.
We used phase 1 nucleotides of single-copy genes from different genomes and MCMCTREE from PAML74 to estimate the divergence time of DBM. Sampling was replicated 100,000 times with a sampling frequency of 2, after the first 10,000 iterations were discarded as burn-in.
Linkage mapping of scaffolds.
RADseq data generated from a cross between DBM strains Pearl-Sel and Geneva88 (ref. 17) were used. Read mapping for each individual was performed using Stampy (v. 1.0.13)75. Polymorphisms were called using the UnifiedGenotyper (v. 1.3-21)76. A custom PERL script identified segregating polymorphic patterns. A genotype file formatted for JoinMap (v. 3.0)77 was produced. Scaffolds were assigned onto corresponding linkage groups on the basis of the alignment result with the RAD alleles (Supplementary Table 9).
Comparison of genomic synteny.
We fragmented the fosmid sequences in silico into 100-bp single-end reads or paired-end reads (insert size of 500 bp). We used SOAPaligner/soap235 to map the reads onto reference sequences and SOAPsnp79 and SOAPIndel35 to annotate SNPs and indels, respectively (with acceptable depths ranging from 3 to 30). On the basis of the sequencing of a single Fuzhou-S individual (Supplementary Table 1, SI), SOAPsv80 was employed for annotating structural variations. We performed whole-genome alignment comparison using LASTZ. The regions that were ≥1 kb with identity of ≥90% were regarded as segmental duplications.
Annotation of genes concerned.
On the basis of available protein sets (Supplementary Table 26) and the predicted proteomes of P. xylostella, B. mori and D. melanogaster, BLASTP was used to search for the homologs in each of the three genomes. We applied cutoffs of 1 × 10⁻²⁰ for the E value, 100 for the bit score and 100 continuous amino acids for coverage in gapped alignments. We filtered out the results with total coverage of alignment of <70% for the same species and <40% for different species. We also used InterProScan81 to search for candidate genes on the basis of conserved motifs from InterPro48. The candidates were manually checked against the Conserved Domain Database82 in NCBI to validate the gene searching results and confirm that the method used in our DBM genome was as effective and reliable as the methods used in other insect genomes.
We randomly selected 20 each of annotated SNPs, structural variations (≥50 bp and ≤200 bp) and transposable elements (≥300 bp and ≤600 bp) within or around the metabolic detoxification genes. PCR primer sets were designed for each of them to amplify an 800-bp region (Supplementary Table 31). Direct Sanger sequencing was performed for PCR products from both ends. Alignments between sequencing results and the reference genome were performed using BLAST or BLAT83.
Quantitative RT-PCR validation.
We used 20 genes for validation of host plant responsiveness, and another 20 genes to examine differential expression over the life cycle (Supplementary Table 24). We also used a B. thuringiensis strain containing CryIIAd (GenBank DQ358053) to infect the DBM strain and determine the expression of genes involved in sulfate metabolism. Third-instar larvae were treated with CryIIAd (7.589 μg/ml) by the leaf-soaking method84, with double-distilled water as control or no food supply for starvation. Quantitative RT-PCR was performed, and relative gene expression was calculated using the 2^(−ΔΔCT) method85, with the ribosomal protein L32 gene (GenBank AB180441) serving as an internal reference. Each experiment was repeated three times.
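The relative-expression calculation can be illustrated as follows. This is a minimal sketch of the standard comparative threshold-cycle calculation with a single reference gene (here standing in for ribosomal protein L32); the function name and the CT values are made up for illustration and are not data from the study.

```python
def relative_expression(
    ct_target_treated: float,
    ct_reference_treated: float,
    ct_target_control: float,
    ct_reference_control: float,
) -> float:
    """Fold change of a target gene in treated versus control samples.

    dCT  = CT(target) - CT(reference), computed per sample
    ddCT = dCT(treated) - dCT(control)
    fold = 2 ** (-ddCT)
    """
    delta_treated = ct_target_treated - ct_reference_treated
    delta_control = ct_target_control - ct_reference_control
    delta_delta = delta_treated - delta_control
    return 2.0 ** (-delta_delta)


if __name__ == "__main__":
    # Hypothetical CT values: the target amplifies two cycles earlier
    # (relative to the reference gene) in treated larvae than in controls.
    print(relative_expression(22.0, 18.0, 24.0, 18.0))
```

In practice the calculation would be applied to the mean CT of the replicate reactions for each sample.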
The genome described herein is the first reference genome of P. xylostella, AHIO01000000. Genome assemblies and annotations described here have been deposited at the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL) and GenBank under accession AHIO00000000. Raw sequencing data from the transcriptome have been deposited at the NCBI Short Read Archive (SRA) under accession SRA034927.
This work was supported through a special project of Research on Diamondback Moth Genomics (grant JB09315) to M.Y. and a Minjiang Scholar Program to L.V., G.M.G., C.J.D. and S.M.S. by the Educational Department of Fujian Province and through a key project (grant 31230061) to M.Y. from the National Natural Science Foundation of China. Insect rearing and sampling, as well as some of the DNA extractions, were conducted at the Fujian Provincial Key Laboratory of Biodiversity and Eco-safety and the Key Laboratory of Integrated Pest Management for Fujian-Taiwan Crops, the Ministry of Agriculture, China. We are grateful to A.D. Briscoe (University of California–Irvine) for her help in organizing and for providing ORs, OBPs and CSPs from Danaus plexippus and Heliconius melpomene and to G.L. Lövei for his comments and suggestions on the manuscript. We appreciate J. Liao and M. Zou for providing the Bt-treated P. xylostella larvae used for quantitative gene expression analysis. We thank H. Wang, J. Luo, Y. Hong, S. Pan, L. Yang, Y. Weng, Y. Hong and Y. Liu for their technical assistance in rearing insects and preparing samples.
Supplementary Note, Supplementary Figures 1–25 and Supplementary Tables 1–32