Soil-transmitted nematodes, including the Strongyloides genus, cause one of the most prevalent neglected tropical diseases. Here we compare the genomes of four Strongyloides species, including the human pathogen Strongyloides stercoralis, and their close relatives that are facultatively parasitic (Parastrongyloides trichosuri) and free-living (Rhabditophanes sp. KR3021). A significant paralogous expansion of key gene families—families encoding astacin-like and SCP/TAPS proteins—is associated with the evolution of parasitism in this clade. Exploiting the unique Strongyloides life cycle, we compare the transcriptomes of the parasitic and free-living stages and find that these same gene families are upregulated in the parasitic stages, underscoring their role in nematode parasitism.
More than 1 billion people are infected with intestinal nematodes1,2. The World Health Organization has classified infections with soil-transmitted nematodes as one of the 17 most neglected tropical diseases and estimates that worldwide these infections cause an annual disease burden of 5 million years lost due to disability (YLD), greater than the annual disease burdens of malaria (4 million YLD) and HIV/AIDS (4.5 million YLD). Parasitic nematode infections can impair physical and educational development1.
Strongyloides species are soil-transmitted gastrointestinal parasitic nematodes infecting a wide range of vertebrates3. Two species—S. stercoralis and Strongyloides fuelleborni—infect some 100–200 million people worldwide4,5. Other Strongyloides species infect livestock, such as Strongyloides papillosus that infects sheep.
Strongyloides species are from a clade of nematodes6,7,8 that includes taxa with diverse lifestyles, including a free-living lifestyle (Rhabditophanes), parasitism of invertebrates, facultative parasitism of vertebrates (Parastrongyloides) and obligate parasitism of vertebrates (Strongyloides)6,7. Nematodes have independently evolved parasitism of animals several times9, and thus understanding the genomic adaptations to parasitism in one clade will help in understanding how parasitism has evolved across the phylum more widely.
The Strongyloides life cycle alternates between free-living and parasitic generations. The female-only, parthenogenetic10 parasitic stage lives in the small intestine of its host where it produces offspring that develop outside of the host, either directly into infective third-stage larvae (iL3s) or into a dioecious, sexually reproducing adult generation11 whose progeny are iL3s. iL3s penetrate the skin of a host and migrate to its gut12, where they develop into parasitic adults (Fig. 1). Therefore, this life cycle has two genetically identical adult female stages—one obligate and parasitic and one facultative and free-living; we have compared these stages at the transcriptome and proteome levels to identify the genes and gene products specifically present in the parasitic stage. The closely related genus Parastrongyloides3,13 is similar to Strongyloides species, except that its parasitic generation is dioecious and sexually reproducing and that it can have apparently unlimited cycles of its free-living adult generation3,14 (Fig. 1).
Here we report the genome sequences for six nematodes from one clade: four species of Strongyloides, S. stercoralis (a parasite of humans and dogs), Strongyloides ratti and Strongyloides venezuelensis (both parasites of rats and important laboratory models of nematode infection), and S. papillosus (a parasite of sheep); P. trichosuri (which infects the brushtail possum Trichosurus vulpecula); and the free-living nematode Rhabditophanes6.
To investigate the genomic and molecular basis of parasitism in these nematodes, we compared (i) the genomes and gene families of the parasitic (Strongyloides and Parastrongyloides, the Strongyloididae) and free-living (Rhabditophanes) taxa (Fig. 1); (ii) the transcriptomes of parasitic adult females, free-living adult females and iL3s from S. ratti and S. stercoralis; and (iii) the proteomes of parasitic and free-living females of S. ratti. We have identified the genes present in the parasitic species and the genes and gene products uniquely upregulated in the parasitic stages of S. stercoralis and S. ratti; together, these are the major genomic and molecular adaptations to the parasitic lifestyle of these nematodes.
We have produced a high-quality 43-Mb reference genome assembly for S. ratti (Supplementary Note), with its two autosomes15 assembled into single scaffolds and the X chromosome15 assembled into ten scaffolds (Fig. 2 and Table 1). This assembly is the second most contiguous assembled nematode genome after the Caenorhabditis elegans reference genome16. We also produced high-quality draft assemblies of the 42- to 60-Mb genomes of S. stercoralis, S. venezuelensis, S. papillosus, P. trichosuri and Rhabditophanes sp. KR3021, which were 95.6–99.6% complete (Supplementary Table 1). With GC contents of 21% and 22%, respectively, the S. ratti and S. stercoralis genomes are the most AT rich reported thus far for nematodes (Supplementary Table 1). The ∼43-Mb S. ratti and S. stercoralis genomes are small compared with the genomes of other nematodes. However, the total protein-coding content of each nematode genome is similar (18–22 Mb versus 14–30 Mb in eight outgroup species; Supplementary Table 1). Significant loss of introns, as well as shorter intergenic regions, accounts for the smaller genomes in the present study (Spearman's correlation between genome size and intron number ρ = 0.91, P < 0.001 and size of intergenic regions ρ = 0.63, P = 0.02; Supplementary Table 2). However, parsimony analysis of intronic positions conserved in two or more species showed that substantial intron losses occurred before the evolution of the Rhabditophanes-Parastrongyloides-Strongyloides clade (Supplementary Fig. 1) and are therefore not an adaptation associated with parasitism.
The canonical view of a nematode chromosome, defined nearly 20 years ago using C. elegans autosomes (and later confirmed in Caenorhabditis briggsae17), is of a gene-dense, repeat-poor 'center' of conserved genes (defined by homology with yeast genes16), flanked by two gene-poor, repeat-rich 'arms' in which most genes are less strongly conserved. S. ratti is the first non-Caenorhabditis nematode whose whole chromosomes have been assembled, and it presents a strikingly different organization, with relatively little variability in gene density, repeat density or gene conservation to yeast genes along its autosomes (Supplementary Figs. 2 and 3).
Synteny is highly conserved within the parasitic Strongyloididae but is much less conserved between this family and Rhabditophanes (Fig. 2). Scaffolds of the parasitic species largely correspond to blocks from a particular S. ratti chromosome but in a scrambled order. This suggests that intrachromosomal rearrangement is frequent but interchromosomal rearrangement is rare, a common phenomenon in nematode chromosome evolution17,18,19. The notable exceptions are the S. papillosus and S. venezuelensis scaffolds that have many blocks that are syntenic to both S. ratti chromosomes I and X (Supplementary Table 3). This pattern of synteny likely reflects the fusion event between chromosomes I and X in these species20,21,22. Associated with this fusion is a change in the chromosome biology of sex determination in these species. S. papillosus undergoes chromatin diminution (where a chromosome fragments, after which part of the chromosome is eliminated during mitosis) to mimic the XX/XO sex-determining system of S. ratti23 and S. stercoralis20.
By analyzing the differential coverage of mapped sequence data from iL3s (which are all female) and adult males, we were able to identify regions of the S. papillosus X-I fusion chromosome that are eliminated from males during diminution (Supplementary Table 4). Six scaffolds were identified from the diminished region using existing genetic markers (Supplementary Table 5), but our read depth approach extended this map to 153 scaffolds (18% of the assembly; 10.9 Mb). Interestingly, some genes with orthologs on the X chromosome of S. ratti are not diminished in S. papillosus, so the dosage of these genes in males has changed since the species diverged, including three genes on S. papillosus chromosome II (confirming earlier work20) and 33 genes that lie in non-diminished regions of the X-I fusion chromosome (Supplementary Table 6).
Extensive rearrangement of the mitochondrial gene order
The S. stercoralis mitochondrial genome is highly rearranged compared with the genomes of nematodes from clades I, III and V (ref. 24). Manual finishing of the mitochondrial genomes of the six species showed that the Rhabditophanes mitochondrial genome consists of two circular chromosomes, a feature of some other nematode species25. Compared with eight outgroup species, Rhabditophanes sp. KR3021 has a conventional gene order but Strongyloides species and P. trichosuri have highly rearranged mitochondrial genomes (Fig. 2 and Supplementary Table 7). Similar observations have been reported in other clade IV parasitic nematodes25,26,27,28, and there is evidence of mitochondrial recombination27,29, which is rarely observed in animals30. Consistent with published nematode mitochondrial genomes, the gene-based phylogeny of the mitochondrial genome (Fig. 2) conflicts with phylogenies based on nuclear genes27,31,32, and the rearranged gene order of the mitochondrial genomes of Strongyloides species is accompanied by nucleotide divergence (Fig. 2).
Gene families associated with the evolution of parasitism
We predicted 12,451–18,457 genes across the six genomes, numbers comparable to those in other nematode species (Table 1 and Supplementary Fig. 4). We then used Ensembl Compara (Supplementary Note)33 to identify orthologs and gene families (Supplementary Table 8) in these and eight outgroup species, encompassing four further nematode clades (Supplementary Fig. 4). By pinpointing when a new gene family arose and where a family has expanded or contracted, we could determine which gene families are associated with the evolution of parasitism. The largest acquisition of gene families (1,075 families) was found on the branch leading to the parasitic nematodes, including the Strongyloides species and P. trichosuri (Fig. 1 and Supplementary Fig. 4). Despite this highly dynamic pattern of gene gain and loss within each species' genome, the proportion of genes specific to Strongyloides (and Strongyloididae) is consistent across the phylogeny (Fig. 1). The branches leading to the five parasitic species also showed greater expansion of genes and families of genes as compared to that in the free-living Rhabditophanes sp. KR3021. Gain and expansion of gene families in these parasitic species likely reflect the adaptations required by these species to parasitize vertebrate hosts while maintaining a free-living life cycle phase.
The two most expanded Strongyloides gene families encode astacin-like34 and SCP/TAPS (SCP/Tpx-1/Ag5/PR-1/Sc7 (ref. 35), also known as CAP-domain) proteins, present in multiple subfamilies (according to Ensembl Compara analysis (Supplementary Table 8) and protein domain combinations (Supplementary Table 9)). The astacin family of metallopeptidases was the most expanded, with 184–387 copies in the Strongyloides-Parastrongyloides species as compared with Rhabditophanes and with eight outgroup species, showing that this expansion accompanies the evolution of parasitism (Fig. 1 and Supplementary Table 10). Among the outgroup species, the hookworm Necator americanus36 has 82 astacin-encoding genes and the free-living C. elegans has 40 (ref. 34).
SCP/TAPS proteins are often immunomodulatory molecules in parasitic nematodes35 and have been investigated as potential vaccine candidates against N. americanus37,38. We found 89–205 SCP/TAPS-encoding genes in the Strongyloides genomes, including nine subfamilies not present in P. trichosuri, Rhabditophanes sp. KR3021 or the eight outgroup species (Supplementary Tables 8 and 10). In N. americanus, there are 137 SCP/TAPS-encoding genes36, suggesting that this gene family has independently expanded twice, in nematode clades IV and V.
Additional gene expansions included receptor-type protein tyrosine phosphatases, which have a putative role in signaling39 and are expanded in Strongyloides and Parastrongyloides (52–75 genes) compared with Rhabditophanes (13 genes) and the eight outgroup species (up to 39 genes). Acetylcholinesterase-encoding genes were expanded in Strongyloides and Parastrongyloides (30–126 genes) compared with Rhabditophanes (1 gene) and our outgroup species (1–5 genes). Many parasitic nematodes secrete acetylcholinesterases, which are thought to facilitate their maintenance in hosts40, and the expansion of this gene family in these parasitic species is consistent with this role. Some families show subclade-specific expansion; for instance, S. papillosus and S. venezuelensis have a paralogous expansion of genes encoding Speckle-type POZ domains41 (92–130 genes) compared with S. ratti and S. stercoralis (9 or 10 genes) (Fig. 1 and Supplementary Table 8).
No function or annotation could be assigned to approximately one-third (26–37%) of the genes present in the six species, but 50% of these genes could be assigned to novel gene families. The six largest of these families occurred only in Strongyloides and Parastrongyloides, comprising a total of 630 genes. We have named these Strongyloides genome project families sgpf-1 to sgpf-6. Members of sgpf-1 and sgpf-5 are predicted to encode proteins with signal peptides that are highly glycosylated (Supplementary Table 11).
Expanded gene families are upregulated in parasitic stages
We identified genes and gene families that are likely to have a key role in the parasitic lifestyle of S. ratti and S. stercoralis by comparing the transcriptomes of parasitic and free-living female stages. We generated S. ratti transcriptome data and used previously published S. stercoralis data42. A total of 909 S. ratti and 1,188 S. stercoralis genes were upregulated in parasitic females as compared with free-living females (edgeR, fold change > 2, false discovery rate (FDR) < 0.01; Supplementary Tables 12 and 13), of which 423 S. ratti and 457 S. stercoralis orthologous genes were upregulated in the parasitic female stage of both species (Supplementary Table 14).
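The thresholds quoted above (fold change > 2, FDR < 0.01) can be illustrated with a minimal sketch. It assumes a results table has already been exported from edgeR, which reports fold changes on a log2 scale; the `DEResult` record, the function name and all gene identifiers and values below are hypothetical and are not taken from the Supplementary Tables.

```python
import math
from typing import List, NamedTuple


class DEResult(NamedTuple):
    gene_id: str             # hypothetical identifier
    log2_fold_change: float  # parasitic versus free-living female
    fdr: float               # multiple-testing-adjusted P value


def upregulated_in_parasitic(results: List[DEResult],
                             min_fold_change: float = 2.0,
                             max_fdr: float = 0.01) -> List[str]:
    """Return IDs of genes passing both thresholds (fold change > 2, FDR < 0.01)."""
    min_log2_fc = math.log2(min_fold_change)  # fold change of 2 equals log2FC of 1
    return [r.gene_id for r in results
            if r.log2_fold_change > min_log2_fc and r.fdr < max_fdr]


# Toy input; the values are invented for illustration only.
toy = [
    DEResult("geneA", 3.2, 0.0005),   # passes both thresholds
    DEResult("geneB", 0.8, 0.0001),   # fold change too small
    DEResult("geneC", 2.5, 0.04),     # FDR too high
    DEResult("geneD", -4.0, 0.0001),  # higher in free-living females instead
]
print(upregulated_in_parasitic(toy))
```

Genes shared between species would then be found by intersecting the two filtered lists through an ortholog mapping, which is not shown here.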
The two most expanded Strongyloides gene families—encoding SCP/TAPS35 and astacin-domain43,44,45,46 proteins—dominated the list of genes differentially expressed by parasitic females. In S. ratti and S. stercoralis, respectively, 58 and 62% of putative astacin-like genes and 57 and 71% of SCP/TAPS genes were differentially expressed between parasitic and free-living females (Fig. 3 and Supplementary Tables 10 and 13). However, other paralogously expanded genes were not enriched among the upregulated genes, suggesting that they may not be important for parasitism. Both Strongyloides and Parastrongyloides infect their hosts by skin penetration; the larvae then migrate through the host, and adult females in the host live in the mucosa of the small intestine47,48 where they feed on the host. Astacins are metallopeptidases that have previously been associated with a role in tissue migration by nematode infective larvae44,49. Around half of the putative astacin-like proteins in Strongyloides species contain the canonical zinc-binding motif (HEXXHXXGXXH) of astacin active sites and likely have a role in penetrating the host mucosa in which parasitic females live. Teasing apart the role of different astacin gene family members in the migration and gut-dwelling phases of this life cycle could provide insights to allow new therapeutic interventions to be developed. For S. ratti and S. stercoralis, respectively, 63 and 53% of the SCP/TAPS genes upregulated in parasitic females encode a signal peptide, suggesting that the proteins may be secreted from the worm into the host. An immunomodulatory role for SCP/TAPS proteins has also been proposed on the basis of the inhibitory effect that these proteins have on neutrophil and platelet activity in hookworm infections35,50,51.
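Screening predicted proteins for the canonical zinc-binding motif (HEXXHXXGXXH, where X is any residue) can be sketched with a regular expression. This is an illustrative sketch rather than the method used in this study; the function name and the two toy sequences are hypothetical.

```python
import re

# HEXXHXXGXXH, with X standing for any single amino acid letter.
ZINC_BINDING_MOTIF = re.compile(r"HE[A-Z]{2}H[A-Z]{2}G[A-Z]{2}H")


def has_zinc_binding_motif(protein_sequence: str) -> bool:
    """Return True if the sequence contains the astacin active-site motif."""
    return ZINC_BINDING_MOTIF.search(protein_sequence.upper()) is not None


# Toy sequences, invented for illustration only.
with_motif = "MKLHEAAHAAGAAHTT"     # contains HEAAHAAGAAH
without_motif = "MKLHEAAHAAAAAHTT"  # conserved glycine replaced by alanine

print(has_zinc_binding_motif(with_motif))
print(has_zinc_binding_motif(without_motif))
```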
Other gene families commonly upregulated in the parasitic females of both species, as compared with free-living females and iL3s, included ones encoding transthyretin-like proteins, prolyl endopeptidases, acetylcholinesterases, trypsin inhibitors and aspartic peptidases (Fig. 3 and Supplementary Table 15). The transthyretin-like genes had some of the highest fold changes in expression of genes upregulated in parasitic females (Supplementary Table 13). Transthyretin-like genes constitute a large, nematode-specific gene family52, are expressed in adult parasitic stages53,54,55 and are distant relatives of the vertebrate transthyretins that are involved in transporting thyroid hormones56. While some aspartic peptidases are essential for the digestion of host hemoglobin in blood-borne parasites57,58, it has been proposed that others are involved in digesting other host macromolecules59.
Hypothetical protein-coding genes accounted for 20–37% of the differentially expressed genes from pairwise comparisons of parasitic females, free-living females and iL3s, and these included genes with the highest relative expression levels (Supplementary Table 13). These novel genes are likely to be important to these distinctive phases of the life cycle, including in parasitism. Three small novel gene families (sgpf-7 to sgpf-9) were predominantly upregulated in S. ratti parasitic females, with two of the genes predicted to encode predominantly secretory or membrane-targeted proteins (Supplementary Table 11). In contrast, the largest hypothetical protein-coding gene families, sgpf-1 to sgpf-6, accounted for only a small proportion (1% in both S. ratti and S. stercoralis) of all differentially expressed hypothetical protein-coding genes, suggesting that they do not have roles in parasitism.
Using gene ontology (GO) annotations to summarize the putative functions of the upregulated genes identified distinct differences between the life cycle stages of both species (Fig. 3 and Supplementary Table 16). The genes upregulated in iL3s appear to be associated with sensing the environment and with signal transduction and were the most consistent between S. ratti and S. stercoralis. The products of genes expressed in free-living females have core metabolic and growth-related roles (such as in cytoskeleton and chromatin). In parasitic stages, the dominant functional categories were proteases, consistent with the abundant astacins (Fig. 3 and Supplementary Table 16).
The products of putative parasitism genes are secreted
In parallel, we compared the somatic proteomes of parasitic and free-living females of S. ratti. Of 1,266 proteins detected overall, 569 were comparatively upregulated in parasitic females and 409 were comparatively upregulated in free-living females (Supplementary Tables 12 and 17). We found a modest overlap between the transcriptome and somatic proteome: 6% of genes upregulated in the parasitic female transcriptome were also upregulated in the proteome, and 10% of genes upregulated in the free-living female transcriptome were also upregulated in the proteome (Supplementary Fig. 5 and Supplementary Table 18). A poor concordance between transcript and peptide abundance has been reported in many systems60,61,62 and likely reflects post-translational processes that decouple protein and mRNA abundance. In the present study, this may be compounded by the excretion and/or secretion of many gene products from parasitic stages to allow these proteins to interact with the host. Indeed, 43% of genes upregulated in the parasitic female transcriptome are predicted to encode signal peptides, compared with 26% of genes upregulated in the free-living female. Furthermore, while several of the putative parasitism gene families were highly upregulated in the somatic proteome (aspartic peptidases, prolyl endopeptidases and acetylcholinesterases; Supplementary Table 17), we found only five astacin-like and no SCP/TAPS proteins (Supplementary Fig. 5). To address this, we extended the analysis to the excretory/secretory (ES) proteome data of Soblik et al.63.
In the ES proteome, we detected an additional 882 proteins and found greater consistency with the parasitic female transcriptome: 13% of the parasitic female ES proteins overlapped with the genes upregulated in the transcriptome (Supplementary Table 18). We also found 25 astacin and 14 SCP/TAPS gene products in the ES proteome. Other gene families highly upregulated in the parasitic female transcriptome were also dominant in the parasitic ES proteome, including prolyl endopeptidases, acetylcholinesterases and transthyretin-like proteins (Supplementary Table 19). Protein products of the novel gene families sgpf-1 and sgpf-5 were also identified in the ES products of both parasitic and free-living females (Supplementary Table 11). Other parasitic nematodes have been noted to have many protease-encoding genes, and different species appear to have expanded different protease families36,64,65,66. Together, these and our findings suggest that expansion of protease-encoding genes and secretion of extensive quantities of proteases are likely to be essential features of nematode parasitism. These proteases are, presumably, used to penetrate host tissue, acquire resources from the host and protect the parasite from host-induced harm.
Parasitism-associated genes are in coexpressed clusters
We observed that genes upregulated in the parasitic females and iL3s were often physically clustered in the genome, more so than for genes upregulated in the free-living females (Supplementary Table 20). To test whether this clustering was significant, we asked whether clusters of three or more adjacent genes, upregulated in the same life cycle stage, occurred more often than would be expected by chance. We found that 31%, 4% and 26% of upregulated genes were in such clusters in S. ratti parasitic female, free-living female and iL3, respectively, whereas in S. stercoralis this was 34%, 2% and 34% (Supplementary Table 20). This clustering is more than would be expected by chance (Supplementary Fig. 6 and Supplementary Table 20). The clusters in parasitic females were larger (19 and 16 genes in the largest S. ratti and S. stercoralis clusters, respectively) than those of the iL3 (9 and 14 genes) and free-living females (3 genes) (Supplementary Table 20). Although nematodes, including S. ratti67, have operons, these clusters are unlikely to be operons because (i) the average intergenic distance among clustered genes does not differ from the genome-wide average (Supplementary Fig. 6) and (ii) cluster members include genes on both strands.
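The clustering test described above can be sketched as follows. The sketch assumes a single scaffold represented as an ordered list of flags (True where a gene is upregulated in the stage of interest) and uses label shuffling as the null model; the exact randomization procedure used in this study is described in the Supplementary material and may differ. All function names and the toy input are hypothetical.

```python
import random
from itertools import groupby
from typing import List


def genes_in_clusters(flags: List[bool], min_run: int = 3) -> int:
    """Count upregulated genes lying in runs of >= min_run adjacent upregulated genes."""
    total = 0
    for is_up, run in groupby(flags):
        run_length = len(list(run))
        if is_up and run_length >= min_run:
            total += run_length
    return total


def permutation_p_value(flags: List[bool], min_run: int = 3,
                        n_perm: int = 1000, seed: int = 0) -> float:
    """One-sided P value: how often shuffled labels cluster at least as much as observed."""
    rng = random.Random(seed)
    observed = genes_in_clusters(flags, min_run)
    shuffled = list(flags)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if genes_in_clusters(shuffled, min_run) >= observed:
            hits += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy scaffold: 50 genes in order, 5 upregulated genes in a single block.
toy_flags = [False] * 20 + [True] * 5 + [False] * 25
print(genes_in_clusters(toy_flags))
print(permutation_p_value(toy_flags))
```

Shuffling labels keeps the number of upregulated genes fixed, so the test asks only whether their positions are more contiguous than expected. A genome-wide version would need to treat scaffold boundaries as breaks between runs.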
Clusters of genes upregulated in the parasitic female were more likely to comprise genes from the same gene family. The majority (88 and 73% for S. ratti and S. stercoralis, respectively) of these parasitic female clusters were of genes belonging to the same Compara gene family; this is greater than that observed for iL3 (8–10%) (Supplementary Tables 20, 21, 22). Two gene families dominated parasitic female clusters: astacins (24 and 23% of parasitic female clusters for S. ratti and S. stercoralis, respectively) and SCP/TAPS (15 and 11%). Tandem expansions of astacin and SCP/TAPS genes could provide a plausible explanation for the preponderance of these gene families in the parasitic female expression clusters. However, even with exclusion of the astacin and SCP/TAPS families, most remaining parasitic female clusters still comprised genes from the same gene family (85 and 65% for S. ratti and S. stercoralis, respectively); fewer clusters from the same gene family occurred for iL3 (7 and 9%) compared to parasitic female (Supplementary Table 21).
Phylogenetic analysis of astacins, including those from the eight outgroup species, showed that 139 S. ratti genes form one distinct clade (Fig. 4), presumably derived from a single ancestral astacin gene. Similarly, the S. ratti SCP/TAPS gene family has almost exclusively expanded from one ancestral gene (Fig. 4). These gene clusters likely arose by tandem duplication of genes, as has occurred for other large gene families, for example in C. elegans18. However, in contrast to C. elegans, physical adjacency of the duplicated genes has been maintained in Strongyloides, perhaps as a result of the expansions being recent and therefore not yet broken up by recombination. Alternatively, the adjacency may be functional, for example, if there is pressure to maintain a common regulatory environment. Clustering of gene families was relatively rare among Rhabditophanes sp. KR3021 and the eight outgroup species (Supplementary Table 21), meaning that this clustering is specific to the Strongyloides-Parastrongyloides lineage and thus to the parasitic lifestyle in this clade.
The clusters of genes upregulated in the parasitic females were themselves chromosomally clustered, forming 'parasitism regions' (Fig. 4). In S. ratti, one-third of genes upregulated in the parasitic female are concentrated in three regions of chromosome II, most notably in a 3.6-Mb region at one end of the chromosome, comprising 171 genes that were upregulated in the parasitic female transcriptome (Supplementary Fig. 2). A similar pattern is evident in S. stercoralis, where seven scaffolds and contigs with a high density of genes upregulated in the parasitic female also belong to chromosome II; 46% of the 171 genes upregulated in S. ratti belong to just eight different gene families, including those encoding aspartic peptidases, astacin-like proteins, SCP/TAPS proteins, transthyretin-like proteins and trypsin inhibitor–like proteins. This is the first report, to our knowledge, of chromosomal clustering of genes likely to be important in nematode parasitism, and this clustering hints at possible regulatory mechanisms for parasite development.
Understanding the molecular and genetic differences between parasitic and free-living organisms is of fundamental biological interest and is essential to identifying novel drug targets and other methods to control parasitic nematodes and the diseases that they cause. We have undertaken a comparative genomics study of six taxa from an evolutionary clade that transitions from a free-living to a parasitic lifestyle, which we combined with transcriptomic and proteomic analyses of parasitic and free-living female stages of Strongyloides species. Together, this is a powerful way to discover the molecular adaptations to parasitism among these nematodes. We find that a preponderance of the genes that are expanded in parasitic species are specifically used in the parasitic stages and are within genomic clusters, concentrated in regions of chromosome II. This is consistent with the idea that the within-host stages of parasitic nematodes deploy a specific biology that enables them to be successful parasites. The Strongyloides proteome and transcriptome have limited overlap, as has been observed in other systems. For the Strongyloides clade, we find that astacin- and SCP/TAPS-encoding genes are prominent among parasitism-associated genes. Other parasitic nematodes appear to have expanded the number of protease-encoding genes in their genome, which also appear to be used predominantly during the within-host stages. In Strongyloides, we have also found genomic clustering of these and other likely parasitism-associated genes, which is likely to have been initiated during the adaptation to parasitism, followed by subsequent repeated gene duplication, associated with adaptation to different hosts. This genomic arrangement may facilitate the expression of a parasitic transcriptional program by these parasites. Operons have been demonstrated in Strongyloides, and it will be important to determine whether these parasitism-associated genes are under operonic control.
Strongyloides is a particularly amenable laboratory system—both S. ratti and S. venezuelensis can be maintained in the laboratory in their natural rat host, as well as in other rodents, and the parasite of humans S. stercoralis can also be maintained in the laboratory. In addition to providing a compelling model of the evolution of parasitism, transgenesis of Strongyloides and Parastrongyloides is possible68,69,70,71 uniquely among parasitic nematodes, which will allow functional genomic studies, directed by our findings, to further explore the genetic basis of nematode parasitism.
Parasite material, sequencing and assembly.
S. ratti, S. stercoralis, S. venezuelensis and S. papillosus larvae were obtained from fecal cultures of infected laboratory animals; for P. trichosuri and Rhabditophanes sp. KR3021, material was obtained from stages grown on agar plates (Supplementary Fig. 7) (full details on ethical approval are in the Supplementary Note). To produce the S. ratti reference genome, a combination of Sanger capillary, 454 and Illumina-derived sequence data was used, whereas data for the other species were generated using Illumina technology. The S. ratti genome was initially assembled using Newbler v.2.3 (ref. 75) (for the capillary and 454 sequence data) and ABySS v.1.3.1 (ref. 76) (for the Illumina-derived data); Illumina paired-end reads were mapped to this assembly with SMALT (H. Ponstingl, personal communication). The genomes of the other species, except S. venezuelensis, were assembled using a combination of the SGA assembler77 and Velvet78 from 100-bp paired-end Illumina reads, produced from short-fragment (∼500-bp)79 and 3-kb mate-pair libraries80. Illumina reads were used in IMAGE81 and Gapfiller82 software to fill gaps and in iCORN83 to correct base errors. Gap5 (ref. 84) was used to manually extend and link scaffolds using Illumina read pairs. Genetic markers20 were mapped to the S. ratti assembly to order and orient scaffolds and to the S. papillosus assembly to assign scaffolds to chromosomes and regions of putative chromosomal diminution. The S. venezuelensis genome was assembled using the Platanus assembler85 and improved as described above for the other species. The resulting v2 S. venezuelensis assembly was further scaffolded using an optical map produced by an Argus optical mapping platform (Opgen). CEGMA v2 (ref. 86) was used to assess the completeness of each assembly.
Assembled sequences were scanned for contamination from other species using a series of BLASTX and BLASTP87 searches against vertebrate and invertebrate sequence databases. Repeat sequences in the assemblies were characterized using RepeatModeler and TransposonPSI.
Mitochondrial genomes were assembled using the MITObim assembler88 with the C. elegans mitochondrial genes as seeds. The gene order of each assembly was confirmed by PCR. A mitochondrial protein-coding gene sequence phylogeny was constructed using RAxML v7.2.8 (ref. 89).
Identifying regions that undergo chromatin diminution or belong to the X chromosome.
To identify chromosomal regions that undergo chromatin diminution in S. papillosus and scaffolds that belong to the X chromosome in S. ratti, S. stercoralis and P. trichosuri, DNA of males and females from each species was sequenced and mapped to the appropriate reference genome using SMALT v0.7.4 (H. Ponstingl, personal communication). The read depth was calculated for each scaffold using the BedTools function genomecov90, and all scaffolds were classified as diminished/X-chromosome or non-diminished/autosomal on the basis of differences in read coverage. Because males are hemizygous for the diminished region in S. papillosus20 and for the X chromosome in the other species, a male:female read depth ratio of 0.5:1 was expected in diminished or X-chromosome scaffolds relative to autosomes, whereas in non-diminished/autosomal regions the ratio would be expected to be close to 1:1.
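The read depth classification can be sketched under two assumptions that are not stated in the text: that the median male:female ratio across scaffolds represents the autosomal baseline (used here to correct for differences in sequencing depth between libraries), and that a cut-off of 0.75, midway between the expected ratios of 0.5 and 1, separates the two classes. The function name and toy depths are hypothetical.

```python
from statistics import median
from typing import Dict


def classify_scaffolds(male_depth: Dict[str, float],
                       female_depth: Dict[str, float],
                       threshold: float = 0.75) -> Dict[str, str]:
    """Label each scaffold from its normalised male:female read depth ratio."""
    # Raw ratio per scaffold; scaffolds without female coverage are skipped.
    raw = {s: male_depth[s] / female_depth[s]
           for s in male_depth
           if female_depth.get(s, 0) > 0}
    if not raw:
        return {}
    # Assumption: most scaffolds are autosomal, so the median ratio is the 1:1 baseline.
    baseline = median(raw.values())
    if baseline == 0:
        raise ValueError("median male:female ratio is zero; cannot normalise")
    labels = {}
    for scaffold, ratio in raw.items():
        normalised = ratio / baseline
        labels[scaffold] = ("diminished/X-chromosome" if normalised < threshold
                            else "non-diminished/autosomal")
    return labels


# Toy mean depths per scaffold; the male library is sequenced to half the female depth.
toy_male = {"scaffold1": 30.0, "scaffold2": 15.0, "scaffold3": 31.0, "scaffold4": 29.0}
toy_female = {"scaffold1": 60.0, "scaffold2": 60.0, "scaffold3": 62.0, "scaffold4": 58.0}
print(classify_scaffolds(toy_male, toy_female))
```

In the toy input only scaffold2 has a normalised ratio of 0.5 and is labelled as diminished or X-linked; the others sit at the baseline of 1.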
Gene prediction and functional annotation.
Genes were predicted using Augustus91—with a training set of approximately 200–400 manually curated genes per species, aligned transcript data and S. ratti protein sequences as hints—supplemented with non-overlapping predictions from MAKER92. If there was more than one alternative splice pattern for a gene prediction in the combined Augustus and MAKER gene set, we only kept the transcript corresponding to the longest predicted protein. Astacin gene models and a subset of SCP/TAPS gene models from S. ratti, S. venezuelensis and S. stercoralis were manually curated before phylogenetic analyses.
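The rule of keeping one transcript per gene can be sketched as follows. The input layout (gene identifier, transcript identifier, protein sequence) and the tie-breaking behavior (the first transcript seen is kept when lengths are equal) are assumptions for illustration only.

```python
def keep_longest_isoform(transcripts):
    """Return {gene_id: transcript_id} for the longest predicted protein.

    transcripts: iterable of (gene_id, transcript_id, protein_sequence).
    Ties are resolved in favor of the transcript encountered first.
    """
    best = {}
    for gene_id, transcript_id, protein in transcripts:
        current = best.get(gene_id)
        # Replace only when the new protein is strictly longer.
        if current is None or len(protein) > len(current[1]):
            best[gene_id] = (transcript_id, protein)
    return {gene_id: transcript_id for gene_id, (transcript_id, _) in best.items()}
```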
A protein name was assigned to each predicted protein on the basis of manually curated orthologs in UniProt93 from selected species (human, zebrafish, Drosophila melanogaster, C. elegans and Schistosoma mansoni orthologs), where possible. If a predicted protein was not assigned a protein name on the basis of its orthologs, then a protein name was assigned using InterPro94 domains in the protein.
GO terms were assigned by transferring GO terms from human, zebrafish, C. elegans and D. melanogaster orthologs using an approach based on the Ensembl Compara approach for transferring GO terms to orthologs in vertebrate species33 but modified for improved accuracy in transferring GO terms across phyla. Manually curated GO annotations were downloaded from the GO Consortium website95, and, for a particular predicted protein in the present study, the manually curated GO terms were obtained for all its human, zebrafish, C. elegans and D. melanogaster orthologs. From this set, the last common ancestor term (in the GO hierarchy) was found for each pair of GO terms from orthologs of two different species (for example, a C. elegans ortholog and a zebrafish ortholog) and then transferred to our predicted protein. GO terms of the three possible types (molecular function, cellular component and biological process) were assigned to predicted proteins in this way. Additional GO terms were identified using InterProScan96.
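The last-common-ancestor step can be illustrated on a toy hierarchy. The sketch below is a simplified reading of the procedure, not the code used here: the term names are placeholders rather than real GO identifiers, the hierarchy is supplied as a child-to-parents mapping, and because GO is a directed acyclic graph the function may return more than one most-specific shared term for a pair.

```python
from itertools import combinations


def ancestors(term, parents):
    """Return the term itself plus every term reachable through parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def last_common_ancestors(a, b, parents):
    """Most specific terms shared by the ancestries of a and b."""
    common = ancestors(a, parents) & ancestors(b, parents)
    # Drop any shared term that is a proper ancestor of another shared term.
    return {
        t for t in common
        if not any(t != u and t in ancestors(u, parents) for u in common)
    }


def transfer_terms(ortholog_terms, parents):
    """ortholog_terms: dict species -> set of curated terms of its orthologs."""
    transferred = set()
    # Only pairs of terms coming from two different species are compared.
    for sp1, sp2 in combinations(sorted(ortholog_terms), 2):
        for a in ortholog_terms[sp1]:
            for b in ortholog_terms[sp2]:
                transferred |= last_common_ancestors(a, b, parents)
    return transferred
```

With a hypothetical hierarchy in which two specific peptidase terms share the parent "peptidase", orthologs annotated with the two different specific terms yield the transferred term "peptidase", whereas orthologs that agree on a specific term yield that term unchanged.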
Gene orthology and species tree reconstruction.
Eight outgroup species were used, encompassing four previously defined nematode clades9 (clade I, Trichinella spiralis and Trichuris muris; clade III, Ascaris suum and Brugia malayi; clade IV, Bursaphelenchus xylophilus and Meloidogyne hapla; clade V, Necator americanus and C. elegans), together with the six species from the present study to construct a Compara database using the Ensembl Compara pipeline33. The database was used to identify orthologs and paralogs, gene duplications and gene losses, as well as gene families shared among the species or subsets of the species or specific to one species.
In total, 4,437 gene families were identified that contained just one gene from each species and that were present in at least ten of the 14 species (the six species from the present study and the eight outgroups). An alignment for the proteins in each family was built using MAFFT v6.857 (ref. 97), poorly aligning regions were trimmed using GBlocks v0.91b and the remaining columns were concatenated. For each alignment, the best-fitting amino acid substitution model was identified as that minimizing the Akaike Information Criterion from the set of models available in RAxML v8.0.24 (ref. 89), testing models with both predefined amino acid frequencies and observed frequencies in the data, and all with the CAT model of rate variation across sites. A maximum-likelihood phylogenetic tree was constructed on the basis of the concatenated alignment, with each protein alignment an independent partition of these data, applying the best-fitting substitution model identified above to each partition. This inference used RAxML v8.0.24 with ten random addition-sequence replicates and 100 bootstrap replicates and otherwise used default heuristic search settings.
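The selection of families for the species tree can be sketched as a simple filter. The sketch assumes one reading of the criterion, namely that every species represented in a family contributes exactly one gene and that at least ten species are represented; the data layout and function name are hypothetical.

```python
from collections import Counter


def single_copy_families(families, min_species=10):
    """Select families with one gene per represented species.

    families: dict family_id -> list of (species, gene_id) tuples.
    Returns dict family_id -> {species: gene_id} for the families kept.
    """
    selected = {}
    for family_id, members in families.items():
        genes_per_species = Counter(species for species, _ in members)
        enough_species = len(genes_per_species) >= min_species
        single_copy = all(n == 1 for n in genes_per_species.values())
        if enough_species and single_copy:
            selected[family_id] = dict(members)
    return selected
```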
Analysis of intron-exon structure and synteny analysis.
Introns that were present in two or more species were identified from gene structures and full-gene nucleotide alignments of 208 single-copy orthologs using Scipio98 and GenePainter99. The output from GenePainter was passed to DOLLOP (PHYLIP package; see URLs) to infer intron gain and loss on every node of the species tree using maximum parsimony.
Whole-assembly nucleotide alignments were produced between S. ratti and the other five species using nucmer100. Each scaffold from the other species was assigned a chromosome on the basis of its nucmer alignment to an S. ratti chromosome. To identify syntenic regions, conserved blocks of three consecutive orthologous genes or more in the same order and orientation were defined by DAGchainer101, between the S. ratti reference and each of the other five species. To gain a high-level view of synteny, PROmer102 was used to identify very highly conserved sequence matches, on the basis of translated sequence, after which scaffolds from a particular species were ordered by matching to S. ratti chromosome and position in that chromosome and the matches were plotted using Circos103.
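The definition of a syntenic block (three or more consecutive orthologous genes in the same order and orientation) can be illustrated with a strict, gap-free sketch. This is a deliberate simplification and not a reimplementation of DAGchainer: gene positions are given as ranks along a single reference chromosome, no intervening genes are tolerated, inverted blocks are ignored and the tuple layout is hypothetical.

```python
def syntenic_blocks(pairs, min_genes=3):
    """Find runs of consecutive ortholog pairs in the same order and orientation.

    pairs: list of (ref_rank, ref_strand, scaffold, other_rank, other_strand),
    sorted by ref_rank, where ranks are gene indices along the reference
    chromosome and along the scaffold of the other species.
    """
    blocks, run = [], []

    def flush():
        # Keep the current run only if it is long enough, then reset it.
        if len(run) >= min_genes:
            blocks.append(list(run))
        run.clear()

    for pair in pairs:
        ref_rank, ref_strand, scaffold, other_rank, other_strand = pair
        if ref_strand != other_strand:
            # Orientation differs between species: the pair breaks any block.
            flush()
            continue
        if run:
            prev = run[-1]
            contiguous = (
                ref_rank == prev[0] + 1
                and scaffold == prev[2]
                and other_rank == prev[3] + 1
            )
            if not contiguous:
                flush()
        run.append(pair)
    flush()
    return blocks
```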
Transcriptome and proteome analyses.
For S. ratti and S. stercoralis, the transcriptomes of parasitic females, free-living females and iL3s were compared; we note that parasitic and free-living adult females will have eggs in utero. For S. ratti, free-living females were picked individually from cultures of S. ratti–infected rat feces, from which iL3s were also collected; parasitic females were collected by dissection of S. ratti–infected rats104. Two biological replicates were collected for parasitic and free-living females. These samples were divided approximately equally and used for both transcriptomic and proteomic analysis. A single biological sample was used for iL3 transcriptomic analysis. RNA was extracted using TRIzol, and poly(A) RNA was selected with Dynabeads, acoustically sheared and reverse transcribed to construct Illumina libraries that were sequenced. For S. stercoralis, we used previously published data42. RNA-seq data were analyzed using R v.3.0.2 and the Bioconductor package edgeR105 to identify genes differentially expressed in all pairwise combinations of the three life cycle stages.
For S. ratti, the proteome was also compared between the parasitic and free-living females. Equivalent samples of the material collected for the transcriptome analyses were used. Protein was extracted by freeze-thaw cycles, mechanical grinding and chemical extraction and digested with trypsin. The resulting peptide mixture was analyzed by liquid chromatography–mass spectrometry. Proteins were identified and quantified using Progenesis. For downstream analyses, at least two unique peptides were required to identify proteins. Protein abundance (iBAQ) was calculated from Progenesis.
For both the transcriptome and proteome data, GO analysis was performed in R using topGO v.2.16.0 and Fisher's exact test.
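The enrichment test for a single GO term can be written as a hypergeometric tail probability. The sketch below is a standalone illustration in Python rather than the R analysis used here; it assumes a one-sided test for over-representation and ignores the GO graph structure, and the parameter names are chosen for the example.

```python
from math import comb


def enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test P value for over-representation.

    N: genes in the background set
    K: background genes annotated with the GO term
    n: genes in the test set (for example, upregulated genes)
    k: test-set genes annotated with the GO term
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the probabilities of observing k or more annotated genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total
```

P values obtained for many GO terms would normally be adjusted for multiple testing before interpretation.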
For the analysis of the ES proteome63, converted raw spectral files were analyzed by the Mascot search engine, where an FDR <1% and a minimum of two significant peptides were required to identify proteins. Protein abundance was calculated from the Mascot algorithm emPAI.
Astacins and SCP/TAPS.
Genes encoding astacins and SCP/TAPS were identified using InterProScan. For these gene families, we aligned amino acid sequences of the members from all S. ratti and eight outgroup species using MAFFT97. The alignments were edited with TCS106 using the weighted option, and the distance matrix of the new alignment was calculated using ProtTest107. The phylogenetic tree was constructed by maximum likelihood using RAxML89 with 100 bootstrap replicates.
Clusters of genes were identified as three or more adjacent genes upregulated in the same stage of the life cycle. The members of a cluster were considered to share a common gene family where ≥50% of the genes belonged to the same Compara gene family. To investigate the number of clusters expected by chance for a particular life cycle stage, for n genes upregulated in a particular stage, we randomly selected n genes from the genome and calculated the number of clusters seen for the n random genes; this was repeated 1,000 times and the mean value was calculated.
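The cluster definition and the randomization used to estimate the number of clusters expected by chance can be sketched as follows. Several details are assumptions made for the example: each maximal run of adjacent upregulated genes on a scaffold is counted as one cluster, the family fraction is computed over all genes in the cluster, and the function names, data layout and fixed random seed are hypothetical.

```python
import random
from collections import Counter


def count_clusters(upregulated, gene_order, min_size=3):
    """Count maximal runs of >= min_size adjacent upregulated genes.

    upregulated: set of gene ids.
    gene_order: dict scaffold -> list of gene ids in positional order.
    """
    clusters = 0
    for genes in gene_order.values():
        run = 0
        for gene in genes:
            if gene in upregulated:
                run += 1
            else:
                if run >= min_size:
                    clusters += 1
                run = 0
        if run >= min_size:  # a run may reach the end of the scaffold
            clusters += 1
    return clusters


def shares_family(cluster, family_of, min_fraction=0.5):
    """True if >= min_fraction of the cluster's genes are in one gene family."""
    counts = Counter(family_of[g] for g in cluster if g in family_of)
    return bool(counts) and counts.most_common(1)[0][1] / len(cluster) >= min_fraction


def expected_clusters(n, gene_order, replicates=1000, seed=0):
    """Mean cluster count over random draws of n genes from the genome."""
    all_genes = [g for genes in gene_order.values() for g in genes]
    rng = random.Random(seed)
    total = 0
    for _ in range(replicates):
        total += count_clusters(set(rng.sample(all_genes, n)), gene_order)
    return total / replicates
```

Comparing the observed value of `count_clusters` for the genes upregulated in a stage with `expected_clusters` for the same n indicates whether physical clustering exceeds the chance expectation.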
URLs.
WormBase-ParaSite, http://parasite.wormbase.org/; World Health Organization soil-transmitted helminthiases, http://www.who.int/gho/neglected_diseases/soil_transmitted_helminthiases/en/; World Health Organization estimates of disease burden 2000–2012, http://www.who.int/healthinfo/global_burden_disease/estimates/en/index2.html; RepeatModeler, http://www.repeatmasker.org/RepeatModeler.html; TransposonPSI, http://transposonpsi.sourceforge.net/; SMALT, http://www.sanger.ac.uk/resources/software/smalt/; PHYLIP package, http://evolution.genetics.washington.edu/phylip.html.
The S. ratti, S. stercoralis, S. papillosus, S. venezuelensis, P. trichosuri and Rhabditophanes genome assemblies, predicted transcripts, protein and annotation (.GFF) files are available from WormBase-ParaSite and are registered with the European Nucleotide Archive (ENA) under BioProject accessions PRJEB125 (S_ratti_ED321_v5_0_4), PRJEB528 (S_stercoralis_PV0001_v2_0_4), PRJEB525 (S_papillosus_LIN_v2_1_4), PRJEB530 (S_venezuelensis_HH1_v2_0_4), PRJEB515 (P_trichosuri_KNP_v2_0_4) and PRJEB1297 (Rhabditophanes_sp_KR3021_v2_0_4). The raw genomic data are available from the ENA via the accessions detailed in Supplementary Table 23. The transcriptomic data for S. ratti are available from ArrayExpress under accessions E-ERAD-151 and E-ERAD-92. For S. venezuelensis, transcriptomic data are available from the DNA Databank of Japan (DDBJ) under BioProject accession PRJDB3457 (S. venezuelensis) (Supplementary Table 24).
We thank, from the Wellcome Trust Sanger Institute, C. Griffiths, D. Willey, R. Rance and DNA Pipelines; J. Keane and D. Gordon for bioinformatics support; M. Dunn for the S. venezuelensis optical map; A. Babbage for laboratory support; and M. Zarowiecki for gene finding and functional annotation advice. We thank for technical help L. Hughes and L. Weldon (University of Bristol); H. Massey, Jr., X. Li and H. Shao (University of Pennsylvania); D.K. Howe and R.I. Wernick (Oregon State University); H. Denise (European Bioinformatics Institute); M. Yabana (Tokyo Institute of Technology); and A. Hino and R. Tanaka (University of Miyazaki) and A. Toyoda (National Institute of Genetics) for sequencing. The S. ratti transcriptome and proteome work was funded by Wellcome Trust grant 094462/Z/10/Z awarded to M.V., J.W. and M.B. The S. ratti, S. stercoralis, S. papillosus, P. trichosuri and Rhabditophanes sp. KR3021 genome sequencing and the S. venezuelensis optical mapping were funded by Wellcome Trust grant 098051. The S. venezuelensis work was supported by Japan Society for the Promotion of Science (JSPS) KAKENHI (24310142, 21590466 and 24780044), KAKENHI for Innovative Areas 'Genome Science' (221S0002) and the Integrated Research Project for Human and Veterinary Medicine of the University of Miyazaki. I.J.T. was supported by Academia Sinica. Work was funded by grants AI050668 and AI105856 from the US National Institutes of Health (NIH) to J.B.L. and by Resource-Related Research Grant RR02512 from the US NIH to M. Haskins, which provided research materials for the study. J.D.S. received support from US NIH training grant AI060516. A.K. was supported by a predoctoral stipend from the Max Planck Society. Work by A.K., D.H. and A.S. was funded by the Max Planck Society.
Chromosomal regions that undergo chromatin diminution or belong to the X chromosome.
Compara gene families of the six species and eight outgroup species.
Astacin-like metallopeptidases and SCP/TAPS.
Results of edgeR analysis of differential gene expression in S. ratti and S. stercoralis.
Enrichment of gene ontology annotation terms among differentially expressed genes of S. ratti and S. stercoralis.
Results of LC-MS proteome analysis of S. ratti.
Analysis of the excretory/secretory (ES) proteome of S. ratti.
Results of analysis of gene clusters.