Main

Currently, no complete genome sequence information exists from lineages spanning the phylum Nematoda (Supplementary Fig. 1). Yet, such information is essential in understanding the evolution of the Nematoda analogous to the way that a basal chordate informed our understanding of vertebrate evolution1. To this end, we generated the genome sequence of T. spiralis, a food-borne, zoonotic parasite, to reveal molecular characters and evolutionary trends between this organism, evolutionarily distant parasitic and non-parasitic nematodes and a member of the next closest sequenced relatives, the arthropods. In so doing, we identified commonalities that link nematodes to other Metazoa members, as well as distinctions that define the Nematoda and differentiate T. spiralis from the other species investigated. The Trichinella assembly is 64 million bp in length and encodes at least 15,808 proteins, which makes this genome substantially smaller than that of the prototypical nematode, C. elegans.

Trichinellosis is a worldwide zoonotic disease. The nematode T. spiralis, the most common cause of human trichinellosis, is a member of a clade that diverged early in the evolution of the Nematoda. It differs substantially in biological and molecular characters from other crown groups2,3,4. The lineage giving rise to the genus Trichinella last shared a common ancestor approximately 275 million years ago (Lower Permian Period), whereas the diversification of extant Trichinella species occurred as recently as 16–20 million years ago (Miocene Epoch)5.

The life cycle of Trichinella spp. (Supplementary Fig. 2) begins when muscle tissue containing first-stage larvae (ML) is ingested by the new host. The ML rapidly develop into adults in the intestine, where they mate and produce newborn larvae (NBL). The NBL migrate from the intestines through the lymphatic system, eventually to the blood, and then they invade striated skeletal muscle cells to complete the cycle and become infectious to the next host. Intense inflammation is a primary cause of disease and involves myositis, myocarditis and encephalitis, the intensity of which depends on the number of parasites ingested. Currently, the genus consists of eight distinct species and/or genotypes that are further categorized as encapsulated or non-encapsulated, predicated upon the formation of a collagen envelope around the infected muscle cell. This capsule is believed to be a host-derived structure induced only by species that infect placental mammals and is unique to this genus. In addition to the formation of a collagen capsule, and contrary to most other parasitic nematodes, T. spiralis shows little host specificity among mammals, completes its entire life cycle in a single host, does not have a free-living stage and lives as an intracellular parasite within a single striated muscle cell. As such, this genus presents biological characteristics that markedly differ from what is common among most other nematodes.

Herein we compare the molecular characteristics of nematodes and other metazoans using the entire T. spiralis genome. This comparative approach allowed us to identify conserved protein and gene sequences with apparent archetypical standing for the phylum Nematoda. We found that intrachromosomal rearrangements were common throughout the phylum; however, this was in contrast to other characters such as protein family deaths and births, which showed a clear demarcation between a parasitic and non-parasitic nematode. In addition, unlike in D. melanogaster, the levels of gene loss and gain in each nematode species indicate that these events may have played a substantially larger role in the evolution of this phylum. The identification of these and other conserved characteristics, predicated in part upon this work, will advance more targeted research on pathogens from a phylum harboring thousands of pathogens that infect humans, animals and plants. The advances may one day provide holistic strategies to treat and control diseases caused by pathogens from across the Nematoda.

Results

Sequencing, assembly and gene organization

Data were generated from whole-genome shotgun sequencing and hierarchical map–assisted sequencing6. The assembly totaled 64 Mb (Online Methods, Supplementary Note and Supplementary Table 1), which is in line with recent genome size estimates made by flow cytometry (1 C = 71 Mb)6,7. The data provided a coverage level of 35-fold, with 15% of the supercontigs encompassing 90% of the genome. The T. spiralis fingerprint clone map enabled construction of 9 ultracontigs composed of 69 supercontigs representing 49 Mb, or 76% of the genome.

The repeat content of the T. spiralis genome is estimated at 18%. The repeats have a low GC content (27%) relative to the genome overall (34%) and to protein-coding regions (43%). The 15,808 protein-coding sequences occupy 26.6% of the genome at an average density of 272 genes per Mb. Although 15% of C. elegans genes are organized in operons8, spatial relationships of genes in T. spiralis do not readily support the existence of operons (Supplementary Note). This observation is consistent with prior studies that reported similar findings4. As such, the existence of operons in this nematode remains an open question. Further, T. spiralis lacks both the canonical SL1 trans-spliced leader found in most nematodes and the SL2 trans-spliced leader that is spliced onto transcripts from downstream genes in C. elegans operons. To date, at least 15 distinct spliced leaders encoded by 19 SL RNA genes have been identified in T. spiralis4; however, these putative spliced leaders show sequence variability at nearly all base positions, and we found them to be present in only 1% of the complementary DNAs (cDNAs) examined. It is likely, therefore, that the canonical SL1 and SL2 spliced leader sequences were not part of the genetic repertoire in nematodes that diverged early in the evolution of the Nematoda. This hypothesis is supported in part by our inability to identify canonical SL1 and SL2 sequences among Trichuris muris expressed sequence tags (ESTs) as well (data not shown). After comparison to an extensive collection of proteins from other species, 45% (7,251) of the predicted protein-coding genes were T. spiralis specific, of which 12% had EST confirmation (Supplementary Fig. 3). The amino acid composition of the predicted proteins in T. spiralis is similar to that observed in other nematodes9, other organisms (Supplementary Table 2) and other taxa10. In agreement with previous studies11, nematodes show a correlation between amino acid usage and the degree of codon degeneracy (R = 0.74).
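The relationship between amino acid usage and codon degeneracy noted above can be illustrated with a minimal sketch. It assumes a Pearson correlation computed over the 20 standard amino acids (the statistic behind the reported R is not specified in this section), and all function names are hypothetical. The degeneracy values are those of the standard genetic code.

```python
from collections import Counter
from math import sqrt

# Number of synonymous codons per amino acid in the standard genetic code
# (61 sense codons in total).
CODON_DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def amino_acid_usage(proteins):
    """Return the fraction of each standard amino acid across all sequences.

    Non-standard symbols (for example X or *) are ignored.
    """
    counts = Counter()
    for seq in proteins:
        counts.update(aa for aa in seq.upper() if aa in CODON_DEGENERACY)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in CODON_DEGENERACY}


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def usage_degeneracy_correlation(proteins):
    """Correlate amino acid usage with codon degeneracy (assumed Pearson R)."""
    usage = amino_acid_usage(proteins)
    amino_acids = sorted(CODON_DEGENERACY)
    return pearson_r(
        [CODON_DEGENERACY[aa] for aa in amino_acids],
        [usage[aa] for aa in amino_acids],
    )
```

In practice the input would be the full set of predicted protein sequences for a species; the sketch makes no claim about how the published value was derived.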

Genome evolution

The availability of the genome from a member of Dorylaimia expanded our abilities to evaluate genome evolution among highly divergent crown clades and to potentially identify factors underlying lineage diversification. We evaluated changes associated with nematode evolution in relation to (i) genome organization; (ii) births and deaths of gene families; (iii) gene duplications and deletions that have occurred within gene families; and (iv) linear organization of orthologous genes.

We evaluated organizational characteristics by comparing the genomes of T. spiralis and C. elegans. The number of predicted genes in T. spiralis is notably lower than the 20,140 genes identified in C. elegans, even though the two genomes show similar repeat content and gene density. A comparison of approximately 3,400 predicted orthologous genes (based on reciprocal best BLAST hits) showed that T. spiralis has a significantly shorter average intron size (191 bp compared to 391 bp; P = 6.5 × 10⁻⁶⁹) alongside an average exon size that is relatively similar for the two species (179 bp for T. spiralis and 226 bp for C. elegans; P = 7.0 × 10⁻³). Focusing only on predicted orthologous genes with 20 or more exons, the mean total length for all exons was significantly higher in C. elegans (P = 0.001). Comparisons of the domains in the Pfam database that are contained in orthologous pairs showed that C. elegans had significantly more domains compared to the orthologous T. spiralis genes (876 genes compared to 755 genes; P < 0.01). These differences coincide with the smaller size of the T. spiralis genome; however, we cannot rule out the possibility of higher numbers of gene fragments in T. spiralis resulting from less refined genome annotation.
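The reciprocal best hit criterion used to predict orthologs can be sketched as follows. This is a minimal illustration, assuming hits have already been parsed into (query, subject, bit score) tuples; the function names and the use of bit score as the ranking criterion are assumptions, not details taken from the Methods.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query_id, subject_id, bit_score) tuples.
    On ties, the first hit encountered is kept.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a_id, b_id) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted(
        (a, b) for a, b in best_ab.items() if best_ba.get(b) == a
    )
```

A pair is retained only when the best hit of a gene from the first species points to a gene in the second species whose own best hit points back to it, which is why paralogs with weaker scores drop out.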

Delineating gene family emergence and extinction within phylogenetically related organisms can identify molecular determinants that underlie species (and pathogen) adaptation and lineage or species evolution. Such an approach has been used in analyzing nematode ESTs12,13,14. Here we measured potential emergence and extinction events of protein families across Nematoda. The analysis included species from four major lineages that collectively span the phylum (C. elegans, Meloidogyne incognita15, Brugia malayi16 and T. spiralis). These species represent nematodes that are non-parasitic (C. elegans), parasitic in plants (M. incognita) and parasitic in animals (B. malayi and T. spiralis), thus representing diverse trophic ecologies. Arthropod (D. melanogaster17) and yeast (Saccharomyces cerevisiae18) species were used as outgroups. Markov clustering19 of the complete protein catalog (87,406 proteins) comprising all six species generated 12,163 protein families (Supplementary Table 3). Inter-specific protein families overlaid onto species phylogeny identified 702 protein families at the node between Nematoda and the outgroups (Fig. 1a and Supplementary Table 4). Of these nematode families, 274 families were common among all four members of the Nematoda. We screened the genes in the 274-family core nematode group (1,990 genes) against all available nematode ESTs and cDNAs and found 73% shared homology to nematode transcriptome data from 27 nematode genera and only 5% shared sequence homology to arthropods using the same cutoff value. These numbers do not preclude gains that may have occurred before the appearance of the Nematoda or gains relative to Drosophila that may still be present in other arthropods. In contrast, we identified 88 protein family deaths as common among the four nematodes relative to D. melanogaster. Protein family deaths outnumbered births for all three parasitic species, whereas in the non-parasitic species C. elegans, births outnumbered deaths four to one. The methods used here will allow future assessment of this tendency with the availability of additional genomes from other parasitic and non-parasitic nematodes. We observed emergence of new protein families in all nematode lineages, albeit less so in B. malayi. Accordingly, it is now possible to explore the relevance of protein families identified in the evolution of lineages within the Nematoda and across phyla.
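The idea of reading family births and deaths from presence and absence across species can be sketched in a few lines. This is a deliberately simplified illustration: it classifies each family only by which species contain a member, whereas the analysis described above overlays families onto the species phylogeny. The species codes, category labels and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical short codes: four nematodes and two outgroups.
NEMATODES = frozenset({"Ce", "Mi", "Bm", "Ts"})
OUTGROUPS = frozenset({"Dm", "Sc"})


def classify_family(species_present):
    """Classify one protein family by the species that contain a member."""
    present = set(species_present)
    in_nematodes = present & NEMATODES
    in_outgroups = present & OUTGROUPS
    if in_nematodes and not in_outgroups:
        # Candidate birth within or at the base of the Nematoda.
        if in_nematodes == NEMATODES:
            return "nematode_only_all_four"
        return "nematode_only_subset"
    if "Dm" in in_outgroups and not in_nematodes:
        # Present in the arthropod outgroup, missing from every nematode:
        # candidate death shared by the four nematodes.
        return "absent_from_all_nematodes"
    return "other"


def tally_families(families):
    """Count families per category.

    families: dict mapping family_id -> iterable of species codes.
    """
    return Counter(classify_family(sp) for sp in families.values())
```

A presence/absence tally of this kind cannot distinguish a true loss from a gene missed by annotation, so the resulting counts are best read as upper bounds on candidate events.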

Figure 1: Protein and gene family changes associated with the origin and evolution of Nematoda.

(a) Protein family changes. At the branch of each lineage, the '+' number indicates family birth events, and the '−' number indicates family death events represented by all members indicated for that lineage. For example, there are 702 protein family births ancestral to the phylum Nematoda and 88 protein family deaths in common among the four nematodes in comparison to arthropods (represented by D. melanogaster). We reconstructed these events from 12,206 interspecific orthologous families (63,273 proteins). (b) Gene duplications and losses over the evolution of the common protein families. We reconstructed the gene duplication and loss events using 858 orthologous multi-member protein families (containing 8,260 proteins) conserved among all six species. At the branch of each lineage, the '+' number indicates the number of gene duplication events, and the '−' number indicates the number of gene loss events for that lineage.

Similarly, quantitative changes in protein family members (duplications and deletions) can reflect evolutionary determinants of lineage and species diversity. We evaluated 858 families (8,260 genes) common to the four nematode species and two outgroup species defined above (Fig. 1b); 674 families had no obvious duplications or deletions, 70 had only deletions, 105 had only duplications and 9 had both. Nematode species had higher numbers of events compared to D. melanogaster (Fig. 1b). Among the nematodes, M. incognita had the highest number of both duplications and deletions, likely due to 30% of the genome being duplicated, resulting in more species-specific events15. An example for T. spiralis involves the secreted DNase II–like protein family, a member of which has been evaluated as a vaccine candidate20 and which has been implicated in host-parasite interactions. The genome shows more extensive expansion of this family (an estimated 125 genes) than previously realized (Supplementary Note and Supplementary Fig. 4).

To provide additional examples, we compared protein families in C. elegans with sequence homologs in T. spiralis. Ten families were relatively expanded and five families were contracted in T. spiralis (P < 0.001) (Supplementary Table 5). These families can be grouped into (i) those present before the separation of nematodes and arthropods (nine families) and (ii) those putatively born coincident with this separation (six families) and possibly the origin of nematodes. The six protein families in this latter group included four that are relatively expanded in T. spiralis (ratios are given as Ce:Ts, where Ce is C. elegans and Ts is T. spiralis): a retrotransposon (2:201, Ce:Ts); a translation initiation factor 2C, putatively related to lipid metabolism (2:140, Ce:Ts); a zinc finger C2H2 type protein (1:14, Ce:Ts); and a hypothetical protein (1:44, Ce:Ts) associated with defective egg laying in C. elegans. Two protein families are relatively contracted in T. spiralis: a major sperm protein (33:1, Ce:Ts) and a protein of unknown function, DUF1647 (18:1, Ce:Ts).

Comparisons of orthologous protein families outlined under items (ii) and (iii) above facilitated assessment of a nematode genome (T. spiralis) from a basally positioned clade (clade 2) with those from highly divergent clades (clades 8, 9 and 12)21 and an outgroup member (D. melanogaster). Results consistently demonstrated similar and extensive levels of disparity in orthologous family sizes between T. spiralis and either C. elegans or D. melanogaster, whereas members of clades 8, 9 and 12 showed higher levels of shared attributes with C. elegans only (Fig. 2). Information in the next section provides independent measures, based on genome organization, to support these data, which previously were indicated by rRNA sequence comparisons21.

Figure 2: Comparison of orthologous protein families among nematodes that span the phylum.

Orthologous families comprising each of the three parasites together with D. melanogaster and C. elegans are plotted separately. The size of the dot represents the size of the orthologous family; the position represents the composition of the family based on the three represented species. With the assumption that evolutionarily close species have similar orthologous family size (fewer duplications and deletions), these plots illustrate that T. spiralis is equally distinct from both C. elegans and D. melanogaster, whereas the two other parasites share greater commonality with C. elegans. P values (derived using a χ² test in pairwise plot comparison) indicate a greater number of families present in C. elegans compared to D. melanogaster and show that statistically significantly (P < 1 × 10⁻⁵) fewer families are biased to C. elegans when T. spiralis is present in the orthologous family.

Next we evaluated the nematode genomes across the phylum regarding the extent of and limits to evolutionary changes and functional associations that may depend on gene arrangements. Comparisons between C. elegans and B. malayi (350 million years of separation) indicated that intra- rather than inter-chromosomal rearrangements preferentially characterize genome evolution evident between these species16. We used the T. spiralis genes organized on the six longest ultracontigs to extend this analysis. As for B. malayi, T. spiralis genes showed macrosyntenic relationships with predicted orthologs from C. elegans (P < 0.0001), albeit to a lesser extent (Fig. 3a). Because the X chromosome in T. spiralis is diploid only in females of this species (female 2n = 12 (XX) and male 2n = 11 (XO)), we also calculated the correlation coefficient when the X chromosome was excluded. This resulted in improved support for macrosynteny. This non-random distribution of orthologous genes is consistent with that observed in several nematode species22,23,24.

Figure 3: Genes from T. spiralis show macrosyntenic relationships with predicted orthologs from other nematodes.

(a) T. spiralis genes on the six largest ultracontigs with orthologs in C. elegans, colored to indicate the C. elegans chromosome on which the ortholog is located. The correlation was strong (R = 0.95, R = 0.76 and R = 0.99) and was even stronger when the X chromosome was excluded (R = 0.97, R = 0.97 and R = 0.99). For example, R = 0.95 indicates that genes from both T. spiralis ultracontigs 1 and 4 are strongly associated with one predominant C. elegans chromosome, chromosome 3, and this organization is not a result of random gene distribution. (b) Orthologous segments shared among nematode species shown on the C. elegans chromosomes. Red segments are considered to be ancestral orthologous segments among nematodes. The size of segments corresponds to the C. elegans orthologous segment that may be different than the orthologous segment in the other two species (Supplementary Table 7).

Assuming a constant tendency toward randomness, genome reassortment is expected to occur at a rate commensurate with evolutionary distance. Using syntenic blocks of C. elegans for standardization, we measured the dynamics of nematode chromosome reassortment among multiple nematode pairs25. We observed the highest syntenic conservation score between C. elegans and C. briggsae (0.752), a lower score between C. elegans and B. malayi (0.508) and the lowest between C. elegans and T. spiralis (0.28) (Supplementary Table 6). Because sequences for non–C. elegans genomes have varying levels of fragmentation, it was not possible to use entirely complementary gene sets in the pairwise comparisons (we did not consider orthologous genes on different scaffolds). Nevertheless, the relative syntenic conservation values were consistent with the perceived evolutionary distance of the species investigated. The approximately 72% of the T. spiralis genome organization that lacked demonstrable congruence with the C. elegans genome provided a tentative estimate of the limits of evolutionary diversity of this kind across the Nematoda.

Despite an anticipated tendency toward randomization, the existence of syntenic blocks suggests functional constraints to genome evolution. We investigated this possibility with a high-level orthology map created with coding exons as anchors26 from C. elegans, B. malayi and T. spiralis. We identified 196 orthologous segments (Supplementary Table 7); 155 of these segments were shared among C. elegans and B. malayi, 5 were shared among B. malayi and T. spiralis and 36 were shared among all three species, putatively defined as ancestral orthologous segments. No segments were shared exclusively between C. elegans and T. spiralis (Fig. 3b). These results are again consistent with the perceived evolutionary distance among these organisms based on all pairwise comparisons. The genes within the 36 ancestral segments accounted for 50% of the genes in all segments for C. elegans and B. malayi but accounted for 97% of the genes in T. spiralis. Over half of the ancestral segments are located on C. elegans chromosomes 3 and 4. These ancestral segments tended to localize more centrally in the chromosomes (P = 0.001)27. This tendency was also suggested by the two-species orthologous segments, although it was less evident there (different at P = 0.1). The overall patterns highlighted likely reflect basic properties that influence the evolution of genome organization in nematodes.

The nematode species from the lineages evaluated span recent and early radiation events within the phylum Nematoda. Hence, the quantitative and qualitative measures of genomic diversity will help to define both the extent and limits of genome organizational diversity across the Nematoda and help clarify molecular determinants of nematode lineages and species. Nevertheless, the results based on Markov clustering of predicted orthologous protein families will exclude other forms of diversity such as nucleotide substitutions, insertions and deletions. As such, the documented differences reflect but a small component of the total genomic diversity within the Nematoda.

Molecular determinants archetypical of the phylum Nematoda

We evaluated the molecular determinants for traits that characterize the archetypical nematode12,14. To identify proteins and protein sequences that are broadly conserved among the four nematodes that span the phylum, we further compared worm-derived proteins to those of arthropod and yeast outgroups. The 12,163 orthologous protein families were partitioned into (i) orthologous protein sequences that are broadly conserved among all of the four nematode species and any of the two outgroups (2,517 families, 14,801 nematode proteins); (ii) those conserved exclusively among the four nematodes (274 families, 1,990 nematode proteins); and (iii) those that are conserved between any nematode and any outgroup (4,980 families, 30,729 proteins) (Supplementary Table 3). We evaluated 328 protein families represented by a single copy gene in all six species by querying the C. elegans database for RNAi phenotypes. The exclusion of multi-member protein families from this evaluation precluded cases where compensation by other family members might obscure RNAi phenotypes. Of the 328 C. elegans genes, 232 (71%) had associated RNAi phenotypes (significant enrichment at P < 0.00001) consistent with a gene set essential to core cellular and biochemical functions of eukaryotes (Supplementary Table 8).
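Selecting families represented by a single-copy gene in every species, as done above before querying RNAi phenotypes, amounts to a simple filter over clustered families. The sketch below assumes each family is available as a list of (species, protein ID) pairs; the species codes, identifiers and function name are hypothetical.

```python
from collections import Counter

# Hypothetical codes for the four nematodes and the two outgroups.
SPECIES = ("Ce", "Mi", "Bm", "Ts", "Dm", "Sc")


def single_copy_families(families, species=SPECIES):
    """Return sorted IDs of families with exactly one member per species.

    families: dict mapping family_id -> list of (species_code, protein_id).
    A family qualifies only if every listed species contributes exactly one
    protein and no other species is present.
    """
    selected = []
    for family_id, members in families.items():
        counts = Counter(code for code, _ in members)
        if len(counts) == len(species) and all(
            counts[code] == 1 for code in species
        ):
            selected.append(family_id)
    return sorted(selected)
```

Passing only the four nematode codes as `species` would give the analogous filter for single-copy families restricted to nematodes, provided the input families have already been limited to those without outgroup members.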

Of the 2,517 nematode protein families (Fig. 4), 274 were detected in all four nematodes only (Supplementary Note), and we refer to these collectively as Nematode Orthologous Groups (NOGs) (Supplementary Table 9 and Supplementary Fig. 5). These NOGs were significantly enriched (P < 0.00001) for genes with RNAi phenotypes in C. elegans and likely represent a gene set essential to core cellular and biochemical functions of nematodes.

Figure 4: Distribution of orthologous families among the four nematode representatives spanning the phylum Nematoda.

The lineages represented in the Nematoda are Rhabditida (C. elegans), Tylenchina (M. incognita), Spirurina (B. malayi) and Dorylaimia (T. spiralis). The trophic ecology of each of the four nematode species used in this study for the pan-phylum analysis is indicated next to the species name. The 2,517 orthologous groups are conserved in all four nematodes. Sixty-four orthologous groups are conserved among the parasitic species but not in the free-living C. elegans. The enrichment of functional categories related to certain orthologous groups compared to the complete functional repertoire for the four nematode species is presented in Supplementary Tables 8 and 9.

The 274 NOGs encoded 189 multi-copy gene families and 85 single-copy gene families (scNOGs). Sixty-eight of the scNOGs had RNAi information and 21 had observable RNAi phenotypes (Table 1 and Supplementary Table 9). There was no enrichment of RNAi phenotypes in the C. elegans genes in scNOGs compared to all C. elegans genes (P < 0.05). Nevertheless, among the 21 genes with phenotypes, 8 had known tissue localization and only 1 was neuronal. Of the remaining 64 genes, 17 had known expression patterns, of which 10 were neuronal. Therefore, the biological importance of the scNOGs may be underestimated by RNAi information because nervous tissue is relatively insensitive to RNAi (for example, see ref. 28).

Table 1 Pan-phylum single-copy genes with the C. elegans ortholog having severe RNAi phenotype

Nematode-specific amino acid sequences in scNOG proteins may have practical importance for functional investigations. As such, we evaluated the scNOG sequences for molecular features by forced alignment with non-nematode homologs (human, chicken, frog and zebrafish) associated with the same Pfam entries. We categorized the scNOGs into two groups: (i) those involving nematode-specific insertions and deletions (InDels) (for example, see ref. 29) relative to non-nematode homologs (15 proteins) (Supplementary Fig. 6a) and (ii) those involving unique patterns of conservation independent of InDels (70 proteins) (Supplementary Figs. 6b and 7) (for example, see ref. 14). Sequence variation exclusive of conserved motifs was generally higher among the nematode proteins than among the vertebrate proteins, even though evolutionarily, each comparison spanned similar predicted lengths of time, consistent with a previous report30 (Supplementary Fig. 8). Therefore, pan-Nematoda–specific conservation has persisted despite the high evolutionary rate in adjacent sequences of these NOGs.

The nematode-specific amino acid sequences in NOGs may have fundamental importance across the Nematoda. For instance, the predicted subunit of an electron transfer complex (Supplementary Fig. 6a) has well-defined insertions, and a severe RNAi phenotype is associated with the C. elegans member of this NOG. As such, comparative information from the vertebrate homolog may guide experiments to dissect the functional roles of the NOG insertions. Furthermore, amino acid insertions in one protein interaction partner may be compensated by deletions in the other protein interaction partner. Indeed, we found that the interaction partner of the complex to which that protein belongs (long-chain acyl-CoA dehydrogenase, with which interaction has been confirmed experimentally31) has deletions in the non-nematode protein (Supplementary Note, Supplementary Figs. 9 and 10).

This series of analyses identified genes and proteins that may have fundamental importance in all nematode species. Two categories of nematode-specific sequences are responsible for delineation as scNOGs. Therefore, scNOGs, and most likely other NOGs, contain pan-phylum nematode-specific sequences incorporated either into universally conserved protein structures or into protein structures that are unique to the Nematoda. Evidence reflecting biological importance highlights the potential for NOGs to serve as targets for control of parasitic nematodes that infect humans, animals and plants while potentially limiting risk to the host.

Core- and phylogenetically-restricted functional categories

A question of central importance is whether or not parasitic nematodes (and potentially other parasites) have evolved independently or have preferentially retained common solutions to challenges of parasitism despite their exploitation of widely divergent trophic ecologies (for example, see ref. 32). Much interest in this context has focused on (i) secretory proteins, (ii) molecular functions and (iii) biochemical pathways that are conserved or taxonomically restricted.

Although not all secretory proteins from parasitic nematodes are involved in interactions with the host, constituents of this protein category are prime candidates for examining the host-pathogen interface. Here, we sought proteins that are broadly conserved among nematodes or among parasitic nematodes. We sorted these proteins into orthologous protein groups shared among species representing diverse parasite lineages and then subgrouped them into those with secretory peptides (Supplementary Fig. 11). We interrogated predicted secretory protein orthologs with previously identified secreted proteins using an orthogonal approach based on excretory-secretory products in T. spiralis and B. malayi identified by tandem mass spectrographic analysis33,34. We identified only two proteins as secretory and common to each parasite member (including vertebrate and plant parasites) but which were absent from the non-parasitic C. elegans: (i) a serine peptidase member of the prolyl oligopeptidase family that can be critical for invasion of the mammalian host cells by protozoan parasites35 and (ii) a cyanate hydratase that in other organisms hydrolyzes and detoxifies environmental cyanate36. Our results suggest that the number of conserved secretory proteins broadly involved in nematode interactions with hosts may be relatively few. Nevertheless, this number is likely to increase when reducing our analysis to subgroupings of parasitic nematodes, as we found when we interrogated proteomes for any two of the three parasitic species here.

Among the T. spiralis genes analyzed, 35% (5,456 out of 15,808) could be assigned one or more Gene Ontology (GO) terms. We assigned putative molecular functions to 90% of this 35%, biological processes to 68% and cellular components to 45%. The remaining two-thirds of the genes in T. spiralis represent uncharacterized and possibly new functions in the parasite. A set of 25 molecular functions was significantly enriched or depleted (at P < 0.01) when we compared intra- or inter-specific orthologous groups to the complete repertoire of GO terms for T. spiralis (Supplementary Table 10 and Supplementary Fig. 12). Among the orthologous families confined only to T. spiralis and C. elegans, rhodopsin-like receptor activity was enriched, a possible consequence of the number of genes involved in G-protein–coupled receptor protein signaling pathways. In orthologous groups with members only from T. spiralis and B. malayi, the enriched category involved steroid-binding proteins.

Among a total of 71 molecular GO categories identified, 42 were enriched and 29 were depleted in the 2,517 nematode orthologous families (including C. elegans) by comparison to the complete proteomes of the four nematode species (Supplementary Table 11). When considering the 64 orthologous groups conserved among the three parasitic nematodes, nine GO categories were statistically enriched or depleted; ATP binding was the only depleted category, whereas DNA- and RNA-binding, aspartic-type endopeptidase and prolyl oligopeptidase activities were among those enriched (Supplementary Table 12). Therefore, commonalities in molecular functions may exist even among parasites from widely diverse ecological niches. Further light will be shed on genetic associations among parasitic and non-parasitic nematodes as more robust comparisons among species from each category begin to surface.
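Enrichment or depletion of a GO category in a subset of genes relative to a background is commonly assessed with a hypergeometric test. The sketch below shows that calculation under the assumption that such a test is appropriate here; the specific test used for these results is described elsewhere, and the function names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k): probability of seeing at least k annotated genes.

    N: genes in the background, K: background genes with the GO term,
    n: genes in the subset, k: subset genes with the GO term.
    A small value suggests enrichment.
    """
    upper = min(n, K)
    favorable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favorable / comb(N, n)


def hypergeom_lower_tail(k, n, K, N):
    """P(X <= k): a small value suggests depletion of the GO term."""
    favorable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(0, min(k, n, K) + 1)
    )
    return favorable / comb(N, n)
```

When many GO categories are tested at once, as in the comparisons above, the resulting P values would normally be adjusted for multiple testing before applying a cutoff.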

Guided by the possibility that parasitic nematodes undergo reductive genome evolution because of reliance on the metabolic capacity and homeostatic buffering of their host, we compared T. spiralis genes encoding enzymes to similar genes from the other parasites and the non-parasitic C. elegans37,38 (Supplementary Fig. 13), also using the NemaCyc viewer (Supplementary Fig. 14). We found that the parasitic species had fewer Kyoto Encyclopedia of Genes and Genomes (KEGG) orthologies (KOs) associated with their genes (522–548) compared to C. elegans (704) (Table 2 and Supplementary Table 13). The number of genes correlated with the number of associated KOs. Therefore, we examined the KOs in relation to nematode lineages used in this study. Among the 785 KOs associated with the nematode species evaluated herein, 337 were shared among all four species (core nematode KOs, CNKs). The pathway category with the largest proportion of its KOs as CNKs was energy metabolism (53% of all KOs were conserved across all four species), and the category with the smallest proportion was metabolism of cofactors and vitamins (34% of the KOs were in all four species). Among the energy metabolism pathways, there were 96 KOs related to oxidative phosphorylation, 52 of which were conserved among all four nematodes. This result supports previous observations in which parasite enzymes involved in oxidative phosphorylation exhibited sequence divergence from similar host proteins. These differences were largely associated with nematode-specific insertions14,29. Despite the high level of conservation in individual pathways, the overall proportion of CNKs among all four nematodes was low (337 of 785 KOs, or 43%), suggesting that different adaptations distinguish nematodes with distinct modes of existence.
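The definition of core nematode KOs as those shared by all four species is a set intersection, and the per-pathway proportions follow by restricting to the KOs of one pathway. The sketch below illustrates this; the function names are hypothetical, and any identifiers passed in are assumed to be KO accessions already assigned to each species.

```python
def core_and_union(kos_by_species):
    """Return (core, union) KO sets.

    kos_by_species: dict mapping species name -> iterable of KO identifiers.
    core holds KOs found in every species; union holds KOs found in any.
    """
    sets = [set(kos) for kos in kos_by_species.values()]
    if not sets:
        return set(), set()
    return set.intersection(*sets), set().union(*sets)


def core_fraction(kos_by_species, pathway_kos=None):
    """Fraction of KOs shared by all species, optionally within one pathway.

    pathway_kos: optional iterable of KO identifiers that belong to the
    pathway of interest; when given, both sets are restricted to it.
    """
    core, union = core_and_union(kos_by_species)
    if pathway_kos is not None:
        pathway = set(pathway_kos)
        core &= pathway
        union &= pathway
    return len(core) / len(union) if union else 0.0
```

Calling `core_fraction` once per pathway category gives the kind of per-pathway comparison described above, with the denominator being all KOs of that pathway found in at least one of the species.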

Table 2 Genes and KEGG Orthologies (KOs) represented in metabolic pathways in four nematodes

Discussion

Here we present the genome sequence of T. spiralis, a member of Dorylaimia and a lineage that diverged early in the evolution of the phylum Nematoda. The draft sequence of T. spiralis covered over 90% of the estimated genome and expected genes. Coupled with genomes from nematode lineages depicting more recent episodes of divergence, the T. spiralis data provide new perspectives on genomic evolution that more broadly span the Nematoda.

The T. spiralis genome sequence and the accompanying genome-mining analysis address four key issues. First, the genomic diversity deduced among species outlines molecular determinants whose magnitude of change likely reflects elements that have figured decisively in both lineage and species evolution within the Nematoda (for example, see refs. 39,40,41). It has been argued that such drastic differences can be related to functional diversification, speciation and species adaptation. Given the modest number of nematode species with available genomes, we fully expect that a much greater resolution of differences will emerge as additional nematode genome sequences become available. Nonetheless, the results presented here helped resolve many specific genomic characteristics that can be further investigated in this context. Second, host characteristics may select for common parasite characteristics in otherwise widely disparate nematode species. The steroid-binding protein family common to two parasites of humans and other mammals, T. spiralis and B. malayi, was distinct from a large family of related nuclear hormone receptors in C. elegans, many of which are homologous to steroid-binding receptors in other organisms42. This distinction provides support for convergent enrichment of common steroid-binding receptors in these two parasites, possibly dictated by characteristics of the host environment, as previously suggested43. Third, the new databases guided discovery of genes and proteins that appear to have fundamental importance to all nematode species (archetypical characteristics). Accordingly, the NOGs were significantly enriched for genes with RNAi phenotypes in C. elegans. Success in circumscribing archetypical nematode characteristics from pan-phylum databases will serve to refocus research on characteristics that have the broadest application for controlling pathogens of humans, animals and plants. Fourth, these results provide a valuable resource to investigate the biology of the intracellular pathogen T. spiralis. One example involves a DNase II gene family of T. spiralis, which includes secreted proteins previously implicated in host-parasite interactions and immune control20. The curious expansion and diversification of this family in comparison to other nematodes can now be related to unique characteristics of T. spiralis and possibly the lineages it represents. A second example centers on why species within this genus have separated into those that generate protective capsules and those that do not, a characteristic that is not host related. There are innumerable anticipated applications of the genome data toward elucidating the biology of this parasite and developing methods for its immune control and treatment. The comparative value of this genome sequence will extend these applications well beyond this species and phylum.

URLs.

RepeatMasker, http://repeatmasker.org/; RNAmmer, http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?rnammer; Rfam database, http://selab.janelia.org/software.html; BER, http://ber.sourceforge.net/; PHYLIP, http://evolution.genetics.washington.edu/phylip.html.

Methods

Sequencing, assembly and annotation.

Rats were infected orally with muscle tissue containing first-stage larvae (ML) of T. spiralis strain ISS 195. Infections were allowed to proceed for a minimum of 30 days, after which the muscle tissue was digested and the parasites were collected. Genomic DNA was extracted from muscle larvae of T. spiralis using standard protocols. Whole genome shotgun, BAC and EST libraries were generated3,6. The assembly was performed using the PCAP package44. The physical map for T. spiralis was constructed using 26,784 clones (Supplementary Note).

The repeats were masked using RECON45 and RepeatMasker (see URLs). Then the ribosomal RNA genes were identified using RNAmmer (see URLs). Transfer RNA genes were identified with tRNAscan-SE46. Noncoding RNAs were identified by sequence homology searches of the Rfam database (see URLs). Protein-coding genes were predicted using a combination of ab initio programs47 and FgenesH (Softberry, Corp) and the evidence-based program EAnnot48. A consensus gene set from the above prediction algorithms was generated using a logical, hierarchical approach. Gene product naming was determined by BER (see URLs). The signal peptide for secretion and the trans-membrane-domain–containing proteins were identified using Phobius49.

Protein families and genome evolution.

OrthoMCL19 was used to predict orthologous groups of proteins. Phylogenetic trees were built for protein families with one member from each of the six species using PHYLIP (version 3.69; see URLs) after aligning the family members with MUSCLE (version 3.7; ref. 50). The consensus of these trees was used as the phylogeny of the species. The birth and death of each protein family, overlaid on the species phylogeny, were reconstructed using PHYLIP-DOLLOP by treating each protein family as a character. Gene duplication and deletion events of the families having members from each of the six species were reconstructed using URec51, and a neighbor-joining tree of each family was generated using PHYLIP-NEIGHBOR.
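Treating each protein family as a character amounts to encoding family membership as a binary presence/absence matrix. The sketch below writes such a matrix in the standard PHYLIP discrete-character layout (a header with the number of species and characters, then one row per species with a 10-character name field followed by 0/1 states). The function name, species labels and family identifiers are hypothetical, and this is only one plausible way to prepare such an input, not the authors' pipeline.

```python
from typing import Dict, Set


def phylip_presence_matrix(families_by_species: Dict[str, Set[str]]) -> str:
    """Encode family presence/absence as a PHYLIP discrete-character matrix.

    Each protein family becomes one binary character: '1' if the species
    has at least one member of the family, '0' otherwise. Characters are
    ordered by sorted family identifier.
    """
    families = sorted(set().union(*families_by_species.values()))
    lines = [f"{len(families_by_species):5d}{len(families):5d}"]
    for species, present in families_by_species.items():
        # PHYLIP expects the species name in a fixed 10-character field.
        name = species[:10].ljust(10)
        states = "".join("1" if fam in present else "0" for fam in families)
        lines.append(name + states)
    return "\n".join(lines) + "\n"


# Toy input: placeholder family identifiers, for illustration only.
toy_matrix = phylip_presence_matrix(
    {
        "Tspiralis": {"fam1", "fam3"},
        "Celegans": {"fam1", "fam2"},
    }
)
```

Names longer than 10 characters are truncated here, so labels should be chosen to stay unique within the first 10 characters.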

The dynamics of nematode chromosome reassortment among multiple nematode pairs were measured using OrthoCluster25, with syntenic blocks of C. elegans used for standardization. For the identification of the ancestral orthologous regions, we used exons that are orthologous among species as map 'anchors'52 (Supplementary Note).

Nematode-specific molecular features.

A profile was built for each of the 85 scNOGs using HMMBUILD53. The profiles were calibrated using hmmcalibrate, and each profile was used to search the Pfam database (release 23.0). Hits better than 0.1 were considered. The selected non-nematode species were separated by evolutionary distances similar to that between C. elegans and T. spiralis: human, chicken, zebrafish and frog. After identification of the non-nematode families that were associated with the same Pfam as the scNOGs, the multi-FASTA files were aligned using MUSCLE. These alignments were used to build a distance matrix using PHYLIP-PROTDIST. RNAi source data were obtained from WormMart in WormBase release 180. The core nematode groups were screened against nematode (1.1 million ESTs and/or Roche/454 cDNAs) and arthropod (5.3 million ESTs) transcript data; sequence homology meeting cut-offs of 35 bits and 55% identity was accepted as significant.
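The transcript screen described above reduces to a two-threshold filter on alignment hits. A minimal sketch follows, assuming hits have already been parsed into records carrying a bit score and a percent identity; the `Hit` record, its field names and the inclusive (>=) comparison are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List

MIN_BITS = 35.0      # bit-score cut-off stated in the text
MIN_IDENTITY = 55.0  # percent-identity cut-off stated in the text


@dataclass(frozen=True)
class Hit:
    """One alignment hit (hypothetical record layout)."""
    query: str
    subject: str
    bit_score: float
    percent_identity: float


def significant_hits(
    hits: Iterable[Hit],
    min_bits: float = MIN_BITS,
    min_identity: float = MIN_IDENTITY,
) -> List[Hit]:
    """Keep hits that meet both cut-offs (inclusive comparison assumed)."""
    return [
        h for h in hits
        if h.bit_score >= min_bits and h.percent_identity >= min_identity
    ]


# Toy hits with placeholder identifiers, for illustration only.
toy_hits = [
    Hit("group_1", "est_a", 80.0, 72.0),   # passes both cut-offs
    Hit("group_1", "est_b", 30.0, 90.0),   # fails the bit-score cut-off
    Hit("group_2", "est_c", 120.0, 40.0),  # fails the identity cut-off
]
kept = significant_hits(toy_hits)
```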

Structural annotation and comparison of interaction partners.

The three-dimensional structure was modeled using the Rosetta3.0 software suite54,55,56. A total of 40,000 decoys were generated using the full-atom scoring method57 for each sequence. Several of the decoys with a small radius of gyration and low all-atom energy (that is, the bottom of the energy well) were compared using TM-align58 and MAMMOTH59. The position of the insertions was mapped onto the models generated. The secondary structure predictions calculated for the Rosetta ab initio program were added to the sequence alignment generated by MUSCLE50. The functional importance of the insertions in the electron transfer complex was further dissected by comparing interacting proteins. Two protein-protein interaction databases, IntAct60 and MINT61, were used to see if this protein or its orthologs were involved in a protein-protein interaction.

Functional associations and taxonomic restrictions.

Default parameters for InterProScan (v16.1) were used to search against the InterPro database62, and Gene Ontology (GO63) annotations were obtained with no additional curation (IEA associations only). These annotations have been displayed graphically by AmiGO and can be accessed at Nematode.net37. The significant enrichment of GO terms was computed based on the hypergeometric distribution using FUNC64 (including false discovery rate, FDR). A probability refinement was done to remove the GO terms identified as significant due to their children terms. We used the FDR computed by FUNC to reduce false discovery. Therefore, unless specified otherwise, the GO term enrichment was selected based on both P value < 0.05 (after refinement) and FDR < 0.1.
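The selection rule above (hypergeometric P value < 0.05 together with FDR < 0.1) can be sketched in a few lines. This is an illustration under stated assumptions rather than a reimplementation of FUNC: the Benjamini-Hochberg adjustment is swapped in as a stand-in for the FDR that FUNC computes, the refinement step for child terms is omitted, and all function names and counts are hypothetical.

```python
from math import comb
from typing import Dict, List, Tuple


def hypergeom_upper_tail(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P values (stand-in for FUNC's FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enriched_terms(
    counts: Dict[str, Tuple[int, int]],
    N: int,
    n: int,
    p_cut: float = 0.05,
    fdr_cut: float = 0.1,
) -> List[str]:
    """Select terms passing both cut-offs.

    counts maps term -> (k, K): k annotated genes in the subset of size n,
    K annotated genes in the background of size N.
    """
    terms = list(counts)
    pvals = [hypergeom_upper_tail(counts[t][0], N, counts[t][1], n) for t in terms]
    fdrs = benjamini_hochberg(pvals)
    return [t for t, p, q in zip(terms, pvals, fdrs) if p < p_cut and q < fdr_cut]
```

A test for depletion would use the lower tail, P(X <= k), in the same way.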

The gene products were associated with specific biochemical pathways using the KEGG pathway mappings65. WU-BLAST matches of the genes against KEGG database version 46.0 were used for pathway mapping with an E-value filter of 1 × 10^−10. Graphical presentation of the pathway associations was done using NemaPath38. The C. elegans NemaCyc viewer is based on mapping a BLASTP alignment of the KEGG genes database against the predicted T. spiralis genes. Scores stronger than 1 × 10^−10 were considered.
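The pathway-mapping step can be pictured as assigning each gene the KO of its best database match, provided the match passes the E-value filter. The sketch below assumes pre-parsed (gene, KO, E-value) tuples and a best-hit rule; both the input layout and the best-hit rule are assumptions for illustration, and the identifiers are placeholders.

```python
from typing import Dict, Iterable, Tuple

EVALUE_CUTOFF = 1e-10  # E-value filter stated in the text


def assign_kos(
    hits: Iterable[Tuple[str, str, float]],
    cutoff: float = EVALUE_CUTOFF,
) -> Dict[str, str]:
    """Map each gene to the KO of its lowest-E-value hit within the cutoff.

    hits yields (gene_id, ko_id, evalue) tuples. Genes with no hit at or
    below the cutoff are left unassigned.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for gene, ko, evalue in hits:
        if evalue > cutoff:
            continue  # too weak to support a pathway assignment
        if gene not in best or evalue < best[gene][1]:
            best[gene] = (ko, evalue)
    return {gene: ko for gene, (ko, _) in best.items()}


# Toy hits with placeholder identifiers, for illustration only.
toy_assignments = assign_kos(
    [
        ("gene_1", "K00001", 1e-50),
        ("gene_1", "K00002", 1e-20),  # weaker hit for the same gene
        ("gene_2", "K00003", 1e-5),   # fails the E-value filter
        ("gene_3", "K00004", 1e-10),  # exactly at the cutoff, kept
    ]
)
```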

Accession codes.

The T. spiralis Whole Genome Shotgun project (project id 12603) has been deposited at DNA Data Bank of Japan, EMBL and GenBank under the accession ABIR00000000. The version described in this paper is the second version (contigs, ABIR02000001–ABIR02009267; scaffolds, GL622784–GL629646; proteins, EFV46182–EFV62561).

Author contributions

M.M., D.P.J., J.P.M., D.S.Z., E.R.M. and R.K.W. initiated the project; J.A. and D.S.Z. provided all the worms for the shotgun and D.P.J. for the cDNA sequencing; L.F. and R.S.F. directed sequencing and sequence improvement; S.-P.Y., P.M. and W.C.W. assembled the genome and evaluated the assembly; V.B., X.Z. and K.H.-P. directed annotation; M.M., Z.W., S.A., J.M., Y.Y. and C.M.T. contributed to most of the specific analysis presented in this manuscript; M.M., D.P.J., D.S.Z. and S.W.C. directed the project and assembled the manuscript.