Phylogenomic and biogeographic reconstruction of the Trichinella complex

Trichinellosis is a globally important food-borne parasitic disease of humans caused by roundworms of the Trichinella complex. Extensive biological diversity is reflected in substantial ecological and genetic variability within and among Trichinella taxa, and major controversy surrounds the systematics of this complex. Here we report the sequencing and assembly of 16 draft genomes representing all 12 recognized Trichinella species and genotypes, define protein-coding gene sets and assess genetic differences among these taxa. Using thousands of shared single-copy orthologous gene sequences, we fully reconstruct, for the first time, a phylogeny and biogeography for the Trichinella complex, and show that encapsulated and non-encapsulated Trichinella taxa diverged from their most recent common ancestor ∼21 million years ago (mya), with taxon diversifications commencing ∼10−7 mya.

P arasitic diseases cause substantial morbidity and mortality in billions of animals and humans worldwide, and also major losses to the global food production annually. Parasitic roundworms (nematodes) of the genus Trichinella cause disease (trichinellosis) in humans, which can lead to serious morbidity and mortality 1,2 . Although trichinellosis is endemic in many regions of the world, the predominant impact of human disease relates principally to acute outbreaks following consumption of infected, raw meat products 2 , with examples in Argentina, China, Laos, Papua New Guinea, Romania and Vietnam [3][4][5][6][7][8][9] .
The biological diversity of Trichinella is of major evolutionary significance and reflects substantial genetic diversity, divergent ecology and host-parasite affiliations 11,14,15 . Despite some advances, there have been significant knowledge gaps in the biogeography and phylogeny of the Trichinella complex, as most previous studies represent analyses of small-scale molecular data sets, limiting interpretation and conclusions [1][2][3][4][5][6][7] . Progress in genomic, transcriptomic and bioinformatic technologies [16][17][18] now provides unique opportunities to rapidly overcome such limitations and to enable research on Trichinella and trichinellosis. Although the nuclear genome of T. spiralis 19 and the mitochondrial genomes of most Trichinella taxa have been characterized 20,21 , there have been no global, nuclear genomic or transcriptomic data sets for the 11 other recognized taxa of Trichinella. In the present study, we sequenced and assembled draft genomes and transcriptomes for all currently recognized Trichinella species and genotypes and defined protein-coding gene sets. Using these data sets, we then reconstructed the phylogeny and biogeography of Trichinella, as a basis to address some pertinent and outstanding questions regarding the evolution and biology of Trichinella.

Results
Genomes and transcriptomes. We sequenced and assembled 16 draft genomes and 15 transcriptomes from all 12 currently recognized Trichinella taxa, including five distinct geographic isolates of T. pseudospiralis (Tables 1 and 2 and Supplementary  Tables 1-4). The draft assemblies of these genomes (available publicly; see 'Accession codes' section) ranged from 46.1 to 51.5 Mb (mean: 49.0 Mb), with average scaffold N50s of 106-294 kb (mean: 196 kb) and GC contents of 32.5-33.7% (33.2%; Tables 1 and 2). For all assemblies, 96.4% (95.6-97.2%) of all 248 core essential genes were identified (Tables 1 and 2), indicating that the gene sets represent substantial proportions of individual genomes. The repeat contents of individual draft genomes were estimated at 6.7-21.8% (mean: 17.7%) of their total nucleotide compositions (Supplementary Tables 5 and 6). On average, the assemblies contain 2.4% (range: 1.1-3.7%) retrotransposons, 2.9% (0.7-5.5%) DNA transposons, 7.7% (0.04-10.6%) unclassified dispersed elements and 4.8% (4.1-5.8%) simple repeats (Supplementary Tables 5 and 6). The repeat content of the genome of genotype T9 (6.7%) is exceptionally low compared with those of all other Trichinella taxa (mean: 17.7%). A comparison of the present draft genome for T. spiralis (ISS3) with that published previously for same species (ISS195) 19 revealed essentially the same percentage of core essential genes (96%), and repeat and GC contents, but a size difference (14 Mb), likely to relate to differing sequencing and assembly methodologies or a genetic distinctiveness between the two geographic isolates.
Phylogeny. We reconstructed the phylogeny of all 12 currently recognized Trichinella taxa (cf. Tables 1 and 2). First, we conducted an analysis of a complete protein sequence data set representing all 1,284 SCOs shared by all taxa, including five distinct geographic isolates of T. pseudospiralis, and two outgroup species (that is, Trichuris suis and A. suum) using Bayesian inference, maximum likelihood and maximum parsimony to establish the phylogenetic relationships of all encapsulated and non-encapsulated Trichinella taxa. Here all three trees constructed had the same topology, with strong support values (1.0; 87-100%) at all nodes, irrespective of the algorithm used (Fig. 1a). Second, employing the same algorithms, we independently assessed the relationships of all 16 Trichinella gene sets (without an outgroup) using protein sequences encoded by 2,855 SCOs shared by all taxa. Again, all three resultant trees had the same topology, consistently achieving very strong nodal support (1.0; 94-100%), with the exception of the pair T. spiralis þ T. nelsoni in the maximum parsimony analysis (support: 64%). Third, to address this discrepancy, we studied separately the relationships of all encapsulated taxa using protein sequences encoded by 4,090 SCOs, and achieved strong support at all nodes for the three algorithms (1.0; 99-100%). Then, to assess the root relationship of sister species T. spiralis þ T. nelsoni to nonencapsulated clades, we employed the same approach to resolve the relationships of three encapsulated (T. spiralis, T. nelsoni and T. patagoniensis) and one non-encapsulated taxon (T. papuae) using protein sequences encoded by 4,300 SCOs, and achieved the same outcome (that is, absolute support at all nodes). The final consensus tree constructed (Fig. 1a) represents the results from the first and second analyses (for all three algorithms). As expected, the dendrogram constructed from syntenic correlation values (Supplementary Data 2) was consistent in topology with this consensus tree and supported the encapsulated and non-encapsulated clades (Fig. 1a).
For the encapsulated taxa, the consensus tree shows that a lineage with T. spiralis þ T. nelsoni is the sister to other taxa, which hierarchically include T. patagoniensis, and pairs of sister species (T. nativa þ T6, T. murrelli þ T9 and T. britovi þ T8; Fig. 1a). These relationships differ from those inferred in previous studies using few ribosomal or mitochondrial DNA sequences 10,14,15 , although some previously recognized sister species associations are concordant, and T. spiralis þ T. nelsoni  Description  T1  ISS3   T2  ISS10   T3  ISS120   T5  ISS417   T6  ISS34   T7  ISS37   T8  ISS272   T9  ISS409   T12 and Trichinella genotypes T6, T8 and T9. and T. patagoniensis are relatively basal. At the time of initial phylogenetic assessment of Trichinella 14 , sequence data were not available for all five T. pseudospiralis isolates (T4.1-T4.5) or T. patagoniensis. Here, for non-encapsulated taxa, the pair T. papuae þ T. zimbabwensis is basal to all representatives of T. pseudospiralis investigated (that is, T4.1-T4.5); T4.4 from North America is basal to T4.5 from Tasmania and to all three isolates from Eurasia (T4.1-T4.3; Fig. 1a). Using the amino-acid sequences encoded by shared SCOs (n ¼ 2,855), tree topology and nodal support values were similar to those achieved using wholemitochondrial data sets 21 , although these values were consistently high compared with previous analyses using small gene sets 14 Here, for non-encapsulated taxa, the pair T. papuae þ T. zimbabwensis is basal to all representatives of T. pseudospiralis investigated (that is, T4.1-T4.5); T4.4 from North America is basal to T4.5 from Tasmania and to all three isolates from Eurasia (T4.1-T4.3; Fig. 1a). Therefore, we have now been able to establish, for the first time, the phylogenetic relationships of all currently recognized Trichinella taxa, including five T. pseudospiralis representatives (that is, T4.1-T4.5) and the more recently discovered T. patagoniensis.
Divergence and biogeography. Divergence time analysis based on the nematode diversification estimate of 532-382 million years ago (mya) 33 (1,000 SCOs) implied that Trichinella and Trichuris suis had a most recent common ancestor (MRCA) B281 mya (95% credible interval: 384-204), and that the encapsulated and non-encapsulated Trichinella taxa shared an MRCA B21 (28-15) mya (Fig. 1a), coinciding with the transition from Oligocene to Miocene 34 . Relative to a deep age for the Trichinella lineage, origin of a specific adaptation for encapsulation, associated with the radiation of nine taxa, occurred late in the evolutionary history of these nematodes. Subsequent diversification leading to extant species or species groups within the encapsulated and nonencapsulated clades is temporally circumscribed in the upper and uppermost Miocene during the Tortonian and Messinian periods 35,36 , commencing B7 (9-5) and 10 (14-7) mya ( Fig. 1a) and continuing into the Pliocene and Pleistocene 34 . Despite major methodological differences, our estimates are very consistent with those of Zarlenga et al. 14 in postulating a geographic distribution for the MRCA of all Trichinella taxa in Eurasia. Further concordance is seen in estimates for initial divergence of respective clades for encapsulated and nonencapsulated forms during the mid-Miocene, coincidental with perturbations in temperate ecosystems before diversification of extant species (Fig. 1a,c); this scenario might relate to an early Miocene glaciation 37 , with low sea levels 38 allowing a regional interchange of faunas linking Africa, Eurasia and North America 39 .
The diversification of non-encapsulated T. pseudospiralis from the common ancestor of T. papuae and T. zimbabwensis (in poikilotherms; Fig. 1a,c) coincides with the Tibetan plateau uplift and climate change around 10-8 mya 40 and the divergence between T. papuae and T. zimbabwensis (4.9-2.3 mya), possibly contemporaneously with climate change and the Plio-Pleistocene extinction of some crocodylomorph reptiles 41 . An avian host might explain the occurrence of T. pseudospiralis south of the Tibetan plateau.
Radiation of encapsulated Trichinella involves Eurasia, Africa, North America and South America. The occurrence of T. nelsoni (7.8-4.1 mya) and T. britovi þ T8 (3.2-1.7 mya) on the African continent (Fig. 1c) follows temporally discrete and independent expansion events around 7.5 mya (Miocene), 4.5-4 and 3.5 mya (Pliocene) and 2.0 mya (Pleistocene) 42 . A separation of T. nelsoni þ T. spiralis near the Miocene-Pliocene boundary further establishes the basis for an independent association with hominins and humans based on ecology and later domestication of primary suid hosts for the latter species 43 . We hypothesize that the diversification of T. britovi and T8 took place in Africa (Fig. 1c) as a consequence of biogeographic barriers and changing environmental conditions, which is in accord with a previous suggestion by Pozio et al. 44 Alternatively, the isolation and divergence of an ancestral population across Eurasia and Africa, leading to T. britovi and T8, with later secondary expansion, would account for the extensive geographic range of the former species 10,14 . We suggest that the loss of forests, formation of grasslands and the food scarcity for herbivores during the Miocene 45,46 across Eurasia and Africa resulted in a massive expansion of an entirely new guild of predators/hunters that were able to follow scarce prey over vast distances in the later Miocene, Pliocene and Pleistocene 47 . Encapsulated Trichinella taxa initially expanded into North America across Beringia during a time frame deeper than 5 mya, when the landmass was a permanent geographic feature linking Eurasia and the Nearctic, and before the inception of Northern Hemisphere glaciations 48 . The distribution of T. patagoniensis is consistent with the initial emergence of the Panamanian Isthmus (B10 mya), as recently established [49][50][51] . New chronologies are also consistent with the complex nature of faunal assembly in the Neotropical region, providing a mechanism for geographic colonization by small cats, followed by extensive radiation in South America after 8 mya, before the Great American Interchange 52 . Interestingly, there is no current evidence for geographic colonization of South America after 3 mya by nearctic species of Trichinella (for example, T. murrelli or T6), coincidental with large felids, procyonids, canids or mustelids (Fig. 1c). Inception of Northern Hemisphere glaciation cycles and periodic emergence of the Bering land bridge after 2.5 and 2.0 mya 53 led to independent episodes of geographic colonization and host-switching, driving patterns of isolation and genetic divergence (radiation) of Trichinella, linking Eurasia and the Nearctic 48 . Diversification of Eurasian/Nearctic sister species, including T. nativa þ T6, T. murrelli þ T9 and T. pseudospiralis in Northern America, reflects an intricate history in response to climate variation and habitat change, facilitating independent events of biotic expansion among carnivoran and other mammalian assemblages, including ursids, canids and mustelids primarily from Eurasia into North America (Fig. 1c). In addition, we hypothesize that T9 diverged from a common ancestor with T. murrelli before geographic colonization of the New World, or following isolation across Bering Strait during a glacial maximum (Fig. 1c). In the absence of a fossil record for Trichinella, it would be useful to determine the sequence of at least one Trichinella taxon from an extinct, infected vertebrate (for example, carnivoran) as a reference in time to assist future studies. The chronological and spatial history of Trichinella is described by episodic or cyclical patterns of independent geographic and host colonization on global and regional scales, the extensive development of mosaic faunal assemblages consistent with an integrated history for taxon pulses and ecological fitting mediated by climate and habitat variation over the late Tertiary 48,54,55 .
Parasite-host adaptation. The significantly higher GC content in the genomes and coding regions of encapsulated compared with non-encapsulated taxa (Kolmogorov-Smirnov (KS) test: P values ¼ 3.801 Â 10 À 4 and 7.466 Â 10 À 4 , respectively; effect size ¼ 1) suggests that these two Trichinella clades may have adapted differently to varying environmental stresses (for example, temperature), host immune attacks and/or body temperatures (reptiles versus birds and mammals). We also propose that the intracellular lifestyle and associated bottleneck have resulted, throughout evolution, in a considerable reduction of genome size (49 Mb) and lower GC content (33%) in both encapsulated and non-encapsulated Trichinella taxa compared with extracellular relatives such as Trichuris suis (78.5 Mb; 44%) 22 , which accords with observations for selected pathogens 56,57 .
Considering the morphological differences between encapsulated and non-encapsulated Trichinella taxa in host muscle cells 12 , we focused on exploring ES molecules likely to be associated with parasite-host interactions or host cell modification. Although more than half (52.9%) of genes encoding putative ES proteins could not be annotated (Supplementary Data 1), we identified seven SCOs (groups (GRPs) 2,426, 4,153, 2,136, 3,720, 2,832, 3,607 and 2,897) encoding ES orphans whose gene transcription is significantly upregulated exclusively in encapsulated taxa, and which are hypothesized to modulate host immune responses or immune evasion. By contrast, we identified an SCO (GRP 2254) encoding a cathelidicin-like molecule whose transcription is upregulated only in all non-encapsulated taxa, and might suppress antigen processing and/or presentation, a proposal supported by some evidence for an orthologue (Fh-HDM-1) in the parasitic flatworm Fasciola hepatica 58 . In both encapsulated and non-encapsulated taxa, we identified an orphan protein (GRP 838; GenBank accession: BG520575) that possesses a structural motif that is consistent with mitochondrial cytochrome c (Supplementary Data 3) and/or the mitochondrial dynamics protein of 51 kDa (MID-51) 59  Subsequently, we identified 1,042 orthologous gene groups (GRPs) exclusively in genomes of the encapsulated taxa, and 747 were exclusive to non-encapsulated taxa (Fig. 1a). We explored the distinctiveness between all encapsulated and non-encapsulated taxa, for which RNA sequencing (RNA-seq) data were available by comparing their transcription profiles at the L1 stage for all SCOs (n ¼ 2,855; Supplementary Tables 3 and 4). This analysis identified SCOs that were uniquely transcribed in encapsulated (n ¼ 47) and non-encapsulated (n ¼ 68) taxa (Supplementary Data 3). Interestingly, the average GC content of all of these 115 SCOs is significantly lower (KS test: P valueo3.059 Â 10 À 7 ; effect size ¼ 1) than that of all genes of all Trichinella taxa. In addition, the significantly lower AT content of differentially transcribed genes (KS test: P value ¼ 4.076 Â 10 À 2 ; effect size ¼ 0.67) in non-encapsulated taxa suggests a distinct adaptation of genes to the muscle cell compared with encapsulated taxa.
Interestingly, we found expansions in two orthologous groups in non-encapsulated taxa. The first (GRP 149) represents a serine protease precursor with two trypsin-like domains (TsSerP and AF331156) 65 , which is expanded in all non-encapsulated taxa (Supplementary Data 3) and appears to play a vital role in larval feeding and/or moulting 65 . The second group represents a multicystatin-like domain protein (MCD-1) 32 (GRP 482) encoded by six gene copies in T. papuae, and two copies in T. pseudospiralis (T4.5) from Australia and T. zimbabwensis compared with one copy in all other Trichinella taxa. This protein contains three repeated domains, each with similarity to family 2 cystatins 66 , but lacking critical consensus sites for cysteine protease inhibition. In this respect, the domain organization is similar to that of mammalian kininogens and a known six-domain cystatin from F. hepatica 67 . We propose that MCD-1 might function as a cytokine antagonist by binding to transforming growth factors (TGF)-b1 and -b2 in a manner similar to fetuin 68 , a proposal that is consistent with previous findings for T. spiralis and evidence of low-level expression of TGF-b in the epithelium of the jejunum of infected mice 69 . Certainly, family 2 cystatins secreted by other parasitic nematodes have known roles in immune evasion, including the inhibition of proteases involved in antigen presentation and modulation of cytokine responses 70 . Although MCD-1 is unlikely to function as a typical cystatin, it is expressed at the parasite-host interface 32 and might modulate host immune responses or enable immune evasion. An expanded set of MCD-1s might allow some non-encapsulated Trichinella taxa to suppress immune responses better than encapsulated taxa, possibly facilitating dissemination into a wider range of vertebrate hosts 12 .

Discussion
Using Illumina-based sequencing and bioinformatics, we assembled draft nuclear genomes representing all 12 recognized taxa of Trichinella, including five geographic isolates of T. pseudospiralis. Using extensive amino-acid sequence data sets derived from all shared SCOs from these nuclear genomes and/or outgroup taxa, we were able to reconstruct the phylogeny of Trichinella. In a previous study, Zarlenga et al. 14 had proposed the phylogenetic relationships of Trichinella species and genotypes using small DNA sequence data sets. At the time, these authors provided comprehensive interpretations of the findings and concluded that post-Miocene expansion, colonization and host-switching drove speciation among extant members of the genus Trichinella; although their study was very informative and interesting, the resolution of some clades and the positions of some of the taxa, such as T. nativa and T6 as well as T. murrelli and T9, did not always appear to be well supported statistically by the data presented, when outgroups (mermithids and Trichuris spp.) were included in the analyses. The reason for the limited resolution of some relationships was likely because of the use of sequence data set representing only selected genetic loci (nuclear small subunit and second internal transcribed spacer; mitochondrial large subunit and cytochrome c oxidase subunit 1) 14 . In addition, it appears that some of the species selected as outgroups might have been too distant to achieve a robust phylogeny using all of the concatenated sequence data in the analyses.
Here we utilized an extensive amino-acid sequence data set representing all SCOs (concatenated alignment over 597,495 amino-acid positions) originating from all 16 genomes representing all members of the Trichinella complex, including all known geographic isolates of T. pseudospiralis and T. patagoniensis, not previously available 14,15 , as well as Trichuris suis and A. suum (outgroups) in the analyses. By contrast, in a previous study 21 we could not use Trichuris or Ascaris as outgroups because of excessive sequence divergence in predicted mitochondrial proteins (for example, NAD4 and NAD6) encoded by some genes, such that an unambiguous alignment of homologous characters was impossible; these outgroups therefore had to be excluded from the analyses of mitochondrial data sets, as their inclusion would have led to erroneous results and interpretation. In the present study, the definition of 1,284-2,855 SCOs among all Trichinella taxa (with or without outgroups) allowed, employing an iterative, stepwise approach, the reconstruction of a robust phylogeny utilizing three independent tree-building methods. In accordance with previous investigations 14,21,68,69 , the present analyses showed that the encapsulated taxa grouped separately from non-encapsulated taxa. Importantly, we were also able to resolve, for the first time, the phylogenetic positions of T. spiralis þ T. nelsoni and T. murrelli þ T9, five representatives of T. pseudospiralis and T. patagoniensis in relation to all other taxa, with strong (99-100%) statistical support. The diversification times of Trichinella matched well with known historical global events and allowed the biogeography of all taxa to be reconstructed. This biogeography clearly supports the notion that encapsulated and non-encapsulated taxa frequently occur in sympatry and in the context of faunal assembly, driven strongly by events of geographic and host colonization, involving complex spatial and chronological mosaics linking evolutionary and ecological time 54 . Nonetheless, in the future, extensive sampling of Trichinella taxa from around the world will be required to explore, in depth, population genetic structures to reassess the present phylogenic and biogeographic reconstruction, and to attempt to identify the host origin of ancestral forms of T. spiralis, the principal causative agent of human trichinellosis. In conclusion, although the present study focused sharply on fundamental phylogenetic and biogeographic aspects, the genomic and transcriptomic resources created here will substantially accelerate many fundamental and applied areas of Trichinella/trichinellosis research. In addition, the genome-wide approach that we employed should have applicability to a wide range of parasites and other metazoan organisms.  Tables 1 and 2). These samples were individually produced in female CD1 mice. L1s were recovered from host muscles by pepsin (1%)-HCl (1%) digestion at 40°C for 30 min, sedimented, washed extensively in physiological saline and suspended in 90% ethanol for storage 71 . Each sample, which represented a packed volume of 50 ml of L1s, was snap-frozen in liquid nitrogen and kept at À 80°C until nucleic acid isolation.

Methods
RNA-seq and transcriptome assembly. Total RNA was isolated separately from L1s of each isolate of Trichinella employing the TriPure isolation reagent (Roche Molecular Biochemicals). Packed volumes of 20-50 ml were used, equating to thousands of larvae. RNA yields were estimated spectrophotometrically (NanoDrop 1000), and the integrity of RNA was verified using the BioAnalyzer. RNA-seq was conducted using an established method 72  Other methods. Genome sequencing and assembly, prediction of repetitive elements, prediction of protein-encoding genes, functional annotation, phylogenetic and divergence time analyses, synteny, GC content and differential transcription analyses are described in Supplementary Methods.