Comparative genomics of the major parasitic worms

International Helminth Genomes Consortium*

Parasitic nematodes (roundworms) and platyhelminths (flatworms) cause debilitating chronic infections of humans and animals, decimate crop production and are a major impediment to socioeconomic development. Here we report a broad comparative study of 81 genomes of parasitic and non-parasitic worms. We have identified gene family births and hundreds of expanded gene families at key nodes in the phylogeny that are relevant to parasitism. Examples include gene families that modulate host immune responses, enable parasite migration through host tissues or allow the parasite to feed. We reveal extensive lineage-specific differences in core metabolism and protein families historically targeted for drug development. From an in silico screen, we have identified and prioritized new potential drug targets and compounds for testing. This comparative genomics resource provides a much-needed boost for the research community to understand and combat parasitic worms.

Nevertheless, findings made using a subset of high-quality assemblies that were designated 'tier 1' (Methods and Supplementary Table 4) were corroborated against all species.
Genome size varied greatly within each phylum, from 42 to 700 Mb in nematodes, and from 104 to 1,259 Mb in platyhelminths. In a small number of cases, size estimates may have been artifactually inflated by high heterozygosity causing alternative haplotypes to be represented within the assemblies (Supplementary Note 1.3 and Supplementary Table 2a). A more important factor appeared to be repeat content, which ranged widely, from 3.8 to 54.5% (5th-95th percentile; Supplementary Table 5). A multiple regression model, built to rank the major factors driving genome size variation, identified long terminal repeat transposons, simple repeats, assembly quality, DNA transposable elements, total length of introns and low complexity sequence as being the most important (Supplementary Note 1.3, Methods and Supplementary Table 6). Genome size variation is thus largely due to non-coding elements, as expected 45, including repetitive and non-repetitive DNA, suggesting it is either non-adaptive or responding to selection only at the level of overall genome size.
Gene family births and expansions. We inferred gene families from the predicted proteomes of the 91 species using Ensembl Compara 46. Of the 1.6 million proteins, 1.4 million were placed into 108,351 families (Supplementary Note 2.1 and Supplementary Data), for which phylogenetic trees were built and orthology and paralogy inferred (Methods, Supplementary Fig. 2 and Supplementary Table 7). Species trees inferred from 202 single-copy gene families that were present in at least 25% of species (Fig. 1), or from presence/absence of gene families, largely agreed with the expected species and clade relationships, except for a couple of known contentious issues (Supplementary Fig. 3, Supplementary Note 2.2 and Methods).
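The selection of single-copy gene families present in at least 25% of species, used for the species trees, could be sketched as follows. This is a minimal Python illustration and not the pipeline used in the study: the input layout (a mapping from family identifier to species/gene pairs), the function name and the toy identifiers are all assumptions made for the example.

```python
from collections import Counter


def single_copy_families(families, n_species, min_fraction=0.25):
    """Return IDs of families with one gene per species that are
    present in at least `min_fraction` of the `n_species` species.

    families: dict mapping family ID -> list of (species, gene_id) tuples
              (hypothetical layout, chosen for illustration).
    """
    selected = []
    for fam_id, members in families.items():
        # Count how many genes each species contributes to this family.
        per_species = Counter(species for species, _gene in members)
        is_single_copy = all(count == 1 for count in per_species.values())
        fraction_present = len(per_species) / n_species
        if is_single_copy and fraction_present >= min_fraction:
            selected.append(fam_id)
    return sorted(selected)


# Toy input: 8 species in total, three invented families.
toy_families = {
    "famA": [("sp1", "g1"), ("sp2", "g2")],                 # single copy, 2/8 species
    "famB": [("sp1", "g3"), ("sp1", "g4"), ("sp2", "g5")],  # duplicated in sp1
    "famC": [("sp1", "g6")],                                # single copy, 1/8 species
}
print(single_copy_families(toy_families, n_species=8))
```

Only `famA` passes both conditions in the toy input; `famB` fails the single-copy test and `famC` falls below the 25% presence threshold.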
The species in our dataset contained significant novelty in gene content. For example, ~28,000 parasitic nematode gene families contained members from two or more parasitic species but were absent from Caenorhabditis elegans, and 47% of gene families lacked any functional annotation (Supplementary Note 2.1 and Methods). The latter families tended to be smaller than those with annotations (Supplementary Fig. 4) and, in many cases, correspond to families that are so highly diverged that ancestry cannot be traced, reflecting the huge breadth of unexplored parasite biology. Gene families specific to particular parasite clades are likely to reflect important aspects of parasite biology and possible targets for new antiparasitic interventions. At key nodes in the phylogeny that are relevant to parasitism, we identified 5,881 families with apparent clade-specificity (synapomorphies; Supplementary Note 2.3, Methods and Supplementary Table 8), although our ability to discriminate truly parasite-specific clades was limited by the low number of free-living species. The apparent synapomorphies were either gene family births, or subfamilies that were so diverged from their homologues that they appeared as separate families. Functional annotation of these families was diverse (Fig. 2), but they were frequently associated with sensory perception (such as G-protein coupled receptors; GPCRs), parasite surfaces (platyhelminth tegument or nematode cuticle maintenance proteins) and protein degradation (proteases and protease inhibitors).
Among nematodes, clade IVa (which includes Strongyloides spp.; Fig. 1) showed the highest number of clade-specific families, including a novel ferrochelatase-like family. Most nematodes lack functional ferrochelatases for the last step of haem biosynthesis 47, but harbor ferrochelatase-like genes of unknown function, to which the synapomorphic clade IVa family was similar (Supplementary Fig. 5 and Methods). Exceptions are animal parasites in nematode clades III (for example ascarids and filaria) and IV that acquired a functional ferrochelatase via horizontal gene transfer 48,49. Within the parasitic platyhelminths, a clade-specific inositol-pentakisphosphate 2-kinase (IP2K) was identified. In some species of Echinococcus tapeworms, IP2K produces inositol hexakisphosphate nanodeposits in the extracellular wall (the laminated layer) that protects larval metacestodes 50. The deposits increase the surface area for adsorption of host proteins and may promote interactions with the host 51.
Paralogous expansions of gene families, particularly those that are large or repeatedly involve related processes, can be evidence of adaptive evolution. We searched among our 10,986 highest-confidence gene families (those containing ≥10 genes from tier 1 species) for those that had expanded in parasite clades. A combination of scoring metrics (Methods) reduced the list to 995 differentially distributed families with a bias in copy number in at least one parasite clade. Twenty-five expansions have previously been observed, including 21 with possible roles in parasitism (Supplementary Fig. 6). A further 43 were placed into major functional classes that historically have been favored as drug targets (kinases, GPCRs, ion channels and proteases 52; Supplementary Fig. 7 and Supplementary Table 9a). Even when families could be functionally annotated to some extent (for example, based on a protein domain), discerning their precise biological role was a challenge. For example, a sulfotransferase family that was expanded in flukes compared with tapeworms includes the Schistosoma mansoni locus that is implicated in resistance to the drug oxamniquine 53, but the endogenous substrate for this enzyme is unknown (Supplementary Fig. 7j).
Among the newly identified expansions, we focused on those with richer functional information, especially where they were related to similar biological processes. For instance, we identified several expansions of gene families involved in innate immunity of the parasites, as well as their development. These included families implicated in protection against bacterial or fungal infections in nematode clade IVa (bus-4 GT31 galactosyltransferase 54, irg-3 55) and clades Va/Vc (lysozyme 56 and the dual oxidase bli-3 57) (Supplementary Fig. 8a-d). In nematode clade IIIb, a family was expanded that contains orthologs of the Parascaris coiled-coil protein PUMA, involved in kinetochore biology 58 (Fig. 2b). This expansion possibly relates to the evolution of chromatin diminution in this clade, which results in an increased number of chromosomes requiring correct segregation during metaphase 59. In nematode clade IVa and in Bursaphelenchus, an expansion of a steroid kinase family (Supplementary Fig. 8e) is suggestive of novelty in steroid-regulated processes in this group, such as the switch between free-living or parasitic stages in Strongyloides 60.
Infections with parasitic worms are typified by their chronicity, and a plausible involvement in host-parasite interactions is a recurring theme for many of the families. Taenia tapeworms and clade V strongylid nematodes (that is Va, Vb and Vc; Fig. 1) contained two expanded families with apyrase domains that may have a role in hydrolyzing ATP (a host danger signal) from damaged host tissue 61 (Fig. 2b and Supplementary Fig. 9a). Moreover, many of the strongylid members also contained amine oxidoreductase domains, possibly to reduce production of pro-inflammatory amines, such as histamine, from host tissues 62. In platyhelminths, we observed expansions of tetraspanin families that are likely components of the host/pathogen interface. Described examples show tetraspanins being part of extracellular vesicles released by helminths within hosts 63; or binding the Fc domain of host antibodies 64; or being highly immunogenic 65 (Supplementary Fig. 9b,c). In strongylids, especially clade Vc, an expansion of the fatty acid and retinol-binding (FAR) family, implicated in host-parasite interaction of plant- and animal-parasitic nematodes 66,67 (Supplementary Fig. 9d), suggests a role in immune modulation. Repertoires of glycosyl transferases have expanded in nematode clades Vc and IV, and tapeworms (Supplementary Fig. 10a-c), and may be used to evade or divert host immunity by modifying parasite surface molecules directly exposed to the immune system 68; alternatively, surface glycoproteins may interact with lectin receptors on innate immune cells in an inhibitory manner 69. An expanded chondroitin hydrolase family in nematode clade Vc may be used either for larval migration through host connective tissue or to digest host intestinal walls (Supplementary Fig. 9e). Similarly, an expanded GH5 glycosyl hydrolase family contained schistosome members with egg-enriched expression 8,70 that may be used for traversing host tissues such as bladder or intestinal walls (Supplementary Fig. 9f). In nematode clade I, we found an expansion of a family with the PAN/Apple domain, which is implicated in attachment of some protozoan parasites to host cells 71, and possibly modulates host lectin-based immune activation (Supplementary Fig. 9g).
The SCP/TAPS (sperm-coating protein/Tpx/antigen 5/pathogenesis-related protein 1) genes have been associated with parasitism through their abundance, secretion and evidence of their role in immunomodulation 72 but are poorly understood. This diverse superfamily appeared as eight expanded Compara families. A more comprehensive phylogenetic analysis of the full repertoire of 3,167 SCP/TAPS sequences (Supplementary Note 2.5, Supplementary Table 10 and Methods) revealed intra- and interspecific expansions and diversification over different evolutionary timescales (Fig. 3 and Supplementary Figs. 11a,b and 12). In particular, the SCP/TAPS superfamily has expanded independently in nematode clade V (18-381 copies in each species) and in clade IVa parasites (39-166 copies) (Fig. 3 and Supplementary Fig. 11c). Dracunculus medinensis (Guinea worm) was unusual in being the only member of clade III to display an expansion (66 copies), which may reflect modulation of the host immune response during the tissue migration phase of its large adult females.
Proteins historically targeted for drug development. Proteases, GPCRs, ion channels and kinases dominate the list of targets for existing drugs for human diseases 52, and are attractive leads for developing new ones. We therefore explored the diversity of these superfamilies across the nematodes and platyhelminths (Supplementary Fig. 13, Supplementary Note 3 and Methods).
Proteases and protease inhibitors perform diverse functions in parasites, including immunomodulation, host tissue penetration, modification of the host environment (for example, anticoagulation) and digestion of blood 73. M12 astacins have particularly expanded in nematode clade IVa (five families), as previously reported 18 (Supplementary Table 11). Because many of these species invade through skin (IVa, Vc; Supplementary Table 12) and migrate through the digestive system and lung (IVa, Vc, Vb; Supplementary Table 13), these expansions are consistent with evidence that astacins are involved in skin penetration and migration through connective tissue 74. The cathepsin B C1-cysteine proteases are particularly expanded in species that feed on blood (two expansions in nematode clades Vc and Va 30, with highest platyhelminth gene counts in schistosomatids and Fasciola 12; Supplementary Fig. 14). Indeed, they are involved in blood digestion in adult nematodes 75 and platyhelminths 76, but some likely have different roles such as larval development 77 and host invasion 78.
Different protease inhibitors may modulate activity of parasite proteases or protect parasitic nematodes and platyhelminths from degradation by host proteases, facilitate feeding or manipulate the host response to the parasite 79. The I2 (Kunitz-BPTI) trypsin inhibitors are the most abundant protease inhibitors across parasitic nematodes and platyhelminths (Fig. 4). An expansion of the I17 family, which includes secretory leukocyte peptidase inhibitor, was reported previously in Trichuris muris 17, but the striking confinement of this expansion to most of the parasites of clade I is now apparent (Fig. 4). We also observed a notable family of α-2-macroglobulin (I39) protease inhibitors that are present in all platyhelminths but expanded in tapeworms (Supplementary Fig. 14). The tapeworm α-2-macroglobulins may be involved in reducing blood clotting at attachment or feeding sites; alternatively, they may modulate the host immune response, since α-2-macroglobulins bind several cytokines and hormones 80. Chymotrypsin/elastase inhibitors (family I8) were particularly expanded in clades Vc and IVa (consistent with upregulation of I8 genes in Strongyloides parasitic stages 18) and to a lesser extent in clade IIIb (Fig. 4), consistent with evidence that they may protect Ascaris from host proteases 81. We also identified protein domain combinations that were specific to either nematodes or platyhelminths (131 and 50 domain combinations, respectively). Many of these involved protease and protease inhibitor domains. In nematodes, several combinations included Kunitz protease inhibitor domains, and in platyhelminths metalloprotease families M18 and M28 were found in novel combinations (Supplementary Table 14 Table 15).
GPCR families lacking sequence similarity with known receptors included the platyhelminth-specific rhodopsin-like orphan families (PROFs), which are likely to be class A receptors and peptide responsive, and several other fluke-specific non-PROF GPCR families. The massive radiation of chemoreceptors in C. elegans was unmatched in any other nematode (87% versus ≤48% of GPCRs). All parasitic nematodes possessed chemoreceptors, with the most in clade IVa, including several large families synapomorphic to this clade (Supplementary Fig. 15), perhaps related to their unusual life cycles that alternate between free-living and parasitic forms.
Independent expansion and functional divergence has differentiated the nematode and platyhelminth pentameric ligand-gated ion channels (Supplementary Fig. 16, Supplementary Table 16 and Supplementary Note 3.4). For example, glutamate signaling arose independently in platyhelminths and nematodes 83, and in trematodes the normal role of acetylcholine has been reversed, from activating to inhibitory 84. Our analysis suggested the platyhelminth acetylcholine-gated anion channels are most related to the Acr-26/27 group of nematode nicotinic acetylcholine receptors that are the target of the anthelmintics morantel and pyrantel 85, rather than to nematode acetylcholine-gated cation channels, targeted by nicotine and levamisole (Supplementary Fig. 17).
ABC transporters (Supplementary Table 17 and Supplementary Note 3.5) and kinases (Supplementary Note 3.6 and Supplementary Fig. 18) showed losses and independent expansion within nematodes and platyhelminths. The P-glycoprotein class of transporters, responsible for the transport of environmental toxins and linked with anthelmintic resistance, is expanded relative to vertebrates 86, with increased numbers in nematodes (Supplementary Fig. 19).

Metabolic reconstructions of nematodes and platyhelminths.
In the context of drug discovery, understanding the metabolic capabilities of parasitic worms may reveal vulnerabilities that can be exploited in target-based screens for new compounds. For each of the 81 nematode and platyhelminth species, metabolism was reconstructed based on high-confidence assignment of enzyme classes (Supplementary Table 18a). Nematodes had a greater range of annotated enzymes per species than the platyhelminths (Supplementary Fig. 20a), in part reflecting the paucity of biochemical studies in platyhelminths. Because variation in assembly quality or divergence from model organisms 87 … (Supplementary Fig. 21). Pathways related to almost all metabolic superpathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG) 88 showed significantly lower coverage for platyhelminths (versus nematodes) and filaria (versus other nematodes) (Supplementary Fig. 20b).
In contrast to most animals, nematodes possess the glyoxylate cycle that enables conversion of lipids to carbohydrates, to be used for biosyntheses (for example, during early development) and to avert starvation 89. The glyoxylate cycle appears to have been lost independently in the filaria and Trichinella species (Fig. 5a; M00012), both of which are tissue-dwelling obligate parasites. The filaria and Trichinella have also independently lost alanine-glyoxylate transaminase that converts glyoxylate to glycine (Fig. 5b). Glycine can be converted by the glycine cleavage system (GCS) to 5,10-methylenetetrahydrofolate, a useful one-carbon pool for biosyntheses, and two key GCS proteins appear to have been lost independently from filaria and tapeworms, suggesting their GCS is non-functional (Supplementary Table 19e). In addition, filaria have lost the ability to produce and use ketone bodies, a temporary store of acetyl coenzyme A (CoA) under starvation conditions (Supplementary Table 19b). The filaria lost these features after they diverged from D. medinensis, an outgroup to the filaria in clade IIIc that has a major difference in its life cycle, namely, a free-living larval stage (Supplementary Table 12).
The absence of multiple initial steps of pyrimidine synthesis was observed in some nematodes, including all filaria (as previously reported 23) and tapeworms, suggesting they obtain pyrimidines from Wolbachia endosymbionts or from their hosts, respectively (Supplementary Table 19f). Similarly, all platyhelminths and some nematodes (especially clade IVa and filaria IIIc) appear to lack key enzymes for purine synthesis (Supplementary Table 19g) and rely on salvage instead. However, despite the widespread belief that nematodes cannot synthesize purines 90,91, complete or near-complete purine synthesis pathways were found in most members of clades I, IIIb and V. Nematodes are known to be unable to synthesize haem 47, but the pathway was found in platyhelminths, including S. mansoni (despite conflicting biochemical data 47) (Supplementary Table 19h and Supplementary Table 20i).
Genes from the β-oxidation pathway, used to break down lipids as an energy source, were not detected in schistosomes and some cyclophyllidean tapeworms (Hymenolepis, Echinococcus; Fig. 5a, M00087; Supplementary Table 19a). These species live in glucose-rich environments and may have evolved to use glucose and glycogen as principal energy sources. However, biochemical data suggest they do perform β-oxidation 92, so they may have highly diverged but functional β-oxidation genes.
The lactate dehydrogenase (LDH) pathway is a major source of ATP in anaerobic but glucose-rich environments. Platyhelminths have high numbers of LDH genes, as do blood-feeding Ancylostoma hookworms (Supplementary Fig. 22g). Nematode clades Vc (including Ancylostoma) and IIIb have expansions of α-glucosidases that may break down starch and disaccharides in host food to glucose (Supplementary Fig. 22a). Many nematodes and flatworms use malate dismutation as an alternative pathway for anaerobic ATP production 93. The importance of the pathway for clade IIIb nematodes was reflected in expanded families encoding two key pathway enzymes, PEPCK and methylmalonyl CoA epimerase, and the intracellular trafficking chaperone for cobalamin (vitamin B-12), a cofactor for the pathway (Supplementary Fig. 22c- Fig. 23b).
Identifying new anthelmintic drug targets and drugs. As an alternative to a purely target-based approach that would require extensive compound screening, we explored drug repurposing possibilities. We developed a pipeline to identify the most promising targets from parasitic nematodes and platyhelminths. These sequences were used in searches of the ChEMBL database that contains curated activity data on defined targets in other species and their associated drugs and compounds (Supplementary Note 5 and Methods). Our pipeline identified compounds that are predicted to interact with the top 15% of highest-scoring worm targets (n = 289). These targets included 17 out of 19 known or likely targets for World Health Organization-listed anthelmintics that are represented in ChEMBL (Supplementary Table 21b). When compounds within a single chemical class were collapsed to one representative, this potential screening set contained 5,046 drug-like compounds, including 817 drugs with phase III or IV approval and 4,229 medicinal chemistry compounds (Supplementary Table 21d). We used a self-organizing map to cluster these compounds based on their molecular fingerprints (Fig. 6). This classification showed that the screening set was significantly more structurally diverse than existing anthelmintic compounds (Supplementary Fig. 24). The 289 targets were further reduced to 40 high-priority targets, based on predicted selectivity, avoidance of side-effects (clade-specific chokepoints or lack of human homologues) and putative vulnerabilities, such as those suggested by gene family expansions in parasite lineages, or belonging to pathways containing known or likely anthelmintic targets (Supplementary Fig. 25). These 40 targets were associated with 720 drug-like compounds comprising 181 phase III/IV drugs and 539 medicinal chemistry compounds. There is independent evidence that some of these have anthelmintic activity. For example, we identified several compounds that potentially target glycogen phosphorylase, which is in the same pathway as a likely anthelmintic target (glycogen phosphorylase phosphatase, likely target of niridazole; Supplementary Fig. 25). These compounds included the phase III drug alvocidib (flavopiridol), which has anthelmintic activity against C. elegans 96. Another example is the target cathepsin B, expanded in nematode clade Va (Supplementary Table 9a), for which we identified several compounds including the phase III drug odanacatib, which has been shown to have anthelmintic activity against hookworms 97. Existing drugs such as these are attractive candidates for repurposing and fast-track therapy development, while the medicinal chemistry compounds provide a starting point for broader anthelmintic screening.
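Two of the filtering steps described for the repurposing pipeline, keeping the top 15% of highest-scoring targets and collapsing compounds within a chemical class to one representative, can be illustrated with a short Python sketch. It is a hedged illustration only: the scoring scheme itself is not reproduced, and the rounding rule, the tie-breaking, the choice of the highest clinical phase as class representative and all identifiers are assumptions that the text does not specify.

```python
import math


def top_scoring_targets(scores, fraction=0.15):
    """Return target IDs in the top `fraction` by score, highest first.

    scores: dict mapping target ID -> numeric score (hypothetical input).
    Assumption: the cut-off count is rounded up, ties are broken by ID.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scores, key=lambda target: (-scores[target], target))
    n_keep = math.ceil(len(ranked) * fraction)
    return ranked[:n_keep]


def collapse_by_class(compounds):
    """Keep one representative compound per chemical class.

    compounds: list of dicts with keys 'name', 'chem_class' and 'phase'.
    Assumption: the compound with the highest clinical phase represents
    its class; the first one seen wins if phases are equal.
    """
    best = {}
    for compound in compounds:
        current = best.get(compound["chem_class"])
        if current is None or compound["phase"] > current["phase"]:
            best[compound["chem_class"]] = compound
    return [best[chem_class] for chem_class in sorted(best)]


# Toy data with invented identifiers.
toy_scores = {f"t{i:02d}": 11 - i for i in range(1, 11)}  # t01 scores highest
toy_compounds = [
    {"name": "cmpd_a", "chem_class": "X", "phase": 2},
    {"name": "cmpd_b", "chem_class": "X", "phase": 4},
    {"name": "cmpd_c", "chem_class": "Y", "phase": 0},
]
print(top_scoring_targets(toy_scores))
print([c["name"] for c in collapse_by_class(toy_compounds)])
```

With ten toy targets, a 15% cut-off rounded up keeps two of them; a pipeline that rounds down or handles ties differently would give a slightly different set.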

Discussion
The evolution of parasitism in nematodes and platyhelminths occurred independently, starting from different ancestral gene sets and physiologies. Despite this, common selective pressures of adaptation to host gut, blood or tissue environments, the need to avoid hosts' immune systems, and the acquisition of complex life cycles to effect transmission, may have driven adaptations in common biological pathways. While previous comparative analyses of parasitic worms have been limited to a small number of species within narrow clades, we have surveyed parasitic worms spanning two phyla, with a focus on those infecting humans and livestock. A large body of draft genome data (both published and unpublished) was utilized but, by focusing on lineage-specific trends rather than individual species-specific differences, our analysis was deliberately conservative. In particular, we have focused on large gene family expansions, supported by the best-quality data and for which functional information was available. Sequencing of further free-living species, better functional characterization, and identification of remote orthologs (particularly for platyhelminths 87), will undoubtedly refine the resolution of parasite-specific differences, but our gene family analyses have already revealed expansions and synapomorphies in functional classes of likely importance to parasitism, such as feeding and interaction with hosts. We have used a drug repurposing approach to predict potential new anthelmintic drug targets and drugs/drug-like compounds that we urge the community to explore. Further new potential drug targets, worthy of high-throughput compound screening, may be exposed by the losses of key metabolic pathways and horizontally acquired genes that we find in particular parasite groups. This is an unprecedented dataset of parasitic worm genomes that provides a new type of pan-species reference and a much-needed stimulus to the study of parasitic worm biology.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, statements of data availability and associated accession codes are available at https://doi.org/10.1038/s41588-018-0262-1.

Methods
Sample collection and preparation. Sources of material and sequencing approaches are summarized in Supplementary Table 1.
Wellcome Sanger Institute (WSI) data production. The genomes of 36 species (Supplementary Tables 1 and 2) were sequenced at WSI. The C. elegans N2 was also resequenced at WSI.
WSI sequencing and assembly. PCR-free 400-550 bp paired-end Illumina libraries were prepared from <0.1 ng to 5 µg genomic DNA, as described for Strongyloides stercoralis 18. Where there was insufficient DNA, adapter-ligated material was subjected to ~8 PCR cycles.
We used 1-10 µg gDNA or whole genome amplification DNA to generate 3 kb mate-pair libraries, as described for S. stercoralis 18. If there was insufficient gDNA, whole genome amplification was performed using GenomiPhi v2. Each library was run on ≥1 Illumina HiSeq 2000 lane.
Short insert paired-end reads were corrected and assembled with SGA v0.9.7 98 (Supplementary Fig. 26a). This assembly was used to calculate the k-mer distribution for all odd k of 41-81, using GenomeTools v1.3.7 99. The k-mer length for which the maximum number of unique k-mers was present was used as the k-mer setting in a second assembly, using Velvet v1.2.03 100 with SGA-corrected reads. For species with 3 kb mate-pair data, the Velvet assembly was scaffolded using SSPACE 101. Contigs were extended, and gaps closed and shortened, using Gapfiller 102 and IMAGE 103. Short fragment reads were remapped to the assembly using SMALT (see URLs), and unaligned reads were assembled using Velvet 100 and merged with the main assembly. The assembly was re-scaffolded using SSPACE 101, and consensus base quality improved with iCORN 104. REAPR 105 was used to break incorrectly assembled scaffolds/contigs. We carried out manual improvement for Wuchereria bancrofti and D. medinensis using Gap5 106 and Illumina read-pairs.
WSI assembly quality control. Contamination screening. Assemblies were screened for contamination using BLAST 107 against vertebrate and invertebrate sequences (see ref. 108). For Anisakis simplex, the assembly contained minor laboratory contamination with S. mansoni, which we removed using BLASTN against S. mansoni.
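The rule used to pick the k-mer setting for the second assembly (scan odd k and keep the value with the most unique k-mers) can be sketched in Python as below. This is an illustration rather than the GenomeTools-based procedure used in the study. It assumes that 'unique' means distinct k-mers, ignores reverse complements, and breaks ties in favor of the smaller k; the function names are hypothetical.

```python
def count_distinct_kmers(sequences, k):
    """Count distinct k-mers across a list of sequences (forward strand only)."""
    seen = set()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            seen.add(seq[start:start + k])
    return len(seen)


def choose_k(sequences, k_min=41, k_max=81):
    """Return the odd k in [k_min, k_max] with the most distinct k-mers.

    Assumption: ties are resolved by keeping the smallest such k.
    """
    best_k, best_count = None, -1
    for k in range(k_min, k_max + 1):
        if k % 2 == 0:
            continue  # only odd k are considered
        count = count_distinct_kmers(sequences, k)
        if count > best_count:
            best_k, best_count = k, count
    return best_k


# Toy usage on a short sequence with a deliberately small k range.
print(choose_k(["AACCGGTT"], k_min=1, k_max=7))
```

On real data the sequences would be the contigs of the first assembly and the default range of 41-81 would apply; holding every k-mer in a Python set is only practical for small inputs.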
Assembly completeness. CEGMA v2.4 109 was used to assess completeness. Consistent sets of CEGMA genes were missing from some phylogenetic groups (Supplementary Table 2); these were discounted from the completeness calculation for those species ('CEGMA' in Supplementary Table 2).
Effect of repeats. We re-mapped the short-insert library's reads to the appropriate assembly using SMALT (see URLs; indexing -k13 -s4 and mapping -y 0.9 -x -r 1). For each scaffold of ≥8 kb, median ($\mathrm{med}_s$) and mean ($m_s$) per-base read depth were calculated using BEDTools 110, and genome-wide depth ($\mathrm{med}_g$) was calculated as the median of $\mathrm{med}_s$ across scaffolds (ref. 17). For a scaffold of length $l_s$ bp, the extra sequence that would be gained by 'uncollapsing' repeats was estimated as $e_s = (m_s - \mathrm{med}_g) \times l_s / \mathrm{med}_g$ (Supplementary Table 5).
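The estimate of collapsed repeat sequence defined above can be worked through in a few lines of Python. The sketch assumes that per-scaffold mean and median depths have already been computed (for example from BEDTools output); the record layout and the function name are hypothetical.

```python
from statistics import median


def extra_sequence_from_repeats(scaffolds, min_length=8000):
    """Estimate e_s = (m_s - med_g) * l_s / med_g for each scaffold.

    scaffolds: list of dicts with keys 'name', 'length' (bp),
               'mean_depth' (m_s) and 'median_depth' (med_s).
    Only scaffolds of at least `min_length` bp are used. med_g is the
    median of the per-scaffold median depths.
    Returns (dict of name -> e_s in bp, med_g).
    """
    kept = [s for s in scaffolds if s["length"] >= min_length]
    med_g = median(s["median_depth"] for s in kept)
    extra = {
        s["name"]: (s["mean_depth"] - med_g) * s["length"] / med_g
        for s in kept
    }
    return extra, med_g


# Toy depths: scaffold A has inflated mean depth, as expected for collapsed repeats.
toy_scaffolds = [
    {"name": "A", "length": 10000, "mean_depth": 45.0, "median_depth": 30.0},
    {"name": "B", "length": 20000, "mean_depth": 30.0, "median_depth": 30.0},
    {"name": "C", "length": 8000, "mean_depth": 33.0, "median_depth": 32.0},
    {"name": "D", "length": 5000, "mean_depth": 90.0, "median_depth": 90.0},  # < 8 kb, ignored
]
extra, med_g = extra_sequence_from_repeats(toy_scaffolds)
print(med_g, extra)
```

In the toy data the genome-wide depth is 30, so scaffold A, with a mean depth of 45 over 10 kb, contributes an estimated 5 kb of additional sequence, while scaffold B contributes none.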
WSI gene prediction. Our pipeline 111 had four steps (Supplementary Fig. 27a). First, repeats were masked. Second, preliminary gene predictions, to use as input for MAKER v2.2.28 112 were generated using Augustus 2.5.5 113 , SNAP 2013-02-16 114 , GeneMark-ES 2.3a 115 , genBlastG 116 and RATT 117 . Third, species-specific ESTs and complementary DNAs from INSDC 118 , and proteins from related species, were aligned to the genome using BLAST 107 . Last, EST/protein alignments and gene models were used by MAKER to produce a gene set.
McDonnell Genome Institute (MGI) data production. The genomes of six species were sequenced at MGI (Supplementary Tables 1 and 2).
MGI sequencing, assembly and quality control. Genome sequencing was carried out on Illumina and 454 instruments (see ref. 119). The workflow for each assembly is in Supplementary Table 1. Three kilobase, 8 kb and fragment 454 reads (or Illumina reads) were subject to adapter removal, quality trimming and length filtering (Supplementary Fig. 26b). Cleaned 454 reads were assembled using Newbler 120 before being scaffolded with an in-house tool CIGA, which links contigs based on cDNA evidence. Cleaned Illumina reads were assembled using AllPaths-LG 121. The assembly was scaffolded further using an in-house tool Pygap, using Illumina short paired-end sequences; and L_RNA_scaffolder 122, using 454 cDNA data.
An assisted assembly approach was used for Trichinella nativa, whereby 'cleaned' Illumina 3 kb paired-end sequence data were mapped against the T. spiralis genome using bwa 123 (Supplementary Fig. 26b), and the T. nativa residues were substituted at aligned positions (see ref. 119 ).
Adaptor sequences and contaminants were identified by comparison to a database of vectors and contaminants, using Megablast 124 .
MGI transcriptome sequencing and gene prediction. Transcriptome libraries (Supplementary Table 22) were generated with the Illumina TS stranded protocol, and reads assembled using Trinity 125 (see ref. 119 ).
Genes were predicted using MAKER 112 , based on input gene models from SNAP 114 , FGENESH (Softberry), Augustus 113 , and aligned messenger RNA, EST, transcriptome and protein data from the same or related species (Supplementary Fig. 27b; see ref. 119 ).
Blaxter Nematode and Neglected Genomics (BaNG) data production. The genomes of three species were sequenced by BaNG (Supplementary Tables 1 and 2).
Sequencing was performed on Illumina HiSeq 2000 and HiSeq 2500 instruments, using 100 or 125 base, paired-end protocols. Paired-end libraries were generated using the Illumina TruSeq protocol.
Sequence data were filtered of contaminating host reads using blobtools 126 . Cleaned reads were normalized with the khmer software 127 using a k-mer of 41, and then assembled with ABySS (v1.3.3) 128 , with a minimum of three pairs needed to connect contigs during scaffolding (n = 3) (Supplementary Fig. 26c). Assemblies were assessed using blobtools and CEGMA 109 .
Augustus 113 was used to predict gene models, trained using annotations from MAKER 112 . As hints for MAKER, we used Litomosoides sigmodontis 454 RNA sequencing data assembled with MIRA 129 and Newbler 120 , and Onchocerca ochengi Illumina RNA sequencing data 130 assembled using Trinity 131 (Supplementary Fig. 27c).
Defining high-quality 'tier 1' species. A subset of nematode and platyhelminth genomes, termed 'tier 1', was selected that had better-quality assemblies and spanned the major clades (Supplementary Table 4). To choose these, species were selected that (1) had contiguous assemblies (usually N50/scaffold-count > 5), and complete proteomes (usually CEGMA partial > 85%), or (2) that helped to ensure ~50% of the genera in each species group ('Analysis group' in Supplementary Table 4) were represented.
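Criterion (1) can be expressed as a simple filter. The sketch below is illustrative only: the function and parameter names are hypothetical, and because the text describes the thresholds as 'usual' rather than strict, the cutoffs are exposed as adjustable defaults. Criterion (2), which concerns genus representation, is not modelled.

```python
def meets_quality_criteria(n50, scaffold_count, cegma_partial_pct,
                           min_ratio=5.0, min_cegma=85.0):
    """Return True if an assembly passes both usual thresholds of criterion (1).

    n50: scaffold N50 in bp; scaffold_count: number of scaffolds;
    cegma_partial_pct: CEGMA partial completeness as a percentage.
    """
    contiguous = n50 / scaffold_count > min_ratio
    complete = cegma_partial_pct > min_cegma
    return contiguous and complete


# Hypothetical assemblies, not values from the study.
good = meets_quality_criteria(n50=500_000, scaffold_count=2_000, cegma_partial_pct=95.0)
fragmented = meets_quality_criteria(n50=20_000, scaffold_count=50_000, cegma_partial_pct=95.0)
```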
Analysis of repeat content and genome size. For each species, repeat libraries were built using RepeatModeler (see URLs), TransposonPSI (see URLs) and LTRharvest 132 , and the three libraries merged (see ref. 133 ). The merged library was used to mask repeats in a species' genome using RepeatMasker (see URLs; -s).
The initial standard regression model and stepwise model fitting used 'lm' and 'step' in R v3.2.2. The Bayesian mixed-effect model used MCMCglmm 134 (v2.24). To create a mixed-effect model, the species tree (see Methods) was transformed into an ultrametric tree using PATHd8 135 , with a small constant added to short branches to ensure no zero-length branches were reconstructed; and outgroup species were removed.
Compara database. An in-house Ensembl Compara 46 database was constructed containing the 81 platyhelminths and nematodes, and 10 additional outgroups (Supplementary Table 2). All parasitic nematode/platyhelminth species with gene sets available at the time (April 2014) were included.
The species tree used to construct the initial version of our database was an edited version of the National Center for Biotechnology Information (NCBI) taxonomy 136 with several controversial speciation nodes represented as multifurcations. For our final database, the input species tree was built from one-to-one orthologs that were present in ≥ 20 species in the previous database version. To do this, proteins in each ortholog group were aligned using MAFFT v6.857 137 ; alignments were trimmed using GBlocks v0.91b 138 , concatenated and used to build a maximum likelihood tree using a partitioned analysis in RAxML v7.8.6 139 , using the minimum Akaike's information criterion (minAIC) model for each ortholog group.
The database was queried to identify gene families, orthologs and paralogs.
Species tree and tree based on gene family presence. We identified 202 gene families that were present in ≥ 25% of the 91 species (81 helminths and 10 outgroups) in our Compara database (Methods) and were always single-copy. For each family, amino acid sequences were aligned using MAFFT v7.205 137 (--auto). Each alignment was trimmed using GBlocks v0.91b 138 (-b4=4 -b3=4 -b5=h), and its likelihood calculated on a maximum-parsimony guide tree for all relatively simple (single-matrix) amino acid substitution models in RAxML v8.0.24 139 , and the minAIC model identified. Alignments were concatenated and a maximum-likelihood tree built, under a partitioned model in which sites from a gene were assigned the minAIC model for that gene, with a discrete gamma distribution of rates across sites. Relationships within outgroup lineages were constrained to match the standard view of metazoan relationships (for example, Dunn et al. 140 ). The final tree was the highest likelihood one from five search replicates with different random number seeds. One hundred bootstrap resampling replicates were performed, each based on a single rapid search. We also constructed a maximum-likelihood phylogeny based on gene family presence/absence for families not shared by all 81 nematode/platyhelminth species, using RAxML v8.2.8 139 , with a two-state model and the Lewis method to correct for absence of constant-state observations.
Functional annotation. InterProScan 141 v5.0.7 was used to identify conserved domains from all predicted proteins. A name was assigned to each predicted protein based on curated information in UniProt 142 for orthologs identified from our Compara database (Methods), or based on InterPro 143 domains (see ref. 144 ). Gene ontology (GO) terms were assigned by transferring GO terms from orthologs 144 , and using InterProScan.
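The minAIC step used when building the species tree can be illustrated with the standard definition AIC = 2k − 2 ln L, where k is the number of free parameters and ln L the log-likelihood. The sketch below is an assumption-laden illustration: the log-likelihoods and parameter counts are invented, the function names are hypothetical, and in the study the likelihoods came from RAxML rather than from a hand-written table.

```python
def aic(log_likelihood, n_params):
    """Akaike's information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def min_aic_model(fits):
    """Pick the model with the smallest AIC.

    fits: dict mapping model name -> (log-likelihood, number of free parameters).
    """
    return min(fits, key=lambda model: aic(*fits[model]))


# Invented fits for one gene family alignment (not values from the study).
fits = {
    "LG": (-1000.0, 10),
    "WAG": (-1005.0, 10),
    "LG+F": (-995.0, 29),  # better likelihood, but 19 extra frequency parameters
}
best = min_aic_model(fits)
```

In this toy case the extra parameters of the third model outweigh its likelihood gain, which is the trade-off the criterion is designed to capture.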

GO and InterPro/Pfam annotation enrichment. Counts of proteins annotated with each GO term (or InterPro/Pfam domain) per species were normalized by dividing by the total GO annotations in a particular species. To test for enrichment of a particular GO term in a species group ('analysis group' in Supplementary Table 25).
Pathway coverage was the fraction of ECs in a reference pathway that were annotated in a species (see ref. 173 ). We included pathways for which KEGG had a reference pathway for a nematode/platyhelminth (Supplementary Table 18e). Presence of KEGG modules was predicted using modDFS 174 , and species clustered based on module presence using Ward-linkage, based on Jaccard similarity index 175 .
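Pathway coverage and the Jaccard similarity between species can be sketched with plain set operations. The function names, EC numbers and module identifiers below are illustrative assumptions; the Ward-linkage clustering itself is not shown and would typically be run on distances of the form 1 − Jaccard similarity.

```python
def pathway_coverage(reference_ecs, annotated_ecs):
    """Fraction of ECs in a reference pathway that are annotated in a species."""
    reference = set(reference_ecs)
    if not reference:
        return 0.0
    return len(reference & set(annotated_ecs)) / len(reference)


def jaccard_similarity(modules_a, modules_b):
    """Jaccard index between two species' sets of present modules."""
    a, b = set(modules_a), set(modules_b)
    union = a | b
    # Two empty sets are treated as identical here (an arbitrary convention).
    return len(a & b) / len(union) if union else 1.0


# Hypothetical inputs for illustration.
reference = {"1.1.1.1", "2.7.1.1", "4.2.1.11", "5.3.1.9"}
species_ecs = {"2.7.1.1", "5.3.1.9", "3.1.3.11"}
coverage = pathway_coverage(reference, species_ecs)

similarity = jaccard_similarity({"M1", "M2", "M3"}, {"M2", "M3", "M4"})
```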
Chokepoint enzymes were predicted following Taylor et al. 176 , using subnetworks of KEGG networks formed by just the enzymes (ECs) we had annotated in each particular species.
Potential anthelmintic drug targets and drugs. Potential drug targets. Nematode and platyhelminth proteins from tier 1 species (with high-quality assemblies; Methods) were searched against single-protein targets from ChEMBL v21 177 using BLASTP (E ≤ $1 \times 10^{-10}$). After collapsing by gene family, 1,925 worm genes remained.
To assign a 'target score' to each worm gene, the main factors considered were similarity to known drug targets; lack of human homologues; and whether C. elegans/Drosophila melanogaster homologues had lethal phenotypes (see ref. 178 ).
Potential new anthelmintic drugs. ChEMBL v21 177 was used to identify 827,889 compounds with activities against ChEMBL targets to which worm proteins had BLAST matches. To calculate 'compound scores', we prioritized compounds that were in high clinical development phases, were administered orally or topically, had crystal structures, had properties consistent with oral drugs and lacked toxicity (see ref. 178 ).
Our top 15% (249) of highest-scoring worm targets had 292,499 compounds. These were filtered by selecting compounds that (1) co-appeared in a PDBe 179 (Protein Data Bank in Europe) structure with the ChEMBL target; or (2) had median pChEMBL > 5; leaving 131,452 'top drug candidates'.
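The two-way filter for 'top drug candidates' can be sketched as a predicate over compound records. The `Compound` class and its field names are hypothetical and stand in for whatever was extracted from ChEMBL and PDBe; only the two conditions themselves come from the text.

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class Compound:
    compound_id: str
    in_pdbe_structure: bool              # co-appears with the ChEMBL target in a PDBe structure
    pchembl_values: list = field(default_factory=list)  # activity values against the target


def is_top_candidate(compound, min_pchembl=5.0):
    """Keep a compound if it satisfies condition (1) or condition (2)."""
    if compound.in_pdbe_structure:
        return True
    # A compound without activity values cannot pass the median threshold.
    return bool(compound.pchembl_values) and median(compound.pchembl_values) > min_pchembl


# Invented records for illustration.
compounds = [
    Compound("A", True, [4.0, 4.5]),
    Compound("B", False, [5.5, 6.0, 7.0]),
    Compound("C", False, [4.0, 5.0, 5.0]),
    Compound("D", False),
]
kept = [c.compound_id for c in compounds if is_top_candidate(c)]
```

The threshold is strict, so a compound with a median of exactly 5 is excluded unless it also has a PDBe structure.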
A 'diverse screening set'. The 131,452 candidates were placed into 27,944 chemical classes, based on ECFP4 fingerprints (see ref. 178 ). They were filtered by (1) discarding medicinal chemistry compounds that did not co-appear in a PDBe structure with the ChEMBL target, or have median pChEMBL > 7; (2) checking availability for purchase in ZINC 15 180 ; and (3) for each worm target, taking the highest-scoring compound from each class; this gave 5,046 compounds.
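Step (3), taking the highest-scoring compound from each chemical class for each worm target, amounts to a grouped maximum. The sketch below assumes a hypothetical record layout of (target, chemical class, compound identifier, compound score) and says nothing about how ties were broken in the study; here the first record seen wins a tie.

```python
def pick_best_per_class(records):
    """Return the highest-scoring compound for every (target, class) pair.

    records: iterable of (target, chem_class, compound_id, score) tuples.
    """
    best = {}
    for target, chem_class, compound_id, score in records:
        key = (target, chem_class)
        # Replace only on a strictly higher score, so earlier records win ties.
        if key not in best or score > best[key][1]:
            best[key] = (compound_id, score)
    return {key: compound_id for key, (compound_id, _) in best.items()}


# Invented records for illustration.
records = [
    ("target1", "class1", "cmpdA", 0.7),
    ("target1", "class1", "cmpdB", 0.9),
    ("target1", "class2", "cmpdC", 0.4),
    ("target2", "class1", "cmpdA", 0.6),
]
selection = pick_best_per_class(records)
```

Because selection is per target, the same compound can appear for more than one target, as `cmpdA` would here if it were the best in its class for both.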
Self-organizing map. We constructed a self-organizing map of our diverse screening set plus known anthelmintic compounds (Supplementary Table 21a; see ref. 178 ).

Software and code

Data collection
No software was used to collect the data in the study.

Data analysis
A large number of software applications were used in this study. All software used (custom and commercial/publicly available) is listed in the Methods. All custom scripts are available on request from the corresponding authors.

Life sciences study design

Sample size
Sample size was determined by the availability of parasite material rather than by a predetermined number. All samples were surplus material from other ongoing research projects, and parasite material is difficult to obtain.

Data exclusions
Some samples provided for this study were of poor quality, and the resulting data were therefore of insufficient quality to warrant inclusion in the data set. Exclusion criteria were not predetermined.

Replication
Experimental findings were not reproduced because of the scale of the study, in terms of time and cost, combined with the difficulty of obtaining parasite material.

Randomization
Samples were allocated to experimental groups based on taxonomic classification.

Blinding
Blinding was not relevant to this study as analyses were explicitly comparative.

Reporting for specific materials, systems and methods

Obtaining unique materials
Not all unique materials used in this study are available because a number of them came from wild or livestock animals, rather than laboratory-maintained animals. They are either unique samples that could not easily be obtained again, or all the available sample has been used up in this experiment. In some cases, material from laboratory-maintained populations may be available on request where feasible.

Animals and other organisms

Laboratory animals
Samples obtained were parasite materials that were surplus to other existing ongoing projects, either from wild animals, laboratory animals or already dead animals (e.g. from an abattoir). Further details on the samples are given in Supplementary Table 1.

Field-collected samples
Samples obtained were parasite materials that were surplus to other existing ongoing projects, either from wild animals,