Schistosomiasis is a neglected tropical disease that ranks with malaria and tuberculosis as a major source of morbidity affecting approximately 210 million people in 76 countries, despite strenuous control efforts1. It is caused by blood flukes of the genus Schistosoma (phylum Platyhelminthes), which exhibit dioecy and have complex life cycles comprising several morphologically distinct phenotypes in definitive human and intermediate snail hosts. Schistosoma mansoni, one of the three major human species, occurs across much of sub-Saharan Africa, parts of the Middle East, Brazil, Venezuela and some West Indian islands. The mature flukes dwell in the human portal vasculature, depositing eggs in the intestinal wall that either pass to the gut lumen and are voided in the faeces, or travel to the liver where they trigger immune-mediated granuloma formation and peri-portal fibrosis2. Approximately 280,000 deaths per annum are attributable to schistosomiasis in sub-Saharan Africa alone3. However, the disease is better known for its chronicity and debilitating morbidity4. A single drug, praziquantel, is almost exclusively used to treat the infection but this does not prevent reinfection, and with the large-scale control programmes in place, there is concern about the development of drug resistance. Indeed, resistance can be selected for in the laboratory and there are reports of increased drug tolerance in the field5.
In this study we present the sequence and analysis of the S. mansoni genome. Previous metazoan projects have been restricted to Deuterostomia (for example, Homo, Mus and Ciona) and the ecdysozoan clade of the Protostomia (for example, Drosophila, Caenorhabditis and Brugia). Together with the accompanying article on S. japonicum6, we present, to our knowledge, the first descriptions of metazoan genomes from the lophotrochozoan clade. The genome reveals features that aid our understanding of the evolution of complex body plans. We have mined the genome to predict new drug targets, on the basis of searches involving traditional areas for drug discovery, metabolic reconstruction, and bioinformatics screens that exploit shared pharmacology. It is hoped that these and other targets will accelerate drug discovery, generating the much needed new treatments for the control and eradication of schistosomiasis.
Genome structure and content
The nuclear genome sequence of S. mansoni was determined by whole-genome shotgun sequencing and assembled into 5,745 scaffolds greater than 2 kilobases (kb) (Supplementary Table 1), totalling 363 megabases (Mb). Although 40% of the genome is repetitive, 50% is assembled into scaffolds of at least 824.5 kb. Furthermore, 43% of the genome assembly (distributed over 153 scaffolds) was unambiguously assigned to chromosomes (seven autosomal, plus ZW sex-determination pairs) using fluorescence in situ hybridization (FISH; Fig. 1, Supplementary Fig. 1 and Supplementary Table 2).
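The statement that 50% of the assembly sits in scaffolds of at least 824.5 kb corresponds to what is commonly called the scaffold N50. The following is a minimal sketch of how such a figure can be computed from a list of scaffold lengths. The function name and the toy lengths are illustrative assumptions, not the assembly pipeline or data used in this study.

```python
def n50(lengths):
    """Return the N50 of a set of scaffold lengths.

    The N50 is the largest length L such that scaffolds of length >= L
    together contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold downwards, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (lengths in kb); not the real scaffold data.
toy_scaffolds_kb = [900, 800, 400, 300, 200, 100]
print(n50(toy_scaffolds_kb))
```

If the halfway point were measured against an estimated genome size rather than the assembled total, `total` would be replaced by that estimate.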
Figure 1: Physical map of S. mansoni.

a, b, Idiogram of S. mansoni chromosomes W, Z (a) and 3 (b). S. mansoni BAC clones were mapped to the karyotype of S. mansoni by FISH. The solid black areas are heterochromatin, the open areas are euchromatin. The BAC clones are identified by BAC numbers. c–f, Chromosome spreads with FISH-mapped BACs are shown. FISH-mapped BACs are identified by arrowheads on labelled chromosomes. Scale bar, 10 µm. See Supplementary Fig. 1 for idiograms of all S. mansoni chromosomes.
We identified 72 families of both long-terminal repeat (LTR) and non-LTR transposons, comprising 15% and 5% of the genome, respectively, and containing 63 and 60 new families each (Supplementary Table 3). The LTR transposons are from the Ty3/Gypsy and BEL clades, whereas the non-LTR transposons are restricted to the RTE, CR1 and R2 clades. Two previously described non-LTR retrotransposon families from the RTE clade (SR2 and Perere-3)7, 8 seem to have undergone a burst of transposition events after divergence of S. mansoni and S. japonicum, and contribute to an overall higher representation of non-LTR retrotransposons in S. mansoni (15%, around 8% in S. japonicum). A new DNA transposon belonging to the Mu family was also found, which represents the first instance in a flatworm. The presence of target site duplications in some copies indicates recent transposition, and suggests that active copies may still exist in the genome. A lack of terminal inverted repeats, a feature of Mu family members, suggests a peculiar mechanism for recognition of this element by the transposition apparatus.
We identified 11,809 putative genes encoding 13,197 transcripts. Considering genes that do not span a gap, the average gene size is 4.7 kb, typically with large introns (the average is 1,692 base pairs (bp)) and much smaller exons (the average is 217 bp). Moreover, the introns show a markedly skewed size distribution that has not been observed in other eukaryotes, whereby 5' introns are smaller than 3' introns (Fig. 2, Supplementary Information and Supplementary Table 5). In multi-exon genes, the first few introns can be as small as 26 bp, whereas introns towards the 3' end are typically kilobases in length (the largest is 33.8 kb). The reason for this is unclear but it suggests unusual transcriptional control. However, a survey of conserved transcription factor domains shows S. mansoni to be broadly similar to other eukaryotes (Supplementary Information, Supplementary Fig. 2 and Supplementary Table 6). It is noteworthy that 43% of transcription factor families with schistosome representatives also contained vertebrate sequences, nearly twice the number that matched nematode worms, emphasizing their evolutionary distance.
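The positional skew in intron size can be examined by averaging intron lengths at each ordinal position, counted either from the 5' end or from the 3' end of a transcript. The sketch below shows one way to do this. The input layout (one list of intron lengths per transcript, ordered 5' to 3'), the function name and the toy values are assumptions made for illustration and do not reproduce the analysis behind Fig. 2.

```python
from collections import defaultdict
from statistics import mean


def mean_intron_length_by_position(transcripts, from_end="5p"):
    """Average intron length at each ordinal position.

    transcripts: list of lists of intron lengths (bp), each ordered 5' to 3'.
    from_end:    "5p" counts positions from the 5' end, "3p" from the 3' end.
    Returns a dict {position: mean length}, where position 1 is the intron
    closest to the chosen end.
    """
    if from_end not in ("5p", "3p"):
        raise ValueError("from_end must be '5p' or '3p'")
    by_position = defaultdict(list)
    for introns in transcripts:
        ordered = introns if from_end == "5p" else list(reversed(introns))
        for position, length in enumerate(ordered, start=1):
            by_position[position].append(length)
    return {pos: mean(vals) for pos, vals in sorted(by_position.items())}


# Hypothetical toy transcripts: small 5' introns, large 3' introns.
toy = [[30, 40, 2000, 5000], [26, 1500, 3000], [34, 2500]]
print(mean_intron_length_by_position(toy, "5p"))
print(mean_intron_length_by_position(toy, "3p"))
```

Because transcripts differ in intron number, the deeper positions are averaged over fewer transcripts, which is why the 5' and 3' curves become hard to separate after the first few positions.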
Figure 2: Intron size distribution.

The length of introns varies according to their position in a transcript, counting from the 5' end (solid circles) and the 3' end (open circles). Mean lengths ± standard errors are shown. After about five introns, the length difference is no longer apparent owing to the variation in the number of introns per transcript (see Supplementary Information).
Micro-exon genes
At least 45 genes have an unusual micro-exon structure. Individual micro-exons have been described in other genomes, dispersed among several normal exons9. However, S. mansoni is notable in containing micro-exon genes (MEGs) in which micro-exons make up 75% of the coding sequence, are flanked at the 5' and 3' extremes by conventional exons, and have lengths that are multiples of three bases (from 6 to 36).
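The structural criteria above can be expressed as a small check on a list of exon lengths. The sketch below computes the fraction of coding bases that fall in micro-exons (6 to 36 bases, length divisible by three) and illustrates why the multiple-of-three property matters: removing such an exon leaves the downstream reading frame unchanged. The function names and the toy gene are hypothetical and are not the procedure used to annotate MEGs in this study.

```python
MICRO_EXON_MIN_BP = 6   # lower bound quoted in the text
MICRO_EXON_MAX_BP = 36  # upper bound quoted in the text


def micro_exon_fraction(exon_lengths_bp):
    """Fraction of coding bases located in symmetric micro-exons."""
    total = sum(exon_lengths_bp)
    if total == 0:
        raise ValueError("gene has no coding bases")
    micro = sum(
        n for n in exon_lengths_bp
        if MICRO_EXON_MIN_BP <= n <= MICRO_EXON_MAX_BP and n % 3 == 0
    )
    return micro / total


def frame_preserved_after_skipping(exon_lengths_bp, skipped_index):
    """True if skipping the given exon keeps the downstream reading frame."""
    return exon_lengths_bp[skipped_index] % 3 == 0


# Hypothetical gene: conventional first and last exons flanking micro-exons.
toy_gene = [40, 27, 15, 36, 9, 30, 24, 18, 33, 21, 12, 36, 27, 6, 30, 45]
print(micro_exon_fraction(toy_gene) > 0.75)
print(frame_preserved_after_skipping(toy_gene, 3))
```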
Other than having shared gene structure, no similarity could be detected between 14 MEG families (each with up to 23 members; Fig. 3 and Supplementary Table 7). Moreover, they showed no similarity with annotated genes from outside Schistosoma spp., nor any identifiable motifs or functional domains. Comparisons between MEG family members and related proteins from S. japonicum suggest that some gene duplication events preceded the divergence of the two species. Almost all encode a signal peptide at the 5' end and three have membrane anchors, so most are probably secreted. Examination of the large expressed-sequence tag (EST) data set from across the life cycle shows that genes from all MEG families are transcribed in the intramammalian stages of the life cycle, and the germ balls of daughter sporocysts that develop into infective cercariae, but probably not in miracidia that infect the snail intermediate host (Fig. 3).
Figure 3: Schematic representation of gene structure from MEG family members.

a, Structure of a representative member from each MEG family. Where several members were found, the total number detected is indicated in parentheses. Each box represents an exon drawn to scale, and the number above it indicates the exon size in nucleotides. For illustrative purposes, the introns are shown with fixed length. Black triangles and diamonds indicate exons encoding predicted signal peptides and transmembrane helices, respectively. Other characteristics associated with exons are indicated by colour and grouped as follows: micro-exons having lengths that either are multiples of 3 bp (red) or are indivisible by 3 bp (orange); exons longer than 36 bp and having lengths that either are multiples of 3 bp (blue) or are indivisible by 3 bp (green); putative initiation and termination exons (grey); untranslated region (UTR) (black). The asterisk indicates an exon deduced from transcript data, which did not match the sequenced genome. MEG-12 and MEG-13 structures were only partially predicted owing to the lack of transcripts containing the 5' end of these genes. b, PCR with reverse transcription (RT–PCR) or EST-based evidence of transcription (black box) for each family across different life-cycle stages. C, cercaria; E, egg; G, germball; M, miracidium; 3s and 7s, 3- and 7-day schistosomula; 21li and 28li, 21- and 28-day liver worms; 45a, 45-day adult worm pairs.
Sequencing of transcripts from three MEG families revealed the occurrence of several alternative splice variants formed by exon skipping. In one of the families analysed, all internal exons except those coding for the signal peptide were missing in at least one transcript sampled, and a gene from a second family presented different transcripts with extended exons produced by the use of alternative splicing sites. These observations suggest that a 'pick and mix' strategy is used to create protein variation.
Evolution of triploblasty, parasitism and tissues
Schistosomes are the first Platyhelminthes to be fully sequenced, and provide insights into the evolution of 'simple' animals. Using Treefam to make comparisons with the sea anemone Nematostella vectensis, a representative of the Radiata, we sought gene families restricted to, or expanded in, the Bilateria (Supplementary Table 8). The advent of a third germ layer in flatworms is paralleled by the expansion of genes encoding cell adhesion molecules such as cadherins. Similarly, tissue-patterning developmental cues (for example, Notch/Delta) and histone-modifying enzymes (for example, histone acetyltransferases) have proliferated. Some genes, such as the tetraspanins that encode membrane structural proteins, have greatly proliferated in schistosomes, suggesting a critical role in worm physiology/parasitism. The large array of paralogues for fucosyl and xylosyltransferases involved in the generation of new glycans expressed at the host–parasite interface may be important for subverting the immune system. The expansion of proteases in schistosomes also seems to be directly related to parasitism, because it includes families involved in host invasion (invadolysins) and blood feeding (cathepsins). Furthermore, G-protein-coupled receptors (GPCRs) show varying levels of contraction in schistosomes, whereas several classes (for example, peropsins) are greatly expanded in Nematostella, indicating functions associated with the free-living lifestyle.
Although schistosomes are acoelomate, they possess tissues approaching the sophistication of organs—such as gut, nephridia, nerve and muscle—that are concerned with discrete physiological processes, such as feeding, excretion and locomotion. However, as lophotrochozoans they are evolutionarily distant from the previously sequenced parasitic nematodes Brugia10 and Meloidogyne11, 12 (both ecdysozoans). Compartmentalisation of schistosome tissues and the formation of epithelial barriers are crucial for life in the hostile environment of the host bloodstream. Schistosomes possess the typical machinery of higher metazoa to interact with the cytoskeleton and control cell polarity (Supplementary Information and Supplementary Table 9), organize epithelia and denote tissue boundary lines.
S. mansoni possesses a nervous system that includes an anterior brain and longitudinal nerve cords, which extend from the brain to run the length of the worm body. Furthermore, a variety of sensory structures (at least six types in the cercaria13) are able to transduce a wide range of stimuli that assist in host location, penetration and navigation through the vasculature. In common with more complex organisms, schistosomes possess the tools needed to mediate neurogenesis and control axon growth cones and migration of neural cells (Supplementary Information and Supplementary Table 9), supporting the ancient origins of neural complexity.
Insights into possible new drug targets
Historically, anti-schistosomiasis agents were identified by in vivo screening in animal models. The S. mansoni genome project makes a more target-based approach to drug discovery feasible, and some promising leads have already emerged. These include a family of nuclear receptors14 (Supplementary Information) and a redox enzyme, thioredoxin glutathione reductase, recently validated as a drug target15. The condensed redox biochemistry of S. mansoni, relative to its human host, may offer further drug development targets (Supplementary Information). In the context of drug discovery, we have explored other potential areas of vulnerability, including: lipid metabolism, GPCRs, ligand- and voltage-gated ion channels, kinases, proteases and neuropeptides. We also undertook two bioinformatics-led approaches: metabolic reconstruction to identify chokepoints, and sequence searches for structures related to known drug targets.
Lipid metabolism
S. mansoni contains a full complement of genes required for most core metabolic processes, such as glycolysis, tricarboxylic acid cycle and the pentose phosphate pathway. However, schistosomes are incapable of de novo synthesis of sterols or free fatty acids and must use complex precursors from the host16. An extensive lipid-carrying protein repertoire could be identified, but despite producing precursors for fatty acid synthesis, fatty acid synthase could not be identified. An inability to use isoprene products of the mevalonate pathway probably accounts for the lack of sterol biosynthesis (Supplementary Table 11 and Supplementary Information). The genes necessary for a complete β-oxidation pathway are present, and this usually inactive pathway might operate in reverse to perform syntheses17. Despite constituting 40% or more of the lipid content of adult worms16, triacylglycerol has an uncertain role in the schistosome's life cycle: it is slow to turn over, does not contribute to the formation of other lipids16 and its use as an energy store is doubtful17. Nevertheless, S. mansoni possesses lipases capable of breaking down triacylglycerol, so it may have functions other than preventing excessively high concentrations of intracellular fatty acids16. Pathways responsible for synthesizing the phospholipid components of membranes are well represented, except that phosphatidylcholine must be derived from diacylglycerol18 and the parasite must depend on its host as a source of inositol.
GPCRs, ligand-gated and voltage-gated ion channels
GPCRs, ligand-gated and voltage-gated ion channels are targets for 50% of all current pharmaceuticals19. At least 92 putative GPCR-encoding genes are present (Supplementary Table 12), the bulk (82) of which are from the rhodopsin family. The largest groups are the α-subfamily (30), which includes amine receptors, and the β-subfamily (24), which contains neuropeptide and hormone receptors. The diversity of the former subfamily underlines the wide range of potential amine/neurotransmitter reactivities of schistosomes, but the tentative identities assigned need to be confirmed by functional studies, as has already been performed for a histamine receptor20. Schistosomes detect chemosensory cues, but a large, unique clade of the mediating receptors was not found. However, the 26 'orphan' rhodopsin family GPCRs may include proteins with this role. Outside the large rhodopsin family, representatives from each of the smaller families of GPCRs, glutamate family (2), frizzled family (3), and the secretin/adhesion family (4) are present.
Each of the three major ligand-gated ion channel families (the Cys-loop family, glutamate-activated cation channels, and ATP-gated ion channels) is represented in the schistosome genome. Of the 13 Cys-loop family ligand-gated ion channels, nine encode nicotinic acetylcholine receptor subunits (Supplementary Fig. 4 and Supplementary Table 13). The remaining four anion channel subunits group among GABA (γ-aminobutyric acid), glycine and glutamate receptors, but it is not possible to assign precise identities. The seven schistosome glutamate-activated cation channels comprise at least two sequences from each of the three common sub-groupings. The presence of a functional P2X receptor for ATP-mediated signalling in schistosomes was already known21, and the data here show at least four more.
Voltage-gated ion channels generate and control membrane potential in excitable cells, and are central to ionic homeostasis. There are examples of successful drugs targeting voltage-gated sodium, potassium and calcium channels22. Although voltage-gated sodium channels were not found, at least 41 members from each of the major six transmembrane (6TM) and four transmembrane (4TM) families of potassium channels (Supplementary Table 14) are present. The 6TM voltage-gated potassium channel family (20 members) is the largest, including the well-characterized Kv1.1 channel found in nerve and muscle of adult schistosomes23. Other classes of 6TM potassium channels include the KQT channels, large calcium-activated channels, small calcium-activated channels, and cyclic-nucleotide-gated groups. This last group, comprising eight members, is most often associated with signal transduction in primary olfactory and visual sensory cells (Caenorhabditis elegans has only five; ref. 24). S. mansoni possesses six 4TM inward-rectifying TWIK-related potassium channels (compared with about 46 in C. elegans). There are four α and two β subunits of voltage-gated calcium channels in schistosomes, and a β subunit is implicated as a molecular target of the anti-schistosomal praziquantel25.
The kinome
Protein kinases are important regulators of many different cellular functions. Both they and their inhibitors have entered the drug development pipeline in recent years26 but few schistosome kinases have been characterized to date. The S. mansoni genome encodes 249 kinases, including 22 genes with alternative splicing (Supplementary Information). This corresponds to 1.9% of the total coding proteins in the genome, a figure comparable to that found in other species27 (Supplementary Fig. 6). S. mansoni possesses representatives of all of the main kinase groups (Supplementary Fig. 7), the largest of which is the CMGC (cyclin-dependent kinases, mitogen-activated protein kinases, glycogen synthase kinase 3 and CK2-related kinases) group, in contrast to other analysed eukaryotic genomes. However, a single class (RCK) is absent from the CMGC family, a deficiency shared with yeast but not nematodes or mammals.
The least represented groups are the casein kinase (CK1) and receptor guanylate cyclase families with only seven and three members, respectively, contrasting with C. elegans, in which casein kinase is the largest group and receptor guanylate cyclase has 27 members. CK1 (and CMGC) group members that are expressed in sperm or during spermatogenesis in C. elegans are missing in S. mansoni.
The degradome
Proteolytic enzymes (proteases), making up an organism's 'degradome'28, operate in virtually every biological and pathological phenomenon29 and are proven drug targets in diverse biomedical contexts30, 31. All five major classes of proteases (aspartic, cysteine, metallo-, serine and threonine) are represented as various clans (mechanistically related groups) in the parasite genome (Supplementary Table 17). The percentage distribution of the major clans is generally similar to that of the human host with some notable exceptions, mainly owing to the expansion of constituent protease families in humans. Of the 73 protease families, 61 are found in humans and in S. mansoni, and 60 families are shared. With 335 sequences, proteases comprise 2.5% of the putative proteome (Supplementary Table 18), consistent with the proportion in other organisms (1–5%), but this is only one-third of that in humans (945 sequences, if A2 family retrovirus and retrotransposon proteases are included).
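The proportions quoted for the degradome can be checked with simple arithmetic. The sketch below assumes that the putative proteome is approximated by the 13,197 predicted transcripts reported earlier; that denominator is an assumption made here for illustration and is not stated explicitly in the text. The same check is applied to the 249 kinases for comparison.

```python
# Assumed proxy for proteome size: the number of predicted transcripts.
TOTAL_TRANSCRIPTS = 13_197

SCHISTOSOME_PROTEASES = 335
HUMAN_PROTEASES = 945
SCHISTOSOME_KINASES = 249


def percent_of(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


protease_share = percent_of(SCHISTOSOME_PROTEASES, TOTAL_TRANSCRIPTS)
kinase_share = percent_of(SCHISTOSOME_KINASES, TOTAL_TRANSCRIPTS)
parasite_to_human = SCHISTOSOME_PROTEASES / HUMAN_PROTEASES

print(round(protease_share, 1))     # 2.5
print(round(kinase_share, 1))       # 1.9
print(round(parasite_to_human, 2))  # 0.35
```

Under this assumption the computed shares agree with the 2.5% and 1.9% quoted for proteases and kinases, and the parasite-to-human ratio is close to one-third.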
The greatest difference between host and parasite is in the paucity of chymotrypsin-like S1 family enzymes in the latter (22 versus 135 human sequences). This reflects the evolution and diversification of family S1 for complex and highly regulated proteolysis cascades in vertebrates and some invertebrates, such as innate immunity, development, blood coagulation and complement activation32, 33, 34. From a therapeutic standpoint, the reduced complexity may prove valuable with fewer parasite proteases available for essential life-sustaining functions. For example, robust drug discovery programmes are in place for chymotrypsin-like S1 families35 and peptidase C14 (caspases)36, on which anti-schistosomal drug discovery could 'piggy-back'37. It is also notable that a smaller number of schistosome protease families (for example, C1, M8 and M13) have more members than the respective families in humans. C1 proteases are involved in nutrient digestion by the parasite, which contrasts with the S1 enzymes used in the host. This disparity has already been exploited for a promising anti-schistosome therapy38. One protease family (C83) is apparently unique to S. mansoni.
Apart from the degradome, but involved in its modulation, 34 protease inhibitors were found (Supplementary Table 19). Most of these are serine protease inhibitors belonging to families I2 (Kunitz-type) and I4 (serpins). Two inhibitors of cysteine proteases (cystatins39, 40) and two α-2-macroglobulin homologues (I39) were also identified, as were three inhibitor of apoptosis proteins (I32), one of which is highly expressed in adults, where it may function to regulate one or more of the four schistosome caspases.
Neuropeptides
Thirteen putative neuropeptides were identified (Supplementary Table 20), indicating that schistosomes may have much greater diversity than the two described previously. Apart from the neuropeptide Fs (NPFs), most are apparently restricted to the Platyhelminthes—their absence from humans making them a credible source of anthelmintic drug leads. The predicted product of npp-6 (the amidated heptapeptide AVRLMRLamide) resembles molluscan myomodulin, whereas the two NPP-13 peptides show 100% carboxy-terminal identity with vertebrate neuropeptide-FF-like peptides (peptides ending with a C-terminal sequence PQRFamide); neither of these has previously been reported in any non-vertebrate organism. The discovery of a second NPF (NPP-21b) as well as the known NPP-21a41 is reminiscent of the vertebrate neuropeptide Y (NPY) superfamily, and strengthens the argument that NPFs and NPYs have a common ancestry.
Metabolic chokepoints
A chokepoint analysis of metabolic pathways reconstructed from the S. mansoni genome was used to identify further targets. A total of 607 enzymatic reactions could be placed in pathways, and 120 of these enzymes were identified as chokepoints (Supplementary Table 21). The list of chokepoints includes many that are drug targets in other organisms as well as target reactions already characterized in S. mansoni, validating the approach (Supplementary Information). The list also contains new candidate targets and comprises approximately 1% of the S. mansoni proteome.
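The text does not spell out the chokepoint criterion. A commonly used definition, assumed here, is that a chokepoint reaction is the only reaction in the network that consumes a given substrate or the only one that produces a given product. The sketch below applies that definition to a toy network; the reaction names, metabolite names and data layout are hypothetical and do not represent the reconstructed S. mansoni pathways.

```python
from collections import defaultdict


def find_chokepoints(reactions):
    """Return the set of chokepoint reaction names.

    reactions: dict mapping reaction name -> (substrates, products),
               where both are iterables of metabolite names.
    A reaction is flagged if it is the sole consumer of any substrate
    or the sole producer of any product (assumed definition).
    """
    consumers = defaultdict(set)
    producers = defaultdict(set)
    for name, (substrates, products) in reactions.items():
        for metabolite in substrates:
            consumers[metabolite].add(name)
        for metabolite in products:
            producers[metabolite].add(name)

    chokepoints = set()
    for reaction_set in list(consumers.values()) + list(producers.values()):
        if len(reaction_set) == 1:
            chokepoints |= reaction_set
    return chokepoints


# Hypothetical toy network: R1 and R2 are alternative routes from A to B,
# whereas R3 is the only reaction that consumes B and the only one making C.
toy_network = {
    "R1": (["A"], ["B"]),
    "R2": (["A"], ["B"]),
    "R3": (["B"], ["C"]),
}
print(find_chokepoints(toy_network))
```

Inhibiting a chokepoint enzyme is expected either to starve the network of a unique product or to cause a unique substrate to accumulate, which is the rationale for treating such reactions as candidate drug targets.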
Chemogenomics screening
In the context of neglected tropical diseases and with constrained investment in drug discovery, piggy-backing37 or 'drug-repositioning' strategies42 that re-use existing drugs offer potential time-saving and cost benefits. We adopted a twofold strategy to find significant matches between proteins from the parasite and known 'druggable' protein targets of the human host and human-infective pathogens. Using conservative parameters of >50% sequence identity over >80% of the target, we first performed a similarity search against a database of targets curated from medicinal chemistry literature. This revealed 240 distinct S. mansoni transcripts with matches to targets against which there are high quality compounds (Supplementary Table 22). Given the need for short-course, oral therapies against schistosomiasis, this list was further reduced to 94 S. mansoni targets by filtering for potency and predicted bioavailability. A second search, against a database of the targets for human-directed drugs, showed 66 significant matches with pharmaceuticals marketed at present (Supplementary Table 23), corresponding to 34 S. mansoni targets (26, after representing multicopy genes as a single instance; Table 1). For instance, disulfiram, for controlling substance abuse, was highlighted as a potential anti-schistosomal drug; its anti-parasite properties have already been investigated43. Manual inspection of the list for compounds with side effects and toxicity can further refine choices, for example by eliminating the immunosuppressants cyclosporin and rapamycin. The remaining known drugs could be directly tested in animal models, and either applied unmodified in anti-schistosomal therapy, or could serve as leads for further optimisation. Widening the search beyond the initial strict criteria would expand opportunities; for example, topoisomerase 1 is retrieved below our initial threshold, at 71% identity but only 58% overlap.
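The strict similarity criteria can be illustrated as a two-threshold filter on search hits. The sketch below assumes each hit carries a percent identity and the percentage of the target sequence covered by the alignment. The `Hit` record, the identifiers and the first example row are hypothetical; only the thresholds and the topoisomerase 1 figures (71% identity, 58% overlap) come from the text.

```python
from dataclasses import dataclass

MIN_IDENTITY = 50.0  # percent identity threshold quoted in the text
MIN_COVERAGE = 80.0  # percent of target covered, quoted in the text


@dataclass
class Hit:
    query: str               # parasite transcript (hypothetical identifier)
    target: str              # known druggable target
    percent_identity: float
    target_coverage: float   # percent of the target spanned by the match


def passes_strict_filter(hit):
    """Apply the >50% identity over >80% of the target criterion."""
    return (hit.percent_identity > MIN_IDENTITY
            and hit.target_coverage > MIN_COVERAGE)


hits = [
    Hit("transcript_A", "target_A", 62.0, 91.0),         # invented example
    Hit("transcript_B", "topoisomerase 1", 71.0, 58.0),  # figures from text
]
for hit in hits:
    print(hit.target, passes_strict_filter(hit))
```

The topoisomerase 1 match fails on coverage despite its high identity, which shows how relaxing the overlap threshold alone would bring additional candidates into the list.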
Conclusion
A century after Louis Sambon first named the species in 1907 (ref. 44), the sequencing of the S. mansoni genome is a landmark event. The sequence provides the scientific community with several avenues to study this under-researched human pathogen, and will drive future evolutionary, genetic and functional genomic research. Not least, given that just one drug is widely available to treat schistosomiasis at present, the genome sequence, including the genome-mining analysis presented, offers the possibility that new drug candidates will soon be identified.
Methods Summary
Mixed-sex cercariae from the Puerto Rico isolate of S. mansoni45, released from infected Biomphalaria glabrata snails, were placed in low-melting agarose plugs and genomic DNA was prepared by standard methods. Approximately sixfold coverage of the nuclear genome was obtained using a whole-genome shotgun sequencing approach, in which libraries of different cloned insert sizes (in plasmid, fosmid and bacterial artificial chromosome (BAC) vectors) were randomly sequenced by Sanger technology from either end. Sequence reads were assembled, and scaffolds were FISH-mapped to individual chromosomes where possible (Supplementary Table 2). The outputs of several gene prediction algorithms, trained using 409 manually curated gene structures, were integrated into a single set of gene predictions (version 4), which was used for subsequent analyses. Data were accessed from GeneDB (http://www.genedb.org), and Artemis was used for manual annotation and curation of a further 958 genes during subsequent analyses (as described previously46).
Full methods accompany this paper.
