Identification of microsporidia host-exposed proteins reveals a repertoire of large paralogous gene families and rapidly evolving proteins

Pathogens use a variety of secreted and surface proteins to interact with and manipulate their hosts, but a systematic approach for identifying such proteins has been lacking. To identify these ‘host-exposed’ proteins, we used spatially restricted enzymatic tagging followed by mass spectrometry analysis of C. elegans infected with two species of Nematocida microsporidia. We identified 82 microsporidia proteins inside of intestinal cells, including several pathogen proteins in the nucleus. These microsporidia proteins are enriched in targeting signals, are rapidly evolving, and belong to large, Nematocida-specific gene families. We also find that large, species-specific families are common throughout microsporidia species. Our data suggest that the use of a large number of rapidly evolving species-specific proteins represents a common strategy for these intracellular pathogens to interact with their hosts. The unbiased method described here for identifying potential pathogen effectors represents a powerful approach for the study of a broad range of pathogens.


Introduction
Pathogens exploit hosts to promote their own proliferation. Viral, bacterial and eukaryotic pathogens control their hosts using effector proteins that interact directly with host molecules [1-3]. These effector proteins can be exported out of the pathogen into host cells or they can remain attached to the pathogen but with regions of the protein exposed to the host environment. These host-exposed proteins perform molecular functions that range from manipulation of host defenses to modulation of host pathways that can promote pathogen growth [4,5]. In many cases these proteins are evolving under diversifying selection, such that variation among these proteins can influence host survival [6-8]. Examples to date indicate considerable variation in the proteins that pathogens use to interface with their hosts. The conservation of these host-exposed proteins varies among different types of pathogens. Whereas most effectors of a strain of Pseudomonas syringae are present in other Pseudomonas strains and over 35% are conserved in other bacterial genera [9], fewer than 15% of predicted host-exposed proteins of Plasmodium falciparum are reported to be conserved among Plasmodium species [3].

Comprehensive identification of pathogen proteins that are host-exposed is challenging, because they need to be distinguished from proteins that are localized inside of pathogen cells. Several studies have addressed this problem by identifying proteins secreted from pathogens into culture media [10,11]. However, such studies potentially miss proteins that are only present in the native context. To circumvent this issue, a recent study chemically labeled proteins inside pathogenic bacteria and then identified those that were delivered inside of host cells [12].
Although powerful, this approach requires that a pathogen be both culturable and genetically tractable, and thus it is not generally applicable to many intracellular pathogens. Additionally, these approaches do not provide information on the subcellular localization of pathogen proteins within host cells. To address these limitations, we adapted spatially restricted enzymatic tagging for the study of pathogen host-exposed proteins. Spatially restricted enzymatic tagging is a recently developed approach for labeling proteins in specific subcellular locations. This approach uses the enzyme ascorbate peroxidase (APX) to promote biotin labeling of neighboring proteins, which can be subsequently purified and identified with mass spectrometry [13]. Here, we take advantage of this localized proteomics approach to identify host-exposed proteins from microsporidia that are localized in the intestinal cells of an infected animal.

Microsporidia constitute a large phylum of fungal-related obligate intracellular eukaryotic pathogens. The phylum contains over 1400 described species that infect diverse animals including nematodes, arthropods, and vertebrates, although individual species often have a narrow host range [14,15]. Dependent on their hosts for survival and reproduction, they have reduced genomes that lack several key regulatory and metabolic pathways [16,17]. Together these properties make microsporidia an excellent model of pathogen evolution. Despite the fact that microsporidia are of both medical and agricultural importance, tools for genetic modification of microsporidia are lacking and almost nothing is known about the proteins that enable interactions with their hosts [18]. Two potential targeting signals are known that could expose microsporidia proteins to the host.
These are N-terminal signal sequences that direct proteins for secretion [19], and transmembrane domains that could be used to attach proteins to the pathogen plasma membrane with regions of the microsporidia protein in direct contact with host molecules [20]. A number of studies have used these two targeting signals to predict the set of proteins encoded by pathogen genomes that are likely to be host-exposed [21,22]. However, it is unclear how accurate these approaches are at identifying such proteins in microsporidia, and these prediction methods do not distinguish proteins partially or wholly outside the microsporidia cell from those directed to internal membranes or compartments [13]. Although some host-exposed microsporidia proteins have been characterized, no comprehensive identification of such proteins has been carried out [23,24].

Several microsporidia of the genus Nematocida naturally infect C. elegans, a model organism that offers a number of advantages for the study of host-pathogen interactions [25,26]. Infection of C. elegans by N. parisii begins with spores being ingested and then invading host intestinal cells. N. parisii initially develops in direct contact with the cytoplasm as a meront, eventually differentiating into a transmissible spore form that exits the cell [27]. Although the infection reduces worm lifespan, infected animals can generate enormous numbers of spores before death, with a single worm able to produce over 100,000 spores during the course of the infection [25,28]. Using C. elegans, we now report the first unbiased identification of microsporidia host-exposed proteins inside of an animal. These identified proteins are enriched for rapidly evolving proteins and members of unique large gene families. We also find that these species-specific large families are common throughout microsporidia.
Using the properties we identified for host-exposed proteins in Nematocida, we analyzed 23 microsporidia genomes to predict potential host-exposed proteins, almost all of which were found to have no known molecular function. These results suggest that microsporidia use a set of lineage-specific, rapidly evolving proteins to interact with their hosts. This study provides a foundation for further functional characterization of host-exposed microsporidian proteins, and demonstrates the utility of proximity-labeling proteomic methods to broadly identify pathogen proteins localized within host cells.

Results
Experimental Identification of Nematocida host-exposed proteins
To identify microsporidia proteins that come into contact with the intracellular host environment we used the technique of spatially restricted enzymatic tagging [13]. This approach uses the enzyme ascorbate peroxidase (APX) to label proteins in the compartment where the enzyme is expressed, with a biotin handle for subsequent purification (Figure 1A). We generated strains of C. elegans expressing GFP-APX, either in the cytoplasm, or in the nucleus of intestinal cells (Figure 1B). We also generated a negative control strain that expresses GFP in the intestine, but without the APX protein (Table S1).

First, we inoculated these transgenic animals with N. parisii spores, which led to the majority of animals being infected (Figure S1). These animals were then incubated for 44 hours at 20°C to allow for growth of the parasite. Next, we added the biotin-phenol substrate and hydrogen peroxide to these animals to facilitate APX-mediated biotinylation of host and pathogen proteins proximal to the GFP-APX protein. Under these conditions we detected biotin-labeled proteins by microscopy in the intestinal cells of infected animals, but no labeling in the microsporidia cells themselves, demonstrating that the labeling technique is restricted to host-cell regions (Figure S2). Biotinylated proteins were isolated from total worm extracts using streptavidin-conjugated resin and these purified proteins were identified using mass spectrometry. Biotinylated proteins from infected animals were isolated in triplicate and over 4000 proteins from C. elegans and N. parisii were identified (Figure S3).

As validation that proteins were labeled in specific compartments in this experiment, we used the labeled C. elegans proteins as an internal control. By comparing spectral counts identified in the cytoplasmic APX, nuclear APX, and no APX strains, we identified 891 C.
elegans proteins specifically labeled in the intestine (Table S2). By comparing C. elegans proteins in the cytoplasmic and nuclear samples we identified 118 proteins specific to the nucleus and 114 proteins specific to the cytoplasm. We then compared these proteins to C. elegans proteins with previously reported localization. The sets of proteins we identified as either cytoplasmic or nuclear specific are each enriched for proteins known to be localized in the corresponding subcellular compartment (Figure S4A).

Comparing proteins from the cytoplasmic APX and nuclear APX samples to the no APX sample, we identified 72 N. parisii proteins that were enriched above background levels, as defined by the no APX strain (Table S3). To approximate the total microsporidia proteome detectable in our experiments, we identified 392 N. parisii proteins from the no APX control samples (see methods). We then compared these protein sets to previously generated RNAseq expression data [22]. The host-exposed proteins that we identified had moderate mRNA expression levels, with few detected from either the lowest or highest expressed mRNAs (Figure 2A). In contrast, proteins identified in the no APX control strain are among the most highly expressed mRNAs in the genome (Figure S5A). This result suggests that the host-exposed proteins we identified are not biased towards highly expressed proteins.

Compared to all proteins in the genome, the host-exposed proteins we identified were significantly enriched in both signal peptides and transmembrane domains: over 75% of the proteins identified (enrichment p-value of 6.6E-13) had at least one predicted targeting signal (Figure 2B). Neither the proteins identified from the no APX control, nor the identified C. elegans intestinal proteins are enriched for these targeting signals compared to the genome (Figures S4B and S5B).
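
The Figure 2 legend names a one-sided Fisher's exact test as the source of the enrichment p-values. As a rough illustration of how such a test behaves, the following minimal sketch computes the upper-tail hypergeometric probability using only the Python standard library. The function name and the counts in the usage line are hypothetical and are not taken from the paper; the authors' actual software is not stated in this section.

```python
from math import comb


def fisher_one_sided(hits_in_set, set_size, hits_in_genome, genome_size):
    """One-sided (enrichment) Fisher's exact test.

    Returns P(X >= hits_in_set), where X is the number of proteins carrying
    a property in a random draw of `set_size` proteins from a genome of
    `genome_size` proteins, `hits_in_genome` of which carry the property.
    """
    max_hits = min(set_size, hits_in_genome)
    total = comb(genome_size, set_size)
    tail = sum(
        comb(hits_in_genome, k) * comb(genome_size - hits_in_genome, set_size - k)
        for k in range(hits_in_set, max_hits + 1)
    )
    return tail / total


# Hypothetical counts for illustration only (NOT the paper's data):
# 55 of 72 identified proteins carry a targeting signal,
# against 700 of 2600 proteins in the genome.
p_value = fisher_one_sided(55, 72, 700, 2600)
print(f"enrichment p-value: {p_value:.2e}")
```

Exact integer arithmetic is used for the binomial coefficients, so the only rounding happens in the final division.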
Altogether, the results indicate that our spatially restricted enzymatic tagging technique identified a high-quality data set of N. parisii host-exposed proteins in C. elegans.

To investigate the subcellular localization of N. parisii host-exposed proteins, we compared proteins identified from animals expressing APX in the cytoplasm to those identified from animals expressing APX in the nucleus. From this comparison we found four proteins specific to the nucleus and eight proteins specific to the cytoplasm. Of the four nuclear specific proteins, three are predicted to have signal peptides, while all eight cytoplasmic specific proteins are predicted

Nematocida species [22,26,29]. We defined large gene families as groups of homologous proteins with at least ten members in one species that were enriched in signal peptides or transmembrane domains. We initially identified these families from paralogous orthogroups and then generated profile hidden Markov models to identify additional members in the genome.

There are four large N. parisii gene families that contain from 18 to 169 members. Two of these gene families, NemLGF1 and NemLGF5, encode signal peptides, and the other two gene families, NemLGF3 and NemLGF4, encode C-terminal transmembrane domains. The host-exposed proteins we identified are significantly enriched (p-value of 1.3E-16) in these families and contain 35 members of these four gene families, with at least one host-exposed protein in each of the four families (Figure 2B and C). The four nuclear specific proteins are members of the NemLGF1 or NemLGF5 gene family, whereas four of the cytoplasmic specific proteins with transmembrane domains belong to the NemLGF3 family (Table S3).

Identified Nematocida host-exposed proteins are clade specific
To investigate how the repertoire of N.
parisii host-exposed proteins is evolving, we explored whether the identified host-exposed proteins are conserved in three other Nematocida species. The earliest known diverging species of the genus is N. displodere, which proliferates well in the epidermis and muscle, but poorly in the intestine [26]. In contrast, the other Nematocida species are intestinal-specific [25]. Previously, the species known to be the most closely related to N. parisii was the intestinal-specific N. sp. 1 (strain ERTm2), which shares 68.3% average amino acid identity with N. parisii. To provide a more closely related species for comparison, we sequenced and assembled the genome of Nematocida strain ERTm5, an intestinal-specific strain that was isolated from a wild-caught C. briggsae in Hawaii [30]. This strain was previously described as a strain of N. parisii based on rRNA sequence, but based on our analysis, it now appears to define a new species (see methods). This genome is comparable in quality to other sequenced genomes as judged both by assembly statistics and the presence of proteins conserved throughout microsporidia (Table S4). This new species, Nematocida ironsii, now represents the closest known sister species to N. parisii and has an average amino acid identity of 84.7% compared to N. parisii (Figure S6 and Table S5). To examine conservation, each N. parisii protein was placed into an orthogroup using six eukaryotic and 23 microsporidian genomes. Every N. parisii protein was categorized into one of six classes of decreasing conservation: 1) N. parisii proteins conserved with other non-microsporidia eukaryotes, 2) conserved with other microsporidia, 3) conserved with N. displodere, 4) conserved with N. sp. 1, 5) conserved with N. ironsii, and 6) those that are unique to N. parisii (Figure 2D).
Using this evolutionary approach, we found that the set of host-exposed proteins we identified is significantly enriched (p-value of 1.9E-20) for less conserved proteins, with only 12% having orthologs outside of a group of closely related Nematocida species (N. sp. 1, N. ironsii and N. parisii, which we refer to as 'clade-specific'). In contrast, 63% of all N. parisii proteins in the genome have orthologs outside of this clade of Nematocida species (Figure 2E). Most of these identified proteins do not have a predicted molecular function, with only five of these 72 proteins containing a predicted Pfam domain (Figure 2B). To determine the rate of protein evolution, we calculated the protein sequence divergence between orthologous N. parisii and N. ironsii proteins. We found that the host-exposed proteins are rapidly evolving compared to the other proteins in the genome (Figure 2F).

To examine whether the properties of the host-exposed proteins we identified were conserved in other microsporidia species, we performed spatially restricted enzymatic tagging on C. elegans infected with N. sp. 1. Although we identified fewer C. elegans and microsporidia proteins from N. sp. 1 infected animals, we nonetheless found ten proteins enriched over background (Figure S3 and Table S6). These proteins have similar properties to those identified for N. parisii as they are enriched in targeting signals and clade-specific proteins (i.e. proteins not conserved in other eukaryotes, microsporidia, or N. displodere) (Figure S7). They also are enriched for being members of large gene families, including three members of NemLGF1 and one member of the N. sp. 1-specific family NemLGF6. We also identified two pairs of orthologs from the two species: hexokinase (NEPG_02043 and NERG_02003) and a NemLGF1 family member (NEPG_02370 and NERG_01049).
To expand this analysis to a different microsporidia genus, we examined data previously generated from germinated Spraguea lophii spores. We found that proteins identified as secreted from these germinated spores were also enriched in the properties of signal peptides and clade-specific proteins (Figure S8) [24].

Overall, we find that host-exposed proteins are highly enriched in three properties: 1) they have targeting signals (signal peptides or transmembrane domains), 2) they belong to large gene families, and 3) they are clade-specific. In fact, 85% of N. parisii host-exposed proteins identified are either members of large gene families, or are clade-specific proteins with a signal peptide or transmembrane domain (enrichment p-value of 1.7E-25) (Figure 2E). Although the number of proteins we identified with these properties is 61, the total number of proteins with these properties encoded by the genome is 713.

Current limitations of proteomic methods suggest that this approach will not result in the complete identification of all host-exposed microsporidia proteins. To estimate the sensitivity of this method we compared the identified C. elegans intestinal proteins to the total number of mRNAs expressed in the intestine [31]. We also compared the total number of detected N. parisii proteins to the number encoded by the proteome. From these comparisons we estimate that we identified between ~8-24% of potential host-exposed proteins. This would mean that the total host-exposed proteome encoded by N. parisii is on the order of 300-900 proteins, a range that encompasses the number of proteins in the genome that have the properties enriched in the experimentally identified host-exposed proteins.

Encephalitozoon species but no other species examined. Additionally, we identified four large gene families that were conserved throughout most microsporidia including two ricinB domain containing families [24].
All but one species examined has a large genus-specific family, demonstrating that large gene families are widespread throughout microsporidia.

Prediction of putative host-exposed proteins from other microsporidia genomes
We next investigated whether proteins that are not widely conserved in microsporidia share properties with the identified host-exposed proteins. We examined 23 microsporidian genomes to identify proteins that are not conserved with other eukaryotes or with distantly related microsporidia species. These clade-specific proteins are all significantly enriched in targeting signals compared to proteins conserved with more distantly related microsporidia or other eukaryotes (Figure 5A). This result is similar to what we found in our analysis of experimentally identified host-exposed proteins in Nematocida, and similar to a previous study of several microsporidian species [32].

Our analyses above indicated that the genomes of microsporidia contain two classes of proteins enriched in targeting signals: clade-specific proteins and large gene families. Most of the proteins (85%) we identified experimentally in N. parisii also display these characteristics. Based on these genomic signatures and our experimental results, putative host-exposed proteins for each species were predicted. These predictions of 11,675 proteins for 23 genomes are provided as a resource in Table S7. Although these characteristics alone may not be sufficient to direct proteins to become host exposed, these proteins likely represent a substantial portion of the host-exposed proteins that each species uses and provide an unprecedented set of candidates for future studies.
The potential host-exposed proteins account for 6-32% of the genome of each species. Interestingly, the number of predicted host-exposed proteins can vary even within closely related species, with E. cuniculi having almost twice as many predicted proteins as the other members of the genus (Figure 5B). The majority of these putative host-exposed proteins do not have a predicted molecular function, with only 7.4% having a predicted Pfam domain that occurs in proteins outside of microsporidia (Table S7).

These predictions of host-exposed proteins suggest that microsporidia employ a large number of proteins with novel domains to interact with hosts.
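
The genomic signature described in the Results (a member of a large gene family, or a clade-specific protein with a signal peptide or transmembrane domain) can be written as a simple Boolean rule. The sketch below is one reading of that description and is not the authors' pipeline. The `Protein` record, its field names, and the example identifiers are hypothetical, and the exact rule used to build Table S7 is not stated in this section.

```python
from dataclasses import dataclass


@dataclass
class Protein:
    """Hypothetical per-protein annotation record (field names are assumptions)."""
    protein_id: str
    in_large_gene_family: bool
    clade_specific: bool
    has_signal_peptide: bool
    has_transmembrane_domain: bool


def is_putative_host_exposed(protein: Protein) -> bool:
    """Large gene family member, OR clade-specific with a targeting signal."""
    has_targeting_signal = (
        protein.has_signal_peptide or protein.has_transmembrane_domain
    )
    return protein.in_large_gene_family or (
        protein.clade_specific and has_targeting_signal
    )


# Toy records for illustration only; identifiers are made up.
toy_proteome = [
    Protein("toy_0001", True, False, False, False),   # large-family member
    Protein("toy_0002", False, True, True, False),    # clade-specific + signal peptide
    Protein("toy_0003", False, True, False, False),   # clade-specific, no signal
    Protein("toy_0004", False, False, True, True),    # conserved, with signals
]
predicted = [p.protein_id for p in toy_proteome if is_putative_host_exposed(p)]
print(predicted)
```

Under this rule a conserved protein with a targeting signal is not predicted, which matches the text's emphasis on clade-specific proteins and large families rather than on targeting signals alone.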

Discussion
To understand how microsporidia interact with their hosts, we experimentally identified 82 host-exposed proteins from two Nematocida species. To identify these proteins, we employed an unbiased approach that labeled the host-exposed pathogen proteins inside of an intact animal. Attempts to validate these host-exposed proteins using orthogonal experimental approaches have not been possible due to our inability to raise specific antibodies against Nematocida proteins and the lack of genomic modification techniques for microsporidia [34]. Nonetheless, this approach was able to identify C. elegans proteins previously shown to be localized to the nucleus and cytoplasm, validating the specificity of the technique. This approach of tagging pathogen proteins based on their localization is likely to be useful in the study of other C. elegans pathogens as well as a general tool to examine putative pathogen effector proteins in a range of hosts.

A key feature of the identified host-exposed proteins is their enrichment in signal peptides and transmembrane domains. This enrichment suggests that these are the two major targeting signals that are used in Nematocida for proteins to become exposed to the host, as they are present in 76% of identified proteins. Such signals might be missed in the remaining proteins due to the lack of sensitivity of these prediction methods and the misannotation of the true N- and C-termini of Nematocida proteins [19,20]. The identified proteins could also be useful to discover potential secondary signals in the proteins that direct transmembrane and signal peptide containing proteins to become host exposed, rather than to other membranes inside microsporidia [35].

We found that large gene families are common within microsporidia, with 68 gene families from 23 microsporidia genomes being identified.
Although several of these families had been previously reported, here we provide a comprehensive identification of these gene families throughout microsporidia [24,26,36,37]. The majority of these large gene families have no known molecular function based on sequence similarity. One enticing possibility is that the expansion of these families is due to interactions with host proteins. In support of this possibility, a number of the gene families contain predicted domains that are known to mediate protein-protein interactions, including LRR and RING domains.

One intriguing characteristic of these large gene families is that they are either genus- or species-specific, with large lineage-specific expansions of these gene families across microsporidia. The

and laboratory studies demonstrating that the same strain of microsporidia can infect closely related host species [15,38,39]. We speculate this host diversity could drive the expansion of large gene families in microsporidia and that these large gene families may in turn influence the host range.
The majority of the host-exposed proteins we identified in N. parisii and N. sp. 1 were proteins not conserved with N. displodere or other microsporidia species. Although most of the proteins identified are not conserved, several conserved proteins were identified, including hexokinase, which we identified in both Nematocida species. Hexokinase was previously found to have predicted signal peptides in several microsporidia species and to be secreted from the microsporidian Antonospora locustae, providing experimental evidence that secreted hexokinase is a conserved feature of microsporidia [22,23,32]. There are also several large gene families that have members present in multiple microsporidia species. This observation suggests that although selective forces result in a host-exposed protein repertoire with many unique proteins for each microsporidia clade, there are some proteins conserved throughout microsporidia involved in host interactions.
A number of forces are likely to shape the repertoire of host-exposed proteins, including the selective pressure of the host and interactions with other pathogens. Many of the features of the host-exposed protein repertoire in microsporidia are similar to characteristics reported in the

that similar selective pressures can sculpt a host-exposed protein repertoire with related properties. In contrast, strains of the bacterium P. syringae are predicted to have fewer than 40 type III effectors and contain many effectors shared with other bacteria, many of which display evidence of horizontal gene transfer [9,40].

A striking result of our analysis is that a large number of experimentally identified and predicted host-exposed proteins do not have domains found outside of microsporidia. These host-exposed proteins are a potential source of novel biochemical activity, as the extreme selective pressures inflicted on pathogens by the host have been shown to result in unique molecular functions [41,42]. Interestingly, we also predict a large percentage of the microsporidia genome to be responsible for mediating host-pathogen interactions. This suggests that although microsporidia have the smallest known genomes of any eukaryotes, they somewhat paradoxically encode a substantial cadre of proteins for interacting with their hosts. Understanding how microsporidia use these proteins to mediate host interactions will provide insight into their impact on hosts and the constraints on evolution of a minimalistic eukaryotic genome.

AWR designed, conducted, and analyzed experiments and co-wrote the paper. KMB provided the N. ironsii genome sequence. EJB performed the mass spectrometry analysis and co-wrote the paper. ERT provided mentorship and co-wrote the paper.

Competing financial interests
The authors declare no competing financial interests.

Cloning and generation of C. elegans expressing APX
Soybean APX (W41F) was optimized for C. elegans expression using DNAworks to design primers [43]. These primers were annealed using a two-step PCR method and cloned into Gateway plasmid pDONR 221. Gibson cloning was then used to introduce GFP as an N-terminal fusion, and NES (LQLPPLERLTLD) and NLS (PKKKRKVDPKKKRKVDPKKKRKV) tags to the C-terminus of APX [44]. 1 kilobase (kb) of sequence upstream of the intestinal-specific gene spp-5 was used as a promoter and unc-54 as a 3' sequence. Multisite Gateway was used to combine these fragments into the plasmid pCFJ150 to generate targeting constructs. The MosSCI approach was used to generate single copy insertions by injecting unc-119 mutants from the EG6699 strain with these targeting constructs [45]. Each transgenic strain was backcrossed to the wild-type N2 strain 3 times and the homozygote was used in subsequent experiments.

Spatially restricted enzymatic tagging of microsporidia infected C. elegans
C. elegans strains that express GFP-APX either localized to the cytoplasm or nucleus, as well as a control GFP only strain were used (Table S1)

LC-MS/MS parameters
Samples were analyzed in triplicate by LC-MS/MS using a Q-Exactive mass spectrometer (Thermo Scientific, San Jose, CA) with the following conditions. The following is a generalized nHPLC and instrument method that is representative of individual analyses. Peptides were first separated by reverse-phase chromatography using a fused silica microcapillary column (100 μm ID, 18 cm) packed with C18 reverse-phase resin using an in-line nano-flow EASY-nLC 1000 UHPLC (Thermo Scientific). Peptides were eluted over a 100 minute 2-30% ACN gradient, followed by a 5 minute 30-60% ACN gradient, a 5 minute 60-95% gradient, with a final 10 minute isocratic step at 0% ACN for a total run time of 120 minutes at a flow rate of 250 nl/min. All gradient mobile phases contained 0.1% formic acid. MS/MS data were collected in a data-dependent fashion using a top 10 method with a full MS mass range from 400-1800 m/z, 70,000 resolution, and an AGC target of 3e6. MS2 scans were triggered when an ion intensity threshold of 1e5 was reached with a maximum injection time of 60 ms. Peptides were fragmented using a

FDR below 1%. Peptides were assembled into proteins using maximum parsimony and only unique and razor peptides were retained for subsequent analysis. Peptide spectral count data were mapped onto the assembled proteins and used for subsequent analysis.

Analysis of mass spectrometry data
The peptide spectral counts of proteins were used to calculate fold change ratios and FDR p-values between GFP only, NES, and NLS samples using the qspec-param program of qprot_v1.3.3 [48]. Several criteria were used to classify proteins as host-exposed: no counts in the GFP only samples and an average of greater than 2 peptides in the NES samples or an NES/GFP ratio greater than 2-fold with an FDR p-value of less than 0.005. Additionally, proteins with an NLS/GFP ratio of greater than 3-fold were included. Proteins were classified as NLS-enriched if they had a greater than 2-fold NLS/NES ratio and NLS-depleted if they had a greater than four-fold NES/NLS ratio. All data for N. parisii proteins are in Table S3 and for N. sp. 1 proteins in Table S6. C. elegans intestinal proteins were detected in the same way as described above and data are in Table S2. N. parisii proteins in the no APX sample were required to have an average of greater than 2 peptides in the GFP only sample.
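
The classification criteria above can be sketched as two small functions. This is a hedged reading and not the authors' script. It assumes the criteria group as (no GFP-only counts AND mean NES count > 2) OR (NES/GFP > 2-fold AND FDR < 0.005) OR (NLS/GFP > 3-fold), and that the fold-change ratios and FDR values are computed beforehand (by qprot in the paper) and passed in as plain numbers. All function and parameter names are hypothetical.

```python
from statistics import mean


def is_host_exposed(gfp_counts, nes_counts, nes_gfp_ratio, nes_gfp_fdr, nls_gfp_ratio):
    """Apply the host-exposed criteria under the grouping assumed above.

    gfp_counts / nes_counts: per-replicate spectral counts for one protein.
    The ratio and FDR arguments are assumed to be precomputed elsewhere.
    """
    absent_in_control = sum(gfp_counts) == 0 and mean(nes_counts) > 2
    enriched_in_nes = nes_gfp_ratio > 2 and nes_gfp_fdr < 0.005
    enriched_in_nls = nls_gfp_ratio > 3
    return absent_in_control or enriched_in_nes or enriched_in_nls


def localization_class(nls_nes_ratio):
    """Classify by NLS/NES ratio; NES/NLS > 4 is the same as NLS/NES < 0.25."""
    if nls_nes_ratio > 2:
        return "NLS-enriched"
    if nls_nes_ratio < 0.25:
        return "NLS-depleted"
    return "unclassified"


# Toy triplicate counts for illustration only (not data from the paper).
print(is_host_exposed([0, 0, 0], [3, 4, 2], 1.0, 1.0, 1.0))
print(localization_class(3.0))
```

The label "unclassified" for proteins meeting neither localization threshold is an assumption of this sketch; the text does not name that group.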

Microscopy of infected C. elegans
To detect biotin labeling in infected worms, intestines were dissected and stained with anti-GFP (Roche) and Streptavidin Alexafluor 568 (Thermo Fisher). Images were taken using a Zeiss LSM700 confocal microscope with a 40x objective. To detect microsporidia in infected worms, fluorescence in situ hybridization with probes specific for microsporidia was performed as previously described and imaged with a Zeiss AxioImager M1 microscope [49].

The genome was assembled and annotated as done previously [26]. Although ERTm5 was previously considered to be a strain of N. parisii based on 100% nucleotide identity of 18S ribosomal RNA sequences [30], the average nucleotide identity across the genome between N. parisii strain ERTm1 and ERTm5 is 92.3%, which was calculated using the nucmer program in mummer 3.23 [50]. The two strains are more dissimilar than the generally accepted threshold of 95% average nucleotide identity, below which microbes are considered different species [51]. Because of this we consider strain ERTm5 to be a separate Nematocida species. Because the strain ERTm5 was

Table S4. Annotation of N. ironsii proteins is in Table S5. Conservation of proteins for each

Phylogenetic trees were inferred for each family using RAxML 8.2.4 [54] using the PROTGAMMALG model and 1000 bootstrap replicates. For NemLGF1, an initial tree was generated using 10 bootstrap replicates and then divided into 7 sub trees. Orthologs of N. parisii proteins in each family were manually assigned using these maximum likelihood trees. To determine the genomic location of these families the 5 largest scaffolds of N. parisii (ERTm1) were used. Chromosomal ends were defined as the first and last 30 kb of each scaffold. Adjacent proteins were calculated as where the next protein was next to it.

Determination of conservation
For N. parisii proteins, conservation was determined based on orthogroups, except for the large gene families NemLGF1 and NemLGF2-4, for which orthology was determined as described above. The following procedure was used to place the N. parisii proteins into 6 categories. If an N. parisii protein was in any group with a protein from the 6 non-microsporidian eukaryotic species, the protein was placed in the category "Eukaryotes". Any remaining unassigned protein that was in a group with a protein from a microsporidia species outside the genus Nematocida was placed in the category "microsporidia". Any remaining unassigned protein that was in a group with an N. displodere protein was placed in the category "N. displodere". Any remaining unassigned protein that was in a group with an N. sp. 1 protein was placed in the category "N. sp. 1". Any remaining unassigned protein that was in a group with an N. ironsii protein was placed in the category "N. ironsii". The remaining proteins were placed in the category "N. parisii".
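The stepwise assignment above amounts to a first-match cascade from the most broadly conserved category to the most restricted. The sketch below is illustrative only: the source labels (`"eukaryote"`, `"other_microsporidia"`, and so on) and the function name are hypothetical, and the input is assumed to be the set of source labels of all proteins that share a group with a given N. parisii protein.

```python
# Category order follows the text, from broadest to narrowest conservation.
# The right-hand labels are hypothetical identifiers for where a group
# member comes from; they are not taken from the source.
CASCADE = [
    ("Eukaryotes", "eukaryote"),
    ("microsporidia", "other_microsporidia"),
    ("N. displodere", "N_displodere"),
    ("N. sp. 1", "N_sp1"),
    ("N. ironsii", "N_ironsii"),
]


def conservation_category(group_sources):
    """Return the conservation category for one N. parisii protein.

    `group_sources` is the set of source labels of the other proteins
    grouped with it. The first matching step in the cascade wins.
    """
    for category, label in CASCADE:
        if label in group_sources:
            return category
    # No match at any step: the protein is treated as unique to N. parisii.
    return "N. parisii"


example = conservation_category({"N_ironsii", "N_displodere"})
```

Because the first match wins, a protein grouped with both N. displodere and N. ironsii proteins is assigned to "N. displodere", which matches the "remaining unassigned" wording of the procedure.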

To predict host-exposed proteins, the conservation of microsporidia proteins was determined. Proteins of each species were placed into two classes, "conserved" or "clade-specific". If a protein was in the same group as a protein from any of the eukaryotic or microsporidia species, it was classified as "conserved"; otherwise it was classified as "clade-specific". For closely related species, proteins from species in the same clade were not considered in this comparison.

Proteins for microsporidia genomes were placed into orthogroups as described above. Proteins from one-to-one orthologs of the two N. parisii strains (ERTm1 and ERTm3) and N. ironsii were aligned using MUSCLE 3.8.31 [53]. For large gene families, orthologs were determined as described above. For proteins conserved with N. sp. 1, the evolution rate was only calculated for one-to-one orthologs between the 5 genomes. For proteins conserved with N. displodere, the evolution rate was only calculated for one-to-one orthologs between all 6 Nematocida genomes. Maximum likelihood trees were built using ortholog sets (three sequences per set) of aligned protein

cytoplasmic APX (green circular sectors) labeling microsporidia host-exposed proteins with biotin (red B). B. Animals expressing GFP-APX in the intestine localized to either the cytoplasm (top) or the nucleus (bottom).

Figure 2. Properties of experimentally identified N. parisii host-exposed proteins. A. Comparison of mRNA expression levels of identified host-exposed proteins (orange dots) to the rest of the expressed N. parisii proteins (blue dots). Expression data are from a previous RNA-seq study on animals infected for 30 hours at 25°C [49]. B, E. Comparison of identified host-exposed proteins (orange) to the genome (blue). Enrichment fold change and p-values (one-sided Fisher's exact test) of the host-exposed proteins compared to the genome are listed below each category. B. Properties of 72 N. parisii host-exposed proteins. The percentage of the N. parisii genome and the percentage of the host-exposed proteins in each category are shown. TM, transmembrane. SP, signal peptide. C. Left, model of where identified large gene family proteins are localized. Right, the number of proteins of each gene family identified as host-exposed, with the total number of gene family members present in the genome shown in parentheses. D. Schematic of the categorization of N. parisii proteins by conservation class. The 2661 proteins in the genome were placed into 6 classes of decreasing conservation, from proteins conserved with eukaryotes to proteins unique to N. parisii. E. Percentage of the genome and host-exposed proteins in each conservation class. F. Distribution of protein sequence divergence between N. parisii and N. ironsii one-to-one orthologs. The genome contains 2083 orthologs that met our criteria and the host-exposed proteins contain 49 orthologs (see Methods). The percentage of the identified host-exposed proteins (orange) and the genome (blue) is plotted. A Wilcoxon two-sample test comparing sequence divergence of orthologs in the genome to the host-exposed proteins gives a p-value of 6.8E-11.

of each gene family were determined using HMMER, except for those indicated with an *, which were determined using OrthoMCL. The second column indicates the targeting signal that is overrepresented within the indicated gene family. SP, signal peptide; TM, single transmembrane domain; MTM, multitransmembrane domain. Each box in columns to the right of the gene family name is colored according to the total number of members within a given gene family.

Figure 5. Prediction of host-exposed microsporidia proteins. A. Clade-specific microsporidia proteins (orange) are enriched in signal peptides/transmembrane domains compared to conserved proteins (blue).
Enrichment p-values (one-sided Fisher's exact test) are listed in parentheses below each species. B. The number of proteins in each microsporidia genome that are predicted to be host-exposed proteins (orange), compared to the rest of the genome (blue).