Local admixture of amplified and diversified secreted pathogenesis determinants shapes mosaic Toxoplasma gondii genomes

Toxoplasma gondii is among the most prevalent parasites worldwide, infecting many wild and domestic animals and causing zoonotic infections in humans. T. gondii differs substantially in its broad distribution from closely related parasites that typically have narrow, specialized host ranges. To elucidate the genetic basis for these differences, we compared the genomes of 62 globally distributed T. gondii isolates to several closely related coccidian parasites. Our findings reveal that tandem amplification and diversification of secretory pathogenesis determinants is the primary feature that distinguishes the closely related genomes of these biologically diverse parasites. We further show that the unusual population structure of T. gondii is characterized by clade-specific inheritance of large conserved haploblocks that are significantly enriched in tandemly clustered secretory pathogenesis determinants. The shared inheritance of these conserved haploblocks, which show a different ancestry than the genome as a whole, may thus influence transmission, host range and pathogenicity.

Supplemental Figure 6 Strains with aneuploidy or large duplicated regions. Select chromosomes are shown for Individual strains where 10 or more genes with CNV occur continuously. 1X genes (black), genes with > 2 X standard deviation (SD) CNV (yellow -borderline CNV), genes with > 3 X standard deviation (SD) CNV (red). Figure 7 Distribution of orthologous genes by species. Total number of genes for each category found in each of T. gondii (ME49 strain), H. hammondi (HHa) or N. caninum (NCLIV). Based on OrthoMCL clustering.

Supplemental Figure 8 Conserved genes among closely related tissue cyst forming coccidia.
A) Number of shared and species-specific (unique) clusters of orthologous genes as defined by OrthoMCL. The number of clusters of orthologous genes shown for T. gondii represents those found in all 16 references strains. B) Analysis of species-specific genes as predicted by OrthoMCL. Orange bars (Blastp hit) represent species-specific genes that identify orthologs in the proteomes of the other two species with a blastp e-value ≤ 1x10 -10 and 50% coverage of the shortest sequence. Blue bars (Full Length) depict species-specific genes that map as syntenic hits to the genomes of the other two species with at least 70% coverage using either GMAP or tblastn (e-value ≤ 1x10 -5 ). These later two categories are likely distant orthologs that are classified as separate groups by OrthoMCL. Turquoise bars (Alternative Gene Model) represent the same as blue bars but for species-specific genes that map with less than 70% coverage or contain non-sense mutations or frame-shifts in the target genome. Gray bars (Unique) depict species-specific genes that do not match the proteomes or genomes of the other two species. Tg, T. gondii; Hh, H. hammondi; Nc, N. caninum.
Supplemental Figure 8 Conserved genes among closely related tissue cyst forming coccidian continued. C) Comparative gene expression for orthologous and predicted species-specific genes in N. caninum and T. gondii. For genes were there is a 1:1:1 orthologous correspondence between N. caninum, T. gondii and H. hammondi there was a good correlation of RNA-expression levels, based on RNA-seq data, between N. caninum and T. gondii (black dots). However for genes that were originally flagged as "unique" by OrthoMCL but for which there are annotated genes in both N. caninum and T. gondii, the transcript levels did not correlate well (red dots). This suggests that these genes encode functionally different products and are not the result of assembly errors or inaccurate gene models. D) RNA-seq analysis monitoring gene expression of putative species-specific genes in T. gondii. Plot shows the fraction of T. gondii genes from each of the four categories depicted in B that are supported (+) or not (-) by RNA-seq data. Potential T. gondii specific genes are defined by comparison to H. hammondi (left) and N. caninum (right). The expectation is that if unique and alternative gene models predicted in T. gondii were artifactual, they would be enriched in regions that lack RNA-seq data, compared to genes where there was evidence for orthologues being present (Full Length and BLASTP). Comparison of the genes in these different categories in regions with (+) and without (-) RNA-seq data showed no significant differences for any of the four gene categories (P > 0.4, Fisher's Exact Test). This result suggests that most unique genes and alternative gene models are not likely the result of artifactual gene predictions. E) To assess if the high number of alternative gene models between T. gondii and H. hammondi was the product of genome sequencing errors we mapped genome sequencing reads to their respective T. gondii ME49 strain and H. hammondi genomes and calculated the maximum alternative allele frequency (MAAF) score within regions of putative species-specific genes. MAAF scores were defined as the maximum alternative allele frequency detected for SNPs and INDELs within a defined genomic region (see methods). It is expected that regions containing sequencing errors will present MAAF scores of ≥ 40%. This analysis revealed that only a small fraction (< 10 regions) of the loci containing potential alternative gene models in T. gondii (left) and H. hammondi (right) have MAAF scores ≥ 40%. This result suggests that frame-shifts and missense mutations found in alternative gene model loci are not the result of low sequencing quality in these regions. F) Analysis of sequence coverage. More than 90% of the loci hit by alternative gene models in T. gondii and H. hammondi have a minimum sequencing depth higher than 5X. Left, T. gondii genomic regions similar to H. hammondi alternative gene model genes; right, H. hammondi genomic regions similar to T. gondii alternative gene model genes. These results also support the conclusion that alternative gene models are not due to poor sequencing quality. Figure 9 Population structure inferred by Admixture analysis. A) Cross validation (CV) plot was drawn for the whole genome SNPs data of T. gondii. Plot displays the CV error versus K values. CV errors for this dataset suggest K = 6 is the best fit. B) Admixture clustering analysis of T. gondii closely resembles the population clustering inferred by Neighbor-net analysis. Each individual in the plot is represented by a vertical stacked column of proportional genetic components of shared ancestry for K = 5 through K = 7. Each color represents each ancestral population. Previously designated haplogroups are indicated as HG. Supplemental Figure 11 Chromosome painting of 62 Toxoplasma gondii strains with local admixture analyses. Local admixture analyses were conducted on SNP blocks of size 1,000 bp on each of the 14 chromosomes. For each SNP block, local admixture assigned strains to a particular ancestral population. Sequences with the same color have high similarity, although this is not meant to imply origin. This analysis also revealed patterns of local admixture that suggest the occurrence of genetic crosses among strains of different clades, likely favored by their geographic proximity. For example, clade A strains TgCkBr141, TgCkCr10, TgRsCr1 and TgCtCo5 harbor a high number of small haploblocks that are shared with clade F strains (see purple bands for TgCkBr141, TgCkCr10, TgRsCr1 and TgCtCo5 strains and pink bands for clade F strains. Interestingly, all clade F strains were isolated from French Guiana while the aforementioned clade A strains were isolated from the nearby locations of Pará State, north of Brazil (TgCkBr141), Costa Rica (TgCkCr10 and TgRsCr1) and Colombia (TgCtCo5) (Supplementary Data 1). A similar high proportion of shared haploblocks occurs among several Brazilians strains from clades A (TgCtBr26, TgCtBr9 and TgCtBr72), B, and C (TgCtBr15, P89, and TgCtBr3)(see gray bands). Supplemental Figure 16 Neighbor-net trees for conserved (A) vs. non-conserved (B) regions of the genome based on total SNPs for the 62 strains. Reference strains for haplogroups (numbers indicated in parentheses) are show in pink. A number of strains group differently in the nonconserved network compared to the conserved network: stains VEG, M7741, and TgShUs28, which are part of clade C, realigned to clade D (blue arrow). Strains TgCtBr9 (6) and TgCtBr72, which are part of clade A, regroup to form part of clade B (red arrow). Finally, strains TgCtPRC2 (1), COUG (11) and GUY-2004-JAG1, which are part of clade D, regroup to a separate long branch (green arrow). Networks were generated as described in Figure 4a. Scale = number of SNPs per site. Figure 17 Phylogenetic trees for SPD and non-SPD genes within conserved regions for Clades C, D, and F. Trees were generated with 100 bootstraps and support values are shown at the branches. Trees generated from SPD genes closely resemble those of the non-SPD genes, reflecting their common, recent shared ancestry. Trees were generated using RMxML with the GTR+GAMMA model as further described in the methods.

Propagation of strains and isolation of gDNA
Sixty two representative strains of T. gondii were selected from different haplogroups from around the world (Dataset 1) 1 Table   1).

Functional annotation of ME49 protein-coding genes
Predicted peptides from ME49 working models were run through JCVI's autonaming pipeline, that assign product names based on a number of sequence similarity searches whose results are ranked according to a priority rules hierarchy. These analyses included Blastp searches against the previous T. gondii ME49 proteome and the GenBank non-redundant protein database, HMM searches against Pfam and TIGRfam 16 databases, and RPS-Blast searches against the NCBI-CDD database 17 .
Proteins without any significant hit to other proteins or protein domains were flagged as "hypothetical protein". The final list of product names was then curated by researchers from the Toxoplasma research community before being assigned to working models.
To predict the possible subcellular localization of predicted peptides, potential signal peptides and transmembrane domains were predicted with signalP and TMHMM programs respectively 18 .
Enzyme Commission (EC) numbers were assigned with PRIAM 19 and curated based on annotations deposited by the scientific community in ToxoDB and those kindly provided by Dr. John Parkinson.
Gene ontologies were annotated based on pfam2go and ec2go mappings, annotations from the ApiLoc database (http://apiloc.biochem.unimelb.edu.au) and those curated by the research community in ToxoDB.

Assignment of gene pub_locus identifiers
ME49 protein-coding genes from the previous genome assembly inherited a modified version of their pub_locus identifiers by the addition of 100,000 to their number suffixes (for example, pub_locus TGME49_012345 became TGME49_112345 in the new assembly). Annotated tRNAs, rRNAs and newly predicted protein-coding genes were assigned completely new pub_locus identifiers. For the annotation of the remaining T. gondii reference genomes and H. hammondi, protein-coding genes that could be mapped from the newest ME49 annotation inherited their pub_locus suffix numbers while the rest of the genes acquired new pub_locus identifiers. Partial genes that based on the evidence were split into two or more fragments on different contigs kept the same pub_locus suffix number followed by a letter, different for each fragment.

Domain Identification and characterization of T. gondii novel gene families
To identify known protein domains the T. gondii ME49 proteome was searched against Pfam and TIGRfam HMM profiles using HMMER3 20  Affymetrix Array data available from NCBI GEO records GSE32427 15 and GSE51780. Data were normalized as described in 24 .

Annotation of rRNA and tRNA genes
Sequences encoding for 5.8s, 18s and large rRNA subunits were extracted from GenBank entries L25635.1 and L37415.1, aligned to the new assembly with Nucmer 3 and automatically annotated with an in-house perl scripts. Transfer RNA encoding genes were predicted with tRNAscan-SE 25 .

Annotation of other T. gondii reference strains
T. gondii ME49 genes were mapped at high and medium stringency to each T. gondii reference assembly with MUMmer and PASA respectively. Genes that mapped without errors with either method were promoted to working gene models while those that failed entered a second round of mapping with EVM. Briefly, coding sequences from failed genes were mapped to the target assembly with gmap 26 . Protein sequences from GenBank NR and Pfam seed databases and from the reannotated ME49 proteome were aligned to contigs with NAP and all RNAseq transcripts assembled with Trinity were mapped with PASA. Ab-initio gene predictions were performed with GlimmerHMM, Genezilla and two runs of Augustus, one of them using gmap-aligned genes as hints. Gene predictions, protein alignments and transcriptomic evidence were then integrated by EVM to annotate remaining working genes on the genome sequences. Lastly, annotated gene structures were updated with PASA based on aligned transcripts and manually inspected.
Functional annotation of protein-coding genes was performed following the same approach as with ME49 with the only exception that genes syntenic to ME49 annotations inherited their product names, Gene ontology (GO) terms and EC numbers while non-syntenic genes acquired their names and other functional annotations from the output of JCVI's autonaming pipeline, PRIAM, ec2go and pfam2go mappings.  Table 1).

Annotation of the H. hammondi assembly
Structural and functional annotations were carried out following a similar approach as the one described for other T. gondii reference genomes. In this case, however, GeneZilla, GlimmerHMM and Augustus were trained with a training set composed of 546 H. hammondi genes that were manually annotated and whose structures were both conserved in T. gondii and full-length supported by T. gondii transcripts. In addition, Augustus was run without hints or using evidence from T. gondii RNAseq/EST assembled transcripts or ME49 coding sequences.

Annotation of the apicoplast genomes from T. gondii reference strains and H. hammondi
Structural and functional annotation of the apicoplast rRNA and protein-coding genes were done manually using as reference the annotation of the apicoplast genome from the T. gondii RH strain (NC_001799.1). Prediction of apicoplast tRNA genes was performed with tRNAscan-SD.

Determination of Pfam domain abundance in coccidians
The proteomes from T. gondii ME49, H. hammondi, N. caninum, S. neurona and E. tenella were queried against the Pfam HMM database using HMMER3. Non-overlapping HMM hits above the trusted cutoff spanning at least 50% of the HMM and representing the best hit (lowest e-value) per protein region were selected for further quantification. For every genome, the abundance of each Pfam domain was then estimated as the number of protein sequences having at least one significant hit against a particular Pfam HMM.

Grouping GO annotations of orthologous groups
Clusters of orthologous groups present in the five coccidian genomes E. tenella, S. neurona, N.
caninum, T. gondii, and H. hammondi were identified using OrthoMCL as defined above. These groups were functionally annotated using GO Slim terms, which are designed to group the many different GO terms into smaller groups of related processes (http://geneontology.org/page/go-slimand-subset-guide). First, the list of GO terms associated with the proteins in the five organisms was inferred from their Pfam assignments, as described above. These terms were mapped onto the smaller-and-broader subset of generic GO Slim terms using map2slim script of the go-perl package, which maps a set of annotations up to their parent GO Slim terms. The GO Slim terms are mapped onto the OrthoMCL groups, and only groups containing the same annotation in >65% of its annotated members are considered.

Mapping non-synonymous SNPs and number of paralogs to iCS382 metabolic model of T.
gondii SNPs from the 16 T. gondii reference strains that correspond to the 382 proteins in the iCS382 metabolic pathway reconstruction of T. gondii ME49 were downloaded from ToxoDB (http://www.toxodb.org/toxo-release4-0/home.jsp) 32 . SNPs were filtered using the following criteria: >80 % of the 16 strains contain an allele (major or minor) that is present in >80% of reads. Further, we considered only non-synonymous coding SNPs (missense and nonsense SNPs) that were present in at least 2 strains to negate the effects of random genetic drift. These criteria resulted in 2,697 nsSNPs. A normalized nsSNP score was generated for each EC number in iCS382 based on the following formula: The number of paralogs for an EC was calculated for all 382 proteins from the orthologous groups of 16 reference T. gondii strains + N. caninum + H. hammondi generated using OrthoMCL at an inflation parameter of 1.4.
Where Otot = number of orthoMCL groups containing a ME49 protein in iCS382 with this EC.

SNP identification
Illumina reads for each of the 61 other genomes were aligned using Bowtie2 --end-to-end 34 against the ME49 reference genome assembly (release date 2013-04-23). Reads were realigned around gaps using the GATK toolkit 35 and a pileup file was generated using samtools 36 . VarScan 37 pileup2snp was used to make SNP calls with -min-coverage 5, -p-value 0.01 and -min-var-freq 0.8 (80% read consistency). These same parameters were utilized to make like-reference calls for each strain at every position in the ME49 genome where any strain had a SNP call. This identified a total of 2,342,433 SNPs across all strains. Positions with informative base calls for all 62 strains were identified, generating a final list of 802,764 SNPs that were used for analysis.

Network and principal components analyses
Genome wide single nucleotide polymorphism (SNPs) were saved as FASTA files and directly incorporated into SplitsTree v4.4 38 to generate unrooted phylogenetic networks using a neighbor-net method and 1,000 bootstrap replicates. Principal components analysis (PCA) was performed by eigenanalysis of a coancestry matrix implemented in fineSTRUCTURE, as described 39 .

Chromosome Ia analysis
SNP data for ChrIa were plotted as a minimum spanning tree using SplitsTree v4.4 38 with 2,000 spring embedded iterations. The 62 strains were clustered into four major groups denoted as monomorphic, divergent, 5'-chimeric, and 3'-chimeric chromosome Ia. SNPs present in each cluster were calculated using a custom script over a 10 kb moving window and plotted using Excel.

Admixture analysis
The population genetic structure of T. gondii was determined by an unsupervised clustering algorithm, ADMIXTURE 40 with ancestral clusters set from k = 1 through 10. The number of ancestral clusters k was determined by estimating the low cross-validation error (CV-error) for different k values using 5-fold CV.

Co-ancestry heatmap
We developed a co-ancestry heatmap by using the linkage model of ChromoPainter 41  Plots were generated in R (http://www.r-project.org/). T. gondii gene families organized in tandem arrays were identified with an in-house perl script.

Analysis of OrthoMCL species-specific genes
Genes found to be specific to T. gondii, H. hammondi, or N. caninum based on OrthoMCL clustering were further analyzed using a combination of sequence alignment tools ( Supplemental   Fig. 8A). First, proteins encoded by species-specific genes were compared against the proteomes of the other two genomes with blastp using an e-value cutoff of 1x10 -10 and at least 50% coverage of the shortest sequence. Genes that passed the cutoff were flagged as "Blastp hit" (see Supplemental Fig. 8B and Supplemental Dataset 6). Species-specific genes without significant blastp hits were then mapped in nucleotide space against the other two genomes with blastn (e-value ≤ 1x10 -10 and coverage ≥ 70% of the query sequence) and those genes having significant and syntenic hits were selected for further analysis. A blastn hit was considered syntenic if the protein encoded by one of the genes located immediately upstream or downstream of the blastn hit on the subject genome was similar by blastp to the protein encoded by one of the genes at either side of the query sequence (evalue < 1x10 -10 and coverage ≥ 50% of the shortest sequence). Species-specific genes without a significant blastn hit as described above were flagged as "Unique" genes (Supplemental Fig. 8B).
Coding sequences (CDSs) from species-specific genes mapped by blastn were then aligned to the other two genomes using GMAP (with parameters -n 1 -A -a 1), a splice-aware nucleotide sequence alignment tool 43 . Proteins from species-specific genes whose CDSs did not aligned with GMAP were then mapped to the other two genomes with tblastn (e-value ≤ 1x10 -5 ). GMAP and tblastn hits were manually inspected to determine alignment coverage and the presence of non-sense mutations or frame-shifts in the subject sequence. Species-specific genes (or their protein sequences for tblastn) that mapped with either method with at least 70% coverage and did not present any in-frame STOP codon or frame-shift on the subject sequence were labeled as "Full Length" (Supplemental Fig. 8B).
Genes (or their proteins) that mapped with less than 70% coverage or presented an in-frame STOP codon or frame-shift on the subject sequence were flagged as "Alternative Gene Model" and constitute potential pseudogenes or functional genes with an altered gene structure compared to the query sequence (Supplemental Fig. 8B). Remaining unmapped genes were added to the gene pool flagged as "Unique".

RNA-seq analysis of unique and alternative gene models in T. gondii.
Assembled T. gondii RNA-seq transcripts generated during the annotation phase of the project were mapped using blastn to CDSs from T. gondii genes that were classified as Unique, Alternative Gene Models, BlastP Hits and Full Length based on their comparison to H. hammondia and N. caninum.
Those CDSs that aligned to a transcript across their entire length with at least 95% identity were flagged as supported by RNAseq data. The statistical significance of the difference in relative abundance of each gene category with or without RNAseq support was assessed using the Fisher's exact test with the R function fisher.test.

Estimation of maximum alternative allele frequency scores and minimum sequencing depth.
Genome sequencing reads used to assemble the genomes of H. hammondi (32,612,714 Illumina reads) and T. gondii ME49 strain (4,697,063 454 and Sanger reads) were mapped to their respective genomes with Bowtie2 followed by SNP and INDEL identification with the utilities mpileup from samtools and call from bcftools with parameters "-cv -p 1" to ensure that all alternative alleles were reported, independently of their allele frequency. Thereafter, the maximum alternative allele frequency (MAAF) score, defined as the maximum allele frequency reached among the collection of alternative alleles identified in every genomic region of interest, was calculated with an in-house perl script. To estimate the minimum sequencing depth (MSD) reached by each locus hit by alternative gene models we calculated the sequencing depth at every nucleotide position of that region with the utility depth of the program samtools using default parameters and then MSD was calculated using an in-house perl script. Binning of MAAF scores and MSD data was carried out with the R program hist.

Regions of co-inheritance
To determine the extent of recombination and co-inheritance of blocks between strains, we examined all possible pair-wise comparisons between the 62 strains (62 x 62). There are 3,844 pair-wise comparisons of which 1,953 are unique. The number of SNPs per 10 kb window across the 14 chromosomes for each pair-wise strain comparison was determined. Low SNP regions (regions of recent co-inheritance, or shared blocks) were identified for 10 kb windows that have 5 or fewer SNPs across a continuous stretch of 10 windows (100 kb) allowing for the intermittent outlier. The ratio of windows meeting this criterion out of all 10 kb windows (6,202) was used as the percentage of the genome two strains share as recently co-inherited (referred to as % shared blocks). A heatmap was generated using the R function heatmap.2 (gplots library (http://www.r-project.org/)) with hierarchical clustering on the % shared blocks value. To analyze the composition of genes within shared regions, strains were grouped by Clade based on this hierarchical clustering: Clade A -18 strains, Clade B -8 strains, Clade C -9 strains, Clade D -8 strains, Clade F -9 strains. Clade E was not included in this analysis as the strains within this Clade are highly similar. The number of SNPs per 10kb window were averaged for all strains within a Clade, and chromosomal regions with low SNP density were identified as above using 10 kb windows that had 3 or fewer SNPs across a continuous stretch of 10 windows (100 kb), allowing for intermittent outliers.

Identification of SPD genes and clustering within the genome
We identified genes that belong to the SPD families (i.e. MIC, GRA, ROP, SRS and TgFAM) based on the annotation of ME49 accounting for CNV in determining the gene number. We then mapped the position of the SPDs onto the assembled ME49 genome and defined those that fell into conserved or non-conserved regions. To determine if gene type was independent of region type we compared the observed frequency of SPDs and non-SPD genes in conserved vs. non-conserved regions of the genome using a Chi-squared analysis. The null hypothesis was that the distribution would be random, and there would be no difference between observed and expected. A P value of ≤ 0.05 was considered significant cause for rejection of the null hypothesis.

Ancestry of conserved and non-conserved regions
The regions of the genome that are "conserved" were defined as the union of low SNP regions for Clades A, B, C , D, and F. We ignored Clade E because it is highly clonal. From these positions, we separated the SNP matrix (all SNPs for 62 strains) into those that were conserved vs. divergent (nonconserved) for the analyses. We reconstructed phylogenetic trees for the conserved and non-conserved regions using maximum likelihood as implemented in RAxML version 7.3.0 with the GTR+GAMMA model 44 . We calculated the standardized Robinson-Foulds (RF) distance 45 between the conserved and non-conserved trees. The standardized RF distance equals the proportion of negative branches in the two trees. In addition, we generated 500 bootstrap trees for the nonconserved region, and calculated the standardized RF distances between the 500 bootstrap trees and the ML tree for the non-conserved region. Those 500 distances represent the variation of the tree estimate due to mutation when the true tree is the non-conserved tree. We used the maximum distance as the threshold to identify the conserved tree that is significantly different from the nonconserved tree. We also estimated phylogenetic tress for the sequence blocks in conserved and non-conserved regions of Clades C, D, and F (where the null hypothesis for random distribution had been rejected). In each of the conserved and non-conserved regions, genes were separated into SPD and non-SPD genes and separate trees were generated for each using 100 bootstrap replicates using RAxML with the GTR+Gamma model described above. The 100 bootstrap trees were combined into a consensus with support values. Trees were considered congruent if they had no conflicting branches with bootstrap support of > 95%.

Phylogeny
To generate a phylogeny that spans across the Apicomplexa, we chose an ortholog shared by all the major taxa that is defined by OrthoMCL OG5_0126701. The gene id for T. gondii is TGME49_249810 and it encodes a 2,749 amino acid protein. In different apicomplexans, this gene is annotated as a DEAD box helicase or activating signal cointegration 1 complex subunit 3. The corresponding protein has regions of high conservation allowing broad phylogenetic comparisons, as well as variable domains that provide better resolution within closely related lineages. Phylogenetic trees were constructed using the Neighbor Joining algorithm with 1,000 Bootstrap replicates as implemented in Geneious ver. 7.1.5 (http://www.geneious.com, 46 ) and visualized with FigTree ver.

Synteny
Briefly, the OrthoMCL ortholog clusters (see above) were reformatted to represent each pair found in the cluster outside of self-matches. Syntenic blocks were generated between all combinations of genomes as described in 47 . A minimum number of three genes in a 25 kb search window up and downstream from each orthologous gene was required to form a syntenic block. Intervening non-syntenic genes were allowed. Custom scripts were used to calculate the total number of syntenic blocks between genomes as well as the number and percentage of genes present in syntentic blocks.
The MCSCAN 48 output was formatted for appropriate input to Circos ver. 0.51 49 for visualization.

Chromosome painting
Local admixture analyses using an enhanced ADMIXTURE algorithm 40 were conducted on blocks of size 1,000 SNPs on each of the 14 chromosomes of T. gondii. Local admixture was used to simultaneously optimize the number of ancestral states for a given genomic region and assign each of the 62 strains to clusters representing these ancestral states. Specifically, for each block of SNPs, we performed local admixture analysis with the number of ancestral states k = 2-7, and then chose the one with the minimum cross-validation (CV) error as the optimal number of ancestral states. The ".Q" output file for the optimal ancestral states provided the probability of each strain being assigned to each ancestral state. If the probability was greater than 0.9, the strain was assigned to the corresponding ancestral state. For each sequence block present in an ancestral state cluster, we colored the region according to clades represented in Figure 4a based on majority rule (i.e. we counted how many sequences in the ancestral state cluster belonged to 1 or more of the 6 groups in Figure 4a and we assigned the color that represented the largest number of sequences to all of the sequences in that ancestral state cluster).