Characterisation of pathogen-specific regions and novel effector candidates in Fusarium oxysporum f. sp. cepae

A reference-quality assembly of Fusarium oxysporum f. sp. cepae (Foc), the causative agent of onion basal rot has been generated along with genomes of additional pathogenic and non-pathogenic isolates of onion. Phylogenetic analysis confirmed a single origin of the Foc pathogenic lineage. Genome alignments with other F. oxysporum ff. spp. and non pathogens revealed high levels of syntenic conservation of core chromosomes but little synteny between lineage specific (LS) chromosomes. Four LS contigs in Foc totaling 3.9 Mb were designated as pathogen-specific (PS). A two-fold increase in segmental duplication events was observed between LS regions of the genome compared to within core regions or from LS regions to the core. RNA-seq expression studies identified candidate effectors expressed in planta, consisting of both known effector homologs and novel candidates. FTF1 and a subset of other transcription factors implicated in regulation of effector expression were found to be expressed in planta.

The pan-genome of many microbial species, including bacteria, fungi 1 and oomycetes 2 comprises core genes, often in syntenically conserved, gene-dense regions of the genome and so-called 'dispensable' genes, located in gene-sparse regions, flanked by abundant transposable elements 3,4 . In Fusarium and other fungal phytopathogens these may be specific areas of conserved 'core' chromosomes and individual lineage-specific 'LS' chromosomes, also known as dispensable or supernumerary chromosomes 1 .
The fungal pathogen Fusarium oxysporum (Fo) represents a diverse group of formae speciales (ff. spp.) causing crown and root rots as well as vascular wilts 5 . These ff. spp. are part of the Fo species complex (containing over 150 members 6 ), show narrow host adaptation, and exploit a broad range of niches. Many Fo isolates are non-pathogenic soil saprophytes and some have even been exploited as biocontrol agents 7,8 .
Bulb onion (Allium cepa L.) is a globally important crop; worldwide production in 2013 was 87Mt 9 . One of the major constraints to production is Fusarium basal rot (FBR), caused predominantly by F. oxysporum f. sp. cepae (Foc). Symptoms of infection include seedling damping off, root rot and basal rot which spreads up through the bulb scales 10 . Like many other Fusarium species, Foc produces resilient, long-lived chlamydospores that survive in the soil for many years 10,11 .
In a recent study, all pathogenic Foc isolates of UK origin were placed in a single clade (divided into 2 sub-clades) based on sequencing of housekeeping genes including EF1-α, whilst non-pathogenic isolates showed much greater diversity and were placed in multiple clades 12 . This suggests a clonal origin of Foc pathogenicity and is supported by work in Japan where all Fo isolates from bulb onion that were shown to be pathogenic were from the same EF1-α or intergenic spacer region (IGS) clade 13 . Previous studies based on EF1α [14][15][16] , rRNA 17 , ISSR markers 17 and RAPD markers 14 have suggested that Foc is more diverse with isolates falling into multiple clades with a potentially polyphyletic origin. However, such studies are rarely associated with large scale pathogenicity 1 NIAB-EMR, New Road, East Malling, Kent, ME19 6BJ, UK. 2  the assembly. Fus2 contig 18 was identified as LS by this analysis; however, later synteny analysis in comparison with Fol showed some conservation of synteny across this contig and it was therefore considered a poorly aligned region of a core chromosome. Of the seven identified LS contigs, four were only found in the three Foc pathogens, representing 3.9 Mb of the assembly, and were designated as "Pathogen Specific" (PS) contigs. The reciprocal alignments of Foc and non-pathogenic Fo isolates against the Fol genome clearly identified known Fol LS chromosomes (Supp. Table 2). An additional 11 Foc Fus2 contigs, (440 kb in total including mitochondrial sequences) were smaller than 200 kb and were excluded from this analysis. Effector annotation of the Fus2 genome. Generic effector-prediction approaches based upon identification of secreted proteins with an effector-like structure (small, cysteine rich) as predicted by EffectorP, secreted carbohydrate active enzymes (CAZymes) ( Table 3, section A) and identification of secondary metabolite synthesis genes (Table 3, section B) were used to investigate Foc and Fo effector complements. Secreted proteins represented 7.6% of the Fus2 proteo and genes encoding proteins with an effector-like structure approximately 1.9% of the proteome. Overall, comparisons between genomes showed no significant differences between the total numbers of Foc and non-pathogenic Fo genes encoding secreted proteins (t = 0.23, df = 4.8, p-value = 0.83), EffectorP proteins (t = −1.29, df = 4.02, p-value = 0.26), or secreted CAZymes (t = −0.72, df = 4.07, p-value = 0.51).

Mimp distribution in the Fus2 genome and prediction of mimp-associated effector candidates.
Mimp sequences are often found in proximity to Fo effectors 4 . Foc genomes were found to contain 136-153 mimps, significantly more than the 25-55 observed in non-pathogenic Fo isolates (t = −13.29, df = 5.75, p-value < 0.01) ( Foc core chromosomes 11-13 are enriched for secreted proteins and cell wall degrading enzymes. Foc core chromosomes 11-13 were noted to contain many genes with an effector-like structure and secreted CAZymes (Fig. 3). Comparison of the density of genes between core, effector-rich core (chromosomes 11-13), non-PS LS and PS regions of the genome (Fig. 4) showed that secreted genes (SignalP) were present at different  densities between core, effector-rich core, LS and PS regions of the Fus2 genome (F(3,14) = 8.072, P < 0.01). Similarly, the density of secreted CAZymes (F(3,14) = 12.04, P < 0.01), EffectorP genes (F(3,14) = 7.506, P < 0.01) and secondary metabolite clusters (F(2,12) = 4.26, P = 0.04) also differed between these regions. Effector-rich core regions were found to contain secreted genes at 50 genes Mb −1 ; double the density at which these genes were found in other regions of the genome (19-24 genes Mb −1 ). Similarly, secreted CAZYmes were highly enriched within this region at 16 genes Mb −1 in comparison to 3-6 genes Mb −1 in other regions. The effector-rich core also    showed high densities of EffectorP genes and secondary metabolite clusters, although these were not significantly enriched in comparison to core regions (Fig. 4).

Identification of effector candidates in PS regions. Fus2 PS regions contained 34 secreted EffectorP
genes, 11 secreted CAZymes and 4 secondary metabolite gene clusters (Supp . Table 3), of which 14, 4 and 3 were within 2 Kb of a mimp respectively. Non-PS LS regions contained an additional 6 EffectorP genes and 7 Secreted CAZymes; in each case two of these were within 2 Kb of a mimp. Of the four PS secondary metabolite gene clusters, one was a terpene synthesis gene cluster and one was a polyketide synthesis gene cluster, neither of which were within 2 Kb of a mimp. The Foc genomes were confirmed to carry homologs of the seven Fol SIX genes (SIX3, 5, 7, 9, 10, 12 and 14) through BLAST searches (Supp. Table 4) as reported previously 12 with two homologs of SIX3 and SIX9 present. All SIX genes identified within the Foc Fus2 genome are located within the four PS contigs 10, 16, 19 and 21 (Supp. Table 5, Fig. 3). Searches for SIX genes in the non-pathogenic Fo isolates from onion also confirmed that isolate PG contained SIX9 12 .
Investigation into sequence conservation between Foc isolates found no non-synonymous variation between EffectorP genes and secreted CAZymes in PS regions. Similarly, very little variation was seen in these effector candidates in other regions of the genome, with only five EffectorP genes and secreted CAZymes showing variation between Foc isolates (Supp. Table 6).
Effector candidates in LS regions show lower codon usage bias than those located in the core genome. Codon usage was investigated across Foc Fus2 genes in core and LS regions (Supp. Table 7). Mean CBI values for secreted CAZY and secreted EffectorP candidates were 0.385 and 0.336 respectively, significantly higher (t-test, p-value < 10 −18 ) than found for all genes (mean CBI = 0.215). In the Fol genome, Ma et al. (2010) showed a shift in preferred codons between LS and core chromosomes, especially towards GC in their third position 1 . In Foc there was no change in GC content across these two portions of the genome (core chromosomes, mean 51.7%; LS chromosomes 51.3%). LS regions (mean CBI = 0.145) contained genes with a significantly lower (t-test, p-value < 2.2e-16) average codon usage bias than in the core genome (mean CBI = 0.221). All genes with close proximity (<2 Kb) to mimps showed relaxed codon usage bias and LS secreted genes showed significant (t-test, p-value = 8.84E-03) codon usage bias (mean CBI = 0.284) compared to the rest of the LS genes (mean CBI = 0.145).

Foc LS regions have higher levels of gene duplication than core chromosomes. Many orthogroups
on Fol and Foc LS regions contained inparalogs, with homologous genes elsewhere in LS regions (Supp. Fig. 1). Duplicated genes in LS regions showed a bias towards being shared with other LS regions, whereas core Foc regions exhibited lower levels of gene duplication, with a concentration of duplicated genes shared with terminal regions of other chromosomes/contigs. In total, 1,424 gene clusters, representing one or more duplication events were identified. Taking one representative focal gene in each cluster and comparing the chromosomal/contig location between pairs of genes in these clusters, 49% of duplication events were observed to be within or between LS chromosomes/contigs (3,186). In contrast, 28% of duplications were between core and LS (1,822) and 23% within or between core and core (1,517) chromosomes. Overall, 1,235 of 2,115 genes on LS contigs contained a paralog in the genome, corresponding to a lower density of duplicated genes on core than on LS chromosomes (permutation test, p-value < 0.001). Dissection of duplications into tandem and segmental did not reveal any further patterns, with only 69 tandem duplications identified.
Open reading frame density is maintained in Foc LS regions. As in Fol, Foc LS contigs showed increased levels of repetitive and low complexity content in comparison to core regions, with 3-15% repeat-masked in core regions compared with 33-59% in LS regions respectively (Supp. Fig. 2). Gene density of Fol LS regions has previously been shown to be lower than on core regions 1 but this was not the case for Foc, where gene density was maintained between core and LS regions (Supp. Fig. 2). Analysis of predicted gene function for Foc LS regions found that Interproscan terms associated with transposon activity were significantly enriched on non-PS LS regions and PS regions (Supp. Table 8, P < 0.05). Terms associated with Helitron helicase transposons were found exclusively on PS regions. Aside from genes with transposon-associated features, PS regions showed enrichment for genes lacking IPR annotations (Supp. Table 8).
Known effectors and effector candidates in PS regions are among the highest expressed genes in planta. Using an established in vitro onion seedling root infection system, expression of Foc genes in planta was explored at 72 hours post inoculation (hpi), a previously identified timepoint when SIX genes are highly expressed 12 . Proximity to a mimp was associated with greater expression, with both non-effector candidates and secreted genes within 2 Kb of a mimp showing significantly greater expression than similar non-mimp genes (Fig. 5). Genes with a putative effector status (using the EffectorP pipeline) in PS regions showed high expression in planta, irrespective of proximity to a mimp.
Foc genes showing the highest expression in planta were further investigated. 21 LS Fus2 genes had equal or greater expression than the 50 highest expressed core genes (Supp . Table 9), with 17 of these on PS regions. These 17 highly expressed PS genes included 11 genes predicted as both secreted and within 2 kb of a mimp and represented previously-identified and novel effector candidates.
Foc PS effector candidates, SIX5 and two SIX3 homologs were the three highest expressed genes. Additional SIX gene homologs also showed high expression in planta, with two SIX9 homologs the 7 th and 17 th highest expressed genes. Other SIX homologs showed lower levels of expression in planta, with SIX10, SIX7 and SIX14 ranking as the 159 th , 137 th and 3277 th highest expressed genes, respectively. Other PS effector candidates represent putative novel effectors. They did not possess any recognisable interproscan domains and did not have an orthologous gene in the Fol gene models. tBLASTx searches against the PHIbase database did not identify any homologs for these genes (e < 1 × 10 −10 ) and many showed no homology to sequences on NCBI (e < 1 × 10 −10 ). Apart from the 11 genes predicted as both secreted and within 2 Kb of a mimp, six additional genes were present on PS regions and also highly expressed. Five of these, including SIX12, were located within 2 Kb of a mimp and one carried a peptidase domain (IPR001506) indicating that some of these genes may represent additional effector candidates. However, two genes were annotated as membrane-bound proteins discounting them as effector candidates. tBLASTx searches against the PHIbase database did not identify any homologs for these genes (e < 1 × 10 −10 ) and searches against genbank (e < 1 × 10 −10 ) found a homolog to only one gene, a hypothetical protein in Fusarium verticillioides (BLASTn; e = 1 × 10 −58 ) with 83% identity to the query sequence.
Non-PS, LS regions of the genome include genes highly expressed in planta. In general, genes on non-PS, LS regions showed similar patterns of expression to genes in core regions of the genome (Fig. 5). However, four genes on non-PS, LS regions had greater or equal expression to the top 50 expressed genes from core regions of the genome (5 th , 8 th , 10 th and 11 th highest expressed genes, Supp. Table 9). These genes were not considered to be effectors (based on mimps, secretion signals, EffectorP or CAZY identification) but were noted to encode three proteins carrying domains associated with formaldehyde activating enzymes (IPR011057, IPR006913) and a polyketide synthase protein (IPR020843) and were found located in close proximity to one another (g15699-g15700, g15704). As such, they may represent a secondary metabolite cluster not identified by Antismash prediction. Further investigation found that genes carrying a formaldehyde-activating domain (IPR006913) were significantly enriched in non-PS LS contigs with eight proteins identified, in contrast to one on PS regions and 15 on core regions (Supp . Table 8b).
Helitrons may play a role in re-arrangement of PS regions. Non-canonical Helitrons appear to be a key feature of Fo pathogenicity chromosomes 29 . In our analysis only Foc PS regions, were enriched for Helitron helicase-like domains (IPR025476) (Supp.   Table 8, panel B), with no annotations found in genes on core regions. A total of 35 genes were identified as helitron/helicase-like transposable elements (IPR025476, PF14214) in the Foc genome, with one on an unplaced contig (contig 33) and the rest located on PS regions. Furthermore, these transposable elements showed evidence of expression in planta, with six of the 35 genes having a mean fpkm >5. These results support recent identification of helitrons in F. oxysporum ff. spp. 29 , and indicate that a Helitron-based mechanism of genomic rearrangement is characteristic of PS regions, rather than core regions, or non-pathogen associated LS regions of the Foc genome.

Transcription factor analysis in LS regions. To investigate the diversity of transcription factors on Foc
LS regions, Interproscan functional annotations were searched for transcription factor domains. This led to the identification of 34 putative transcription factors (Supp. Table 10). Of these, 17 transcription factors were located on non-PS LS regions, of which 12 showed evidence of expression in planta, including a homolog of TF4 (Supp. Table 11b). The remaining 17/34 putative TF's were located on PS regions, and included homologs to previously identified Fol transcription factors TF1, TF3, TF8 and TF9 (Supp . Table 11a). Five PS TF's showed evidence of expression in planta at 72 hpi including a homolog to TF1 (FTF1). With the exception of TF3, each of the previously described TF1-9 genes had a homolog in the core genome, that showed evidence of expression in planta.
Foc carries a distinct complement of FTF1 genes. Due to the association of TF1 (FTF1) with SIX gene expression in several Fo ff. spp. 24 , the FTF gene family was further investigated for all the sequenced Fo isolates from onion. A single orthogroup was found to contain all FTF genes previously described in Fo strains, Fo47 and Fol-4287 25 as well as the BLAST homologs of FTF genes in Foc and Fo. Alignment and phylogenetic analysis of these genes allowed separation of FTF genes in FTF1 and FTF2 families (Fig. 6). All Foc isolates were found to carry a single copy of FTF2, which was located in the core genome of Fus2. Two copies of FTF1 were found in Fus2, which were both present in PS contig 19. The two other Foc isolates (125, A23) carried an identical FTF1 gene in a single copy. Interestingly, FTF1 homologs were identified in non-pathogenic Fo isolates A13, PG and Fo47, although these gene sequences were distinct from those in Foc (Fig. 6).

Discussion
In this study, through the use of single molecule sequencing, the structural conservation of lineage-specific (LS) chromosomes has been revealed in the onion basal rot pathogen F. oxysporum f. sp. cepae through comparison with F. oxysporum f. sp. lycopersci. Using a comparative-genomic approach, this work has revealed for the first time that LS chromosomes can be subdivided into pathogen-specific (PS) and nonspecific chromosomes that are specifically associated with Foc. RNAseq has shown that known effector candidates, conserved in other Fusarium ff. spp., as well as novel effector candidates are expressed in planta, paving the way for future functional studies and effector-informed resistance breeding approaches.
Previous studies have indicated that many other ff. spp., including Fol, have a polyphyletic origin 30 with some exhibiting clear patterns of 'race' evolution through stepwise loss of effectors 31 . All pathogenic Foc isolates formed a single clade, consistent with a single origin of pathogenicity on onion 12 . Examination of other non-pathogenic Fo isolates from onion showed that these were interspersed throughout the phylogenetic tree, with two isolates forming a sister clade to Foc but other strains, grouping with diverse pathogenic ff. spp. Foc may therefore represent one of the minority of Fo ff. spp. with a monophyletic origin. Our work has revealed a unique complement of 'SIX' gene homologues as well as other candidate effectors that are specific to Foc and are completely conserved across more than 40 isolates from Australia, Chile, Colombia, Czech Republic, Netherlands, South Africa, Spain, UK and USA 12 .
Despite the clear conservation of synteny between Foc and Fol core chromosomes, synteny was not conserved between Foc and Fol LS regions. Interestingly, the LS regions identified in Foc appeared to be much smaller than in Fol, with a total of 5.7 Mb identified in Foc versus 14 Mb in Fol. Designation of 3.9 Mb of Foc LS contigs as PS regions supports previous findings that Foc SIX3 is located on a ~4 Mb chromosome 13 , indicating that a single pathogenicity chromosome is present in Foc.
Subdivision of LS regions of the Foc genome into PS and non-PS regions showed for the first time that non-pathogenic isolates contain certain LS elements and their presence is independent of the position of the isolate in the phylogenetic tree. For example, contig 14 present in all pathogenic Foc strains, appears to be conserved in non-pathogenic Fo isolates A13 and A28 (distantly separated in the phylogeny), but not in isolate CB3, which is most closely related to A28. This suggests that there may be genetic exchange between divergent lineages of non-pathogens in their LS chromosome complement. The functional significance of these non-pathogen associated LS chromosomes is a topic for further investigation.
Structural analysis revealed chromosome-specific patterns of gene duplication, with most gene duplications being segmental and within LS regions. This was consistent with the breakdown of synteny observed between Foc and Fol LS regions. Transposon activity has been shown to drive evolution of LS regions in other fungal pathogens 32 . The finding that Helitron-containing transposons are restricted to PS chromosomes suggests that they may be important for the high degree of rearrangements within PS chromosomes. This may be evolutionarily advantageous in a clonal organism as it may facilitate rapid more adaptation, not only through a higher rate of large-scale genomic deletions (which may be adaptive when evading host recognition) but also in preventing the accumulation of linked deleterious mutations from interfering with selection, the so-called Hill-Robertson effect 33 . One question which requires further study, is how helitron helicase transposons are limited to PS chromosomes given there is some evidence for expression.
Similar numbers of effector candidates were identified in both pathogenic and non-pathogenic Fo isolates from onion, mainly due to the large number of secreted proteins present in the core genome and on accessory chromosomes, irrespective of pathogenicity. Despite being conserved in Fo, Fol chromosome 12 has been reported as conditionally dispensable 34 . A concentration of effector-like genes on the homologous Foc chromosome indicates a functional distinction between Foc effector-rich core chromosomes and the remaining core genome. Foc contains a large complement of core effectors, with enrichment on core chromosomes 11, 12 and 13 (Fig. 4). However, these core candidate effector genes were not highly expressed in planta. LS contigs present in non-pathogenic isolates from onion contained far fewer mimp sequences, at a level similar to the core genome, far below the levels seen in PS contigs, again highlighting the specific differences between PS chromosomes and the rest of the LS and core genomes.
Candidate effector genes on Foc PS regions showed high levels of expression in planta; the highest expressed genes were homologs to known SIX (most notably SIX3 and SIX5) genes but also included novel effector candidates. It will be important to test whether the I-2 resistance gene, identified in tomato recognises the Foc variants of SIX3 35 . Additional non-PS LS regions were also identified; these also possessed high numbers of mimps, and included highly expressed genes.
Transcription factors have previously been identified in the PS chromosome 14 of Fol (13, representing nine gene families; TF1-9) 24 . Foc was found to carry homologs to each of these previously identified TF1-9 genes distributed throughout the core and LS regions, but with no homolog of TF3 in the core genome 24 . However, similar to the different complements of SIX genes between Fo ff. spp., the pattern of TF1-9 genes present on PS regions in Foc was distinct from Fol. Identification of homologs to known TFs regulating pathogenicity indicates a conservation of transcriptional regulation between Fol and Foc. The role of novel TF candidates on PS regions requires further investigation, including their conservation through comparisons with other Fo ff. spp.
One of the major objectives of characterising the genomic basis of pathogenicity is to inform resistance breeding approaches using information about the effector complement of the pathogen 36 . The durability of a single resistance gene is dependent upon the necessity of the detected effector for the infection process and the adaptive potential of the effector gene. Effectors can either mutate to evade recognition by an R gene, such as SIX3 (AVR2) in Fol race3, or be lost such as SIX4 in Fol races 2 and 3 28 . It is only through the characterisation of effector function, combined with a population genetics approach, that an assessment can be made about the long-term utility of any R gene based resistance in breeding, or the breeding approach needed to effectively deploy the available resistance. It is therefore important that future work addresses the functional essentiality of effectors, the global diversity in the Foc population, the ability for effectors to mutate and evade recognition and the extent of R gene based resistance in onion.
Genome assembly. PacBio reads for Foc isolate Fus2 were assembled using Canu and polished using Illumina MiSeq reads in Pilon to correct erroneous SNPs and InDels 37,38 . De novo assembly of MiSeq data for the remaining six genomes was performed using Spades v.3.5.0 39 . In all cases Quast 40 was used to summarise assembly statistics and BUSCO 41 used to assess completeness of gene space within the assembly. Assemblies were edited in accordance with results from the NCBI contamination screen (run as part of submission to Genbank in November 2016) with contigs split, trimmed or excluded as required. RepeatModeler, RepeatMasker and trans-posonPSI were used to identify repetitive and low complexity regions (http://www.repeatmasker.org, http://transposonpsi.sourceforge.net). In addition to generating de novo assemblies, Illumina sequencing reads were mapped to the PacBio Fus2 assembly. Alignment was performed using Bowtie2 v.2.2.4 before bedtools-intersect was used to determine number of reads aligning across 100 Kb windows in Fus2 contigs 42,43 . Whole-genome phylogenetic analysis. A thirty gene phylogenetic tree was constructed using selected single copy genes present in the BUSCO ver. 1.22 Eukaryota fungi list for all the Fo isolates sequenced in this study as well as additional publically available Fusarium spp. genomes 41 44 . CDS sequences of single copy genes conserved across Fungi identified by BUSCO ver. 1.22 41 that were found to be complete and single copy in all the genome assemblies from this study were used as input in preparing the BEAST phylogenies. In total, 652 such genes were identified and aligned with MAFFT ver. 7.222 45 . The alignments were inspected visually with MEGA7 46 , trimmed and 30 genes in the top 5% highest nucleotide diversity selected for further phylogenetic analysis. A best-fit sequence evolution model for each gene was determined with PartitionFinder ver. 1.1.1 47 using BIC (Bayesian Information Criterion). The gene trees derived from the set of 30 single copy genes were investigated with multi-locus analysis in *BEAST ver. 2.4.2 48 . For each replicate *BEAST run, 300 million MCMC iterations, sampled every 10,000 chains, were run due to the size of the dataset. The molecular clock was set to strict due to intra-specific sampling and consequent expectation of lower inter-branch rate variation 49 . A Yule prior was placed on species tree and population size model set to follow linear growth with a constant root. The first 10% of results was discarded as burn-in, and run convergence and stationarity were inspected in Tracer ver 1.6 (http://tree. bio.ed.ac.uk/software/tracer/) to confirm that ESS scores for all estimated parameters reached at least stable 200. This was followed by generation of maximum clade credibility tree with median heights with BEAST's TreeAnnotator and visualisation of the species trees in FigTree ver. 1.4 (http://tree.bio.ed.ac.uk/software/figtree/). Run convergence was established by performing three independent runs and checking the variability of results across runs; a single representative run is reported here.
Identification of LS regions. LS regions were identified in the Foc Fus2 genome and the previously characterised Fol 4287 assembly using MUMmer v3.23 (PROmer -mum, delta-filter -g) 50 . Repeat-masked assemblies of non-pathogenic Fo, Foc and Fol isolates were aligned against the repeat-masked reference genome 51 . The percentage of unmasked bp covered by aligned sequence was calculated for each reference contig and a threshold of 30% identity was set as the boundary at which a contig was described as present or absent in a strain.
In planta RNAseq. RNAseq data was used to aid gene prediction and assess expression of effector candidates. Onion seedlings were inoculated with either the standard pathogenic Foc Fus2 or the non-pathogenic Fo47 isolate using a sterile, square petri dish system as previously reported 12 . Three replicate plates were set up for each isolate and RNA was extracted from pooled root samples taken from five plants at 72 hpi. Library preparation was carried out using a TruSeq RNA Sample Prep kit V2 (Illumina) and RNA sequencing carried out using an Illumina HiSeq machine with 100 bp paired end reads. Samples were multiplexed over two runs to give approximately 50 million reads per sample.
Gene prediction. RNAseq reads were aligned to de novo assembled genomes using Bowtie2 v.

and
Tophat v.2.1.0 to aid training of gene prediction programs 42,52 . An initial RNAseq alignment was used to estimate "mean insert size" and "fragment length distribution" of RNAseq reads and Tophat alignments re-run using these Functional annotation and effector prediction. Draft functional annotations were determined for gene models using InterProScan-5.18-57.0 62 and through identifying homology between predicted proteins and those contained in the July 2016 release of the SwissProt database 63 (using BLASTP (E-value > 1 × 10 −100 ). Interproscan terms were used to test for enrichment of functional domains within PS and non-PS LS regions. Abundance of each interproscan term was tested using Fisher's exact test, comparing number of genes carrying the annotation to those without. Benjamini Hochberg correction was applied for multiple testing.
Putative secreted proteins were identified through prediction of signal peptides using SignalP v.4.1 and removing those predicted to contain transmembrane domains using TMHMM v.2.0 64,65 . Additional programs were used to provide additional sources of evidence for effectors and pathogenicity factors. EffectorP v1.0 was used to screen secreted proteins for characteristics of length, net charge and amino acid content typical of fungal effectors 66 . Secreted proteins were also screened for carbohydrate active enzymes using HMM models from the CAZY database 67 and HMMER3 68 . Regions of the genome containing secondary metabolite gene clusters were identified using the Antismash 3.0 webserver 69 . Locations of gene clusters were parsed to gff3 format before genes intersecting these regions were identified using Bedtools.
Genes within 2 Kb of a mimp sequence were identified using the consensus sequence for the mimp 3′ inverted repeat 4 . This was searched against assembled genomes using Perl regular expressions /CAGTGGG.GCAA[TA] AA/ and /TT[TA]TTGC.CCCACTG/. Genes within 2 Kb of these mimp sequences were marked as candidates for being under the influence of a mimp-containing promoter.
Differences in total numbers of genes encoding secreted proteins, secreted CAZYmes and secreted EffectorP proteins between Foc and Fo isolates were assessed using t-tests in R. Differences in density of genes encoding secreted proteins in different regions of the Fus2 genome were tested using ANOVA in R, with pairwise differences between regions assessed using t-tests with bonferroni correction. Identical analysis were performed to assess density of secreted CAZYmes, secreted EffectorP proteins and density of secondary metabolite clusters by genomic region.
Codon bias amongst putative pathogenicity-related genes. Multivariate correspondence analysis of codon usage to detect presence of codon bias was first investigated using codonW ver. 1.3 (http://codonw. sourceforge.net/). Prior to codonW runs, Foc Fus2 coding sequences (one longest representative sequence per gene) were filtered to remove genes with potentially unusual codon usage stemming from their foreign origin which could indicate mis-annotated false positives; transposon genes (see above) but also genes with no domain annotation.
After identifying preferred codons, differences in codon usage between different classes of putative effector genes situated on lineage-specific (LS) and core chromosomes were investigated. Two statistics summarising gene codon usage bias were calculated in addition to RSCU and EN c ; frequency of optimal codons (F op ) and Codon Bias Index (CBI). Codon usage can be influenced by the selection on the sequence GC content so overall GC content of each gene (GC) and GC content in the third position of synonymous codons (GC3s) were also calculated by codonW. Pairwise correlation between all codon bias metrics, expression level, GC, GC3s content were investigated with Spearman's rank correlation, and differences in codon usage bias between different subsets of genes compared with a t-test. All the statistical analyses were carried out in R.

Gene expression.
Expression values for predicted genes were determined using Cufflinks to quantify fpkm values of RNAseq reads aligned to the genome during gene prediction. A mean fpkm value was taken for each gene from the three technical replicates. Expression of genes belonging to different effector categories and in different regions of the genome was investigated using negative binomial generalised linear model (glm), with log transformation using the glm function in R and other base functions 57  (Core, Effector-Rich Core, non-PS LS and PS) gene type (Secreted, CAZyme, EffectorP, non-effector) and whether the gene was within 2 Kb of a mimp (Yes, No). Combinations of terms were combined into a single input factor into the glm. This allowed removal of four combinations that were not present in the dataset, as no CAZyme, EffectorP or Secreted genes were within 2 Kb of a mimp and found on core chromosomes, also no secreted genes were within 2 Kb of a mimp and were located on effector rich core regions.