Extensive epigenetic modification with large-scale chromosomal and plasmid recombination characterise the Legionella longbeachae serogroup 1 genome

Legionella longbeachae is an environmental bacterium that is the most clinically significant Legionella species in New Zealand (NZ), causing around two-thirds of all notified cases of Legionnaires’ disease. Here we report the sequencing and analysis of the geo-temporal genetic diversity of 54 L. longbeachae serogroup 1 (sg1) clinical isolates, derived from cases from around NZ over a 22-year period, including one complete genome and its associated methylome. The 54 sg1 isolates belonged to two main clades that last shared a common ancestor between 95 BCE and 1694 CE. There was diversity at the genome-structural level, with large-scale arrangements occurring in some regions of the chromosome and evidence of extensive chromosomal and plasmid recombination. This includes the presence of plasmids derived from recombination and horizontal gene transfer between various Legionella species, indicating there has been both intra- and inter-species gene flow. However, because similar plasmids were found among isolates within each clade, plasmid recombination events may pre-empt the emergence of new L. longbeachae strains. Our complete NZ reference genome consisted of a 4.1 Mb chromosome and a 108 kb plasmid. The genome was highly methylated with two known epigenetic modifications, m4C and m6A, occurring in particular sequence motifs within the genome.

www.nature.com/scientificreports/ identified via Gubbins (3045) and the isolates belonged to two main clades (Fig. 1). The larger clade consisted of isolates from several regions in NZ and shared a date of common ancestor between the years 604-1755 CE (95% HPD interval), while the smaller clade consisted of isolates from the Canterbury district only, sharing a date of common ancestor between 1904 and 1986 CE (95% HPD interval; S1 Figure). Given the ad hoc nature of our sampling and the strong bias towards the Canterbury region, it is not clear if the smaller clade truly reflects the distributions of these lineages within NZ. The oversampling of Canterbury isolates is primarily a historical consequence stemming from the 1990's when pneumonia aetiological studies in this region revealed the importance of Legionella as a cause of pneumonia 21,22 . As a result, systematic routine testing, including a commitment to culture patient specimens allowing the isolation of clinical strains, was initiated, which subsequently lead to a disproportionate number of isolates being derived from the region. Recently it has been shown that when routine, systematic testing similar to that used in Canterbury is performed in other NZ regions, there are several that have higher or similar rates of LD 2 . Thus, the higher rates previously observed in Canterbury appear to predominantly reflect the testing regime employed rather than some special or intrinsic quality of the region. Future studies that include more isolates from outside Canterbury will clarify this and potentially identify pathways that has led to its introduction to NZ.
One potential explanation for the splitting of the two clades is that there have been two separate introductions into NZ. L. longbeachae has recently been detected by qPCR on the bark of live pine trees 23 , particularly on the species Pinus radiata, which is an important commercial crop and the most common pine tree species grown in NZ 24 . It was first introduced in the 1850's, but the boom in commercial forestry did not begin until 1920-1930. This boom coincides with the date of the common ancestor of clade 2 and many sub-clades of clade 1. This suggests L. longbeachae may have been introduced to NZ with Pinus radiata followed by relatively rapid evolution and dispersal. The swift evolution and spread of disease-causing strains may be a feature of Legionella. It has also been observed in L. pneumophila, where David et al. 20 , showed phylogenetic evidence of rapid dispersion and the emergence of disease-causing strains in man-made environments over the last 100 years.
In our plasmid analyses, three additional L. longbeachae plasmids were identified through further sequencing of two of our clinical isolates, B1445CHC and B41211CHC. Isolate B1445CHC was found to contain two plasmids of 73,728 bp and 150,426 bp (pB1445CHC_73k and B1445CHC p150k), while B41211CHC contained only one plasmid of 76,153 bp (pB41211CHC_76k). Alignment of the NZ L. longbeachae read sets to the Legionella reference plasmids (pNSW150, pB41211 CHC_76k, pB1445CHC_73k, pF1157CHC, pB1455CHC_150k and pLELO) demonstrated that some of the NZ isolates contained an exact copy of the reference plasmids investigated, some contained similar plasmids with reads aligning to sections of the reference plasmids, and some contained reads that aligned to sections of more than one reference plasmid (Fig. 1). This illustrates that the Legionella plasmids share a common back bone separated by variable regions and that there is extensive recombination amongst them (S2 Figure). The plasmid results also correlated with the clades (Fig. 1) identified via phylogenetic analysis, suggesting that plasmid recombination events may pre-empt the emergence of new L. longbeachae strains.
Our global analysis that included available sequence information from 89 L. longbeachae isolates from the United Kingdom (UK) and NZ were found to share 3219 core SNPs and belonged to multiple small clades. Most of the clades consisted of isolates from a single country, whilst a small number had isolates from both countries (S3 Figure), indicating some recent global transmission.
Genetic diversity of L. longbeachae sg1 clinical isolates. Given there are few complete L. longbeachae genomes available we chose one of our sg1 isolates as the reference genome to further analyse our other 53 NZ isolates. Isolate F1157CHC was sequenced using both Illumina short read sequencing in the initial comparative dataset, and then it was subsequently sequenced with PacBio long read sequencing. To visualise the data from the comparison of the 54 samples in the dataset and to show multiple facets of this study simultaneously, an overarching Circos figure (Fig. 2) was generated using the complete PacBio genome of F1157CHC as a backbone. The tracks in the figure are described in detail in the legend. Overall it can be seen that the regions detected for recombination by Gubbins are unevenly distributed, with some clusters around the genome (~ 600 kb, ~ 800 kb, ~ 1900-2050 kb), and a large, slightly less dense region (2300-2800 kb). In total, 655 protein coding genes are at least partially included in these regions of high recombination, and they are not clearly associated with any functional class (S4 Figure).
As indicated in the gene rings in Fig. 2, the genome of F1157CHC was functionally annotated and categorized using the amino acid sequences from the NCBI PGAP predictions against the eggNOG-mapper database (v. 2.0). The genes were coloured according to their COG categories (Fig. 2). We found that 3410 (94.14%) returned an annotation result, and of those 2741 (80.38%) were categorized with COG functional categories (S1 Table). In terms of the performance of the eggNOG server, this level of annotation for L. longbeachae is slightly above the level of 76% reported for the Legionellales order, and close to the eggNOG v. 5.0 database average of 80% 25 . The main functional groups (those with a single COG category definition) accounted for 2522 (73.96%) of the annotations. As expected, the functions of the genes encoded on the chromosome and plasmid differ (S5 Figure), where it carries more genes of unknown function (category S) and those associated with replication, recombination and repair (category L). COG category S-"function unknown"-is the largest single category, and accounts for 547 (16.04%) of the returned annotations.
We used our set of 53 draft genomes to investigate both the core and pangenome. A genome summary of these 53 draft genomes, plus the complete genomes of F1157CHC, NSW150 and FDAARGOS_201, can be found in S2 www.nature.com/scientificreports/ www.nature.com/scientificreports/ (282.7 ± 5.5). These differences are small and not significant, suggesting genome-completeness doesn't strongly influence our ability to annotate genomes. The only exception is the number of rRNAs, which are much more numerous in the complete genomes (12 copies each) than the draft genomes (6.89 ± 0.42), reflecting the difficulty in assembling highly repetitive regions from short read sequencing data.
In comparison to the recent study of Bacigalupe et al. 17 , (n = 56, predominantly UK sg1 isolates), we found that the range of our coding sequences was similar to the 3558 genes they reported. We cannot say if there is any real difference in gene numbers between the NZ strains reported here, and those in the Bacigalupe et al., study 17 because only summary gene numbers per genome were provided, but it seems unlikely.
Using Roary 26 , we found a pangenome of 6517 genes, and a core genome of 2952 genes amongst 56 isolates (our 54 isolates, NSW150 and FDAARGOS). This is ~ 86.3% of the number of genes in the F1157CHC genome, indicating a large core genome and a small accessory genome amongst the isolates in this study. Bacigalupe et al. 17 , also reported a core genome (2574 genes) and pangenome (6890 genes), which were over a shorter, but contemporaneous, timeframe. Given the isolate numbers are almost the same (excluding reference isolates), but the methodologies differ for calculating the core genome, it is interesting to observe a smaller number of genes in the core but a larger number of genes in the pangenome. It is tempting to speculate that there might be a smaller gene repertoire for L. longbeachae in NZ, possibly a result of its relative geographical isolation, or maybe environmental conditions are different, requiring the use of different sets of genes to survive within NZ soil. Using the categories defined within Roary, we found 157 genes (95 to 99% of strains) in the soft core category, 865 (15 to 95%) in the shell category and 2543 (0 to 15%) in the cloud category (S1 File).
Currently, there are 61 recognised species and 3 subspecies within the Legionella genus (http:// www. bacte rio. net/ legio nella. html). Of these, 58 have at least draft genome sequences available, which aid in understanding the evolution of the genus 27 . A core genome has been estimated to be only 1008 genes, highlighting genus diversity. With a GC content of ~ 39% and a genome of ~ 3.3 Mb, L. pneumophila is regarded as the most clinically important 1 . A recent Australian study 28 has estimated the core genome of this species as 2181 genes, which is 36.7% of the pangenome's genes (5919 genes). In comparison, in our study, analogous numbers indicate 45.3% for L. longbeachae, suggesting its genome is probably more stable than the L. pneumophila genome.
Finally, we used FastGeP 29 to perform an ad-hoc whole genome MLST analysis of the 56 isolates using the 3420 CDSs in the F1157CHC reference genome. We found 2756 loci were shared, of which 1321 (47.93%) were identical at the allelic level. One-hundred and eight of the shared loci were excluded because of hypothetical gene duplications, and 664 were excluded because of incomplete information, such as missing alleles, truncation or nucleotide ambiguities. After removal of these loci, 2648 (of which 1327 were polymorphic) were used to construct the distance/difference matrix. As FastGeP is looking at allelic distances in a gene, the distance between two alleles is independent of underlying sequence differences. Visualization of the FastGeP matrix in iTOL 30 is shown in S2 file. Antibiotic resistance and virulence factor genes. Unsurprisingly, our 54 L. longbeachae sg1 isolates all contained a chromosomal class D β-Lactamase gene homologous to bla OXA enzyme family. This 795 bp bla OXA-like gene, whose phenotypic features are uncharacterized, is also found in L. oakridgensis (100% nucleotide match). Twenty-one isolates also have another molecular class D β-Lactamase with 100% nucleotide match to bla  that are contained on a plasmid similar to L. pneumophila pLELO. The bla OXA-29 gene was first identified in the Fluoribacter gormanii type strain ATCC 33297 T (Genbank accession number NG_049586.1 31 ). The majority of the known class D β-Lactamases are found on mobile genetic elements and indicate the intra-species transfer of bla OXA-29 on conjugative plasmids amongst the various Legionella species such as L. pneumophila, L. sainthelensi, and L. hackeliae. This bla OXA-29 β-Lactamase is part of a group of structurally related serine enzymes that have high activity against penicillins, reduced activity against cephalosporins and no activity against carbapenems 32 .
All isolates also contained a previously identified tetracycline destructase gene, tet56 that confers tetracycline resistance when expressed 33 . Tet56 belongs to a family of flavoprotein monoogenoxygenases that inactivate tetracycline antibiotics through covalent modification to the antibiotic scaffold 33,34 . Previously, the antimicrobial susceptibilities of 16 isolates that were sequenced in our current study had been investigated 35 . For these isolates, the tetracycline MIC 90 was found to be high, ranging between 16 to 64 mg/mL when the isolates were grown in BYE broth, suggesting tet56 was expressed and the protein was functional in these isolates (S3 Table).
Virulence factor database analysis showed our 54 isolates as well as the NSW150 and FDAARGOS_201 complete genomes had a near identical pattern with between 33 and 36 virulence factor genes (S4 Table). Many of these encoded various components of the type IVB Dot/Icm secretion system (T4SS), which is essential for its virulence and found to be present in all Legionella species examined to date 36 .
Legionella longbeachae chromosome and plasmid architecture. Our complete chromosome for isolate F1157CHC has been published 19 , and therefore the description of this genome is kept relatively brief and is more comparative in nature with the other available reference L. longbeachae genomes (NSW150, and FDAARGOS_201).
We compared all three reference genomes using the MAUVE plugin within Geneious (v 9.1.8) and the results are shown in Fig. 3. At 4,142,881 bp, F1157CHC is larger by 65,549 bp when compared to NSW150 and smaller by 19,851 bp to FDAARGOS_201. Overall, the genomes of F1157CHC, NSW150 and FDAARGOS_201 are similar in their organisation with the MAUVE alignment showing four (81-2264 kb) collinear blocks in the genomes, called LCB1, LCB2, LCB3 and LCB4. At an overall genome level, the order and orientation of these blocks indicates a greater similarity between NSW150 and F1157CHC, while FDAARGOS_201 is slightly different (S5 Table). www.nature.com/scientificreports/ Three of these blocks (LCB2, LCB3 and LCB4) are found in all three genomes, and a further one (LCB1) is found only in NSW150 and FDAARGOS_201. The genomic coordinates and the percentage of the collinear block that contains genomic sequence are described in S5 Table. In addition, there are two and three small regions in two of the genomes that are not found in collinear blocks totaling 4.2 and 4.4 kb for NSW150 and FDAAR-GOS_201, respectively. For FDAARGOS_201 and NSW150, two of these unique regions are found flanking the shortest collinear block of 81 kb (LCB1), and for NSW150 the third region is a short sequence at the start of the chromosome (unusually for this chromosome the start of the dnaA gene is not annotated to be at position 1). The LCB1 block shows the greatest disparity in content with the genomic length in NSW150 being 31.3 kb, but 73.6 kb in FDAARGOS_201, hence there are many gaps in the collinear block alignments.
It should be noted that as the MAUVE aligner within Generous works on a linear chromosome, the LCBs at the end of the chromosome form part of the same larger collinear block, meaning that on the circular chromosomes there are in effect only three blocks, with the ~ 1807 kb block LCB3 being flanked by the content variable 81 kb block LCB1. There are thus only a few boundaries around the main collinear blocks. The boundary between LCB2 and LCB3 in FDAARGOS_201 and FH1157CHC occurs within the traF gene, part of the tra operon. The organization is more complex in NSW150 where the 31.5 kb block of LCB1 and a 3.9 kb region containing three hypothetical genes is found between LCB2 and LCB3, with the tra operon being found on LCB1. The tra operon is important for pathogenicity because it forms part of the T4SS for the transfer of plasmids via conjugation 37 . The other main boundary between LCB3 and LCB4 for all three chromosomes, the transfer messenger RNA (tmRNA) ssrA gene is present at the end of LCB4. The tmRNA genes are part of the trans-translocation machinery, which can overcome ribosome stalling during protein biosynthesis. Trans-translocation has been found to be essential for viability within the Legionella genus, with the ssrA gene being unable to be deleted in L. pneumophila 38 . Under the control of an inducible promoter, it was found that decreasing tmRNA levels led to significantly higher sensitivity to ribosome-targeting antibiotics, including erythromycin 38 . At the end of LCB3 in F1157CHC and NSW150, there is an IS6 family transposase and an SOS response-associated peptidase, but little is known about these genes. The flanking gene in FDAARGOS_201 comes from a small orphan block of 1.3 kb between LCB1 and is a short DUF3892 domain-containing protein, as defined by Pfam 39 . Whilst having unknown function it is found widely across bacteria and archaea, and within the Legionellales order.
As described above, the collinear blocks include gaps, and except for LCB1, all other defined blocks in the isolates are found with the genomic length being greater than 87% of the block length. Within the blocks themselves, LCB1 shares a common region of ~ 23.8 kb and a larger non-overlapping (i.e. different gene content) region in FDAARGOS_201 compared to NSW150. For the remaining three blocks there are combinations of absence and presence of genetic material within these blocks across the three isolates. For the regions over 10 kb, these can be summarized as regions that are present in only a single isolate (37.1 kb in LCB3 of F1157CHC), or in two isolates (12. The gene content in these blocks is varied, and the boundaries close to tRNA genes, site-specific integrase genes, SDR family oxidoreductase genes, ankyrin repeat domain-containing genes, or in intragenic space, but for some of the boundaries transposase genes (IS3, IS4, IS6, and IS926 families) are involved. In bacteria, tRNAs have been shown to be integration sites 40 , so finding them at collinear block boundaries is unsurprising.
Only NSW150 and F1157CHC were found to contain a plasmid (pNSW150 and; pF1157CHC 19 ). At 108,267 bp pF1157CHC is 36,441 bp larger when compared to pNSW150. To assess plasmid architecture more fully, the three additional L. longbeachae plasmids we obtained from further sequencing of two of our isolates www.nature.com/scientificreports/ (pB1445CHC_73k, pB1445CHC_150k and pB41211CHC_76k) were aligned using MAUVE and visualised in Geneious, (Fig. 4). The plasmids share a common backbone consisting of conjugational genes (yellow collinear block), ranging in size from ~ 25,000 to ~ 28,000 bp, as well as several other collinear blocks that vary in size and orientation (Fig. 4). These blocks are separated by variable regions around mobile genetic elements, such as insertion sequences. Analysis of the larger plasmid pB1445CHC_150k revealed this is the same as plasmid pLELO, first reported in L. pneumophila. MAUVE alignment of the L. longbeachae plasmids, pLELO and two L. sainthelensi plasmids (pLA01-117_165k and pLA01-117_122k 41 ) (S6 Figure) again shows Legionella plasmids have a common backbone including conjugational genes separated by variable regions. Although the number of plasmids in our analysis is limited, Legionella plasmids identified to date can be broadly divided into two groups; one consisting of the smaller plasmids of ~ 70 kb that appear to be primarily a L. longbeachae group (pNSW150, pB1445CHC_73k, pB41211CHC_76k) and another group consisting of larger plasmids that occur in various species, including our complete genome (pF1157CHC, pLELO, pLA01-117_165k). This again suggests there has been extensive plasmid recombination followed by both intra-and inter-species transfer, supporting the findings of Bacigalupe et al. 17 .
Interestingly, pB1445CHC_73 k has a repetitive region that was identified as a clustered regularly interspaced short palindromic repeat (CRISPR) element. This element belongs to the type I-F system with the same repeat region between 20 to 33 spacer regions and cas1-cas6f associated enzyme (Fig. 5). While there are few reports of naturally occurring CRISPR-Cas arrays on plasmids, previous studies 42, 43 as well as a recent comparative genomics analysis of available bacterial and archaeal genomes has demonstrated that type IV CRISPR-Cas systems are primarily encoded by plasmids 44 . There have also been similar reports of a type I-F CRISPR-Cas array being present on the plasmids of L. pneumophila strains 45,46 . Further analysis of our other L. longbeachae isolates showed that the type I-F CRISPR-Cas element is also present in 5 other strains (F2519CHC, LA01-195, LA01-196, LA03-576).
Legionella longbeachae methylome. The PacBio assay utilized in the current study is unable to detect 5mC modifications. However, methylome analysis of our F1157CHC genome identified two classes of modified base, N4-cytosine (m 4 C) and N6-methyladenine (m 6 A). Bases in the chromosomal sequence were more likely to be modified (1.49% of As and 6.4% of Cs being methylated) than those in the plasmid (1% of As and 2.4% of Cs) (Fig. 6A). Modifications were evenly distributed within a given molecule, except for a single cluster of m 6 A in the chromosome, where this methylation 'spike' is focused on a specific gene, BOB39_12100 as depicted in Fig. 6B. The majority (73.6%) of m 6 A bases occurred in three sequence motifs (ATGNNNNNNRTGG/ CCAYNNNNNNCAT, GATC and GGGAG). Two of these (ATGNNNNNNRTGG/CCAYNNNNNNCAT and GATC) are almost always methylated (97-99.5% of occurrences) while the third (GGGAG) is frequently modified (77.2% of occurrences). By contrast, the m 4 C modifications are not strongly concentrated in motifs. The motif most frequently associated with this modification (CSNNNTB) is only modified in 9.2% of occurrences (about 3 times the background rate for all cytosines).
DNA methylation in bacteria is often associated with restriction modifications (RM) systems, which protect the bacterial cell from foreign DNA. These systems combine a restriction endonuclease that digests unmethylated copies of target sequences and a DNA methyltransferase that methylates this sequence motif in the bacterium's own DNA. The strong association between m 6 A modification and three sequence motifs in the L. longbeachae genome suggests this modification is part of an RM system.  www.nature.com/scientificreports/ www.nature.com/scientificreports/ Using REBASE, we identified putative methyltransferases and encodnucleases in the L. longbeachae genome. This analysis revealed three neighbouring genes that encode a type I RM system associated with the ATGNNNNNNRTGG/CCAYNNNNNNCAT motif. Specifically, gene B0B39_08545 encodes a SAM-dependent DNA methyltransferase with target recognition domains for both ends of this motif, while genes B0B39_08550 and B0B39_08555 encode the S and R subunits of an associated endonuclease. The enzymes responsible for the GATC and GGGAG motifs are less clear. Two proteins (LloF1157ORF6795P and LloF1157ORF8795P) are homologous to methyltransferases that recognize GATC in other species. Neither of these proteins are associated with a restriction endonuclease.
Although many bacterial genomes contain the m 4 C modification, the biological functions encoded by it remains unclear 47 . There is some evidence that this mark may contribute to the regulation of gene expression. Notably, the deletion of a single m 4 C methyltransferase in Helicobacter pylori alters the expression of more than 100 genes and leads to reduced virulence. We used our genome annotation and methylation data to test for any associations between m 4 C methylation and genome features of functional classes of genes that might suggest this mark contributes to gene regulation in L. longbeachae. We found it is considerably more common within protein coding genes than intergenic spaces (Fig. 6C). However, there is no association between the presence of this mark in a gene sequence and any of the functional classifications present in our COG data (Fig. 6D). Although the over-representation of m 4 C bases in genetic sequences suggests it might be associated with, or a passive consequence of, transcription in L. longbeachae, we find no evidence it contributes to particular biological functions.
In summary, we have demonstrated that most genomic variability in L. longbeachae is from recombination with large-scale rearrangements in the chromosome. Our 54 sg1 clinical isolates could be grouped into two highly related clades that persisted over time. The most genetically distinct clade consisted of isolates from only the Canterbury region but could just reflect oversampling from this region. Further sequencing of isolates from other regions is required. Most sequenced isolates were found to contain a plasmid that showed high levels of recombination and horizontal gene transfer with evidence for both intra-and inter-species gene-flow. The genome of L. longbeachae was also highly modified, with m 6 A modifications being the most common and was strongly associated with particular sequence motifs.

Materials and methods
Bacterial isolates, sequencing and genome assembly. A total of 60 isolates previously identified as L. longbeachae (including 57 serotyped as sg1 and 3 serotyped either as sg2 or undefined) were sequenced. Isolates were obtained from either the NZ Legionella Reference Laboratory (ESR, Porirua, New Zealand; n = 39) or Canterbury Health Laboratories (CHL) culture collection (Christchurch, New Zealand; n = 21). All isolates were derived from sporadic LD cases that occurred between 1993 and 2015 from 8 regions (S7 Figure) around the country and included the first NZ case in which L. longbeachae was successfully cultured from a patient specimen (LA93_171; S2 Table).
Isolates were grown on buffered-charcoal-yeast-extract (BCYE) agar at 37 °C for 72 h. DNA was extracted from each fresh culture using GenElute Bacterial Genomic kits (Sigma-Aldrich, MO, USA) according to the manufacturer's instructions. Libraries were prepared using the Nextera XT kit (Illumina, San Diego, CA, USA) and were sequenced using Illumina MiSeq technology (2 × 250 bp paired-end) and version 2 chemistry by NZ Genomics Ltd (University of Otago, Dunedin, NZ). The quality of the raw reads was checked using FastQC (v. 0.11.4; https:// www. bioin forma tics. babra ham. ac. uk/ proje cts/ fastqc/). They were mapped against PhiX using Bowtie2 (v. 2.0.2 48 ), and any that mapped to PhiX were removed from the SAM file, and read pairs were reconstructed using the SamToFastq.jar program from the Picard suite (v. 1.107; https:// broad insti tute. github. io/ picard/) using the default parameters. Any adaptors were removed through the "fastq-mcf " program (using the default parameters) from the ea-utils suite of tools (v. 1.1.2-621; https:// expre ssion analy sis. github. io/ ea-utils/). Finally the reads were quality trimmed using SolexaQA++ (v. 3.1.4 49 ) at a probability threshold of 0.01 and sorted on length to remove any sequences < 50 bp prior to assembly. Sequence reads from each isolate was assembled using SPAdes (v. 1.2 50 ) de novo assembler in "careful" mode, with default settings. Table), 57 were found to be L. longbeachae sg1, two were sg2 and one had been mistyped and was Legionella sainthelensi. Analyses were limited to the sg1 isolates but because poor sequence data were obtained for three genomes (2 from Auckland and 1 from Waikato), only 54 were included (Table 1). We also included the two other publicly available complete genomes for L. longbeachae sg1, NSW150 (Australia; GenBank: NC_013861) and FDAARGOS_201 (USA; GenBank: NZ_CP020412) in our core genome and cluster of orthologous groups (COG) analyses.

Sequenced strains analysed. Of the 60 isolates (S2
The reads of a further 65 previously published L. longbeachae isolates (Bioproject number PRJEB14754, SRA accession numbers ERS1345649 to ERS1345585 17 ) were downloaded and compared with our 54 sg1 isolates. However, 30 of these read sets were either of poor quality, aligning to less than 80% of our reference genome (F1157CHC; GenBank NZ_CP020894 19 ), or were not L. longbeachae sg1 isolates and were excluded. The remaining 35 read sets were included in our global phylogenetic analyses (S6 Table).
Ancestral state reconstruction and phylogenetic analysis. Single nucleotide polymorphisms (SNPs) were identified using Snippy v2.6 (https:// github. com/ tseem an/ snippy). Snippy uses the Burrows-Wheelers Aligner 51 and SAMtools 52 to align reads of the 53 NZ L. longbeachae isolates to our reference genome Legionella longbeachae F1157CHC and FreeBayes 53 to identify variants among the alignments. Gubbins was used to remove areas of recombination on the full alignment 54 . SNPs were exported into BEAUti v2.5 to create an Extensive Markup Language (xml) file for BEAST v2.5 55  Global L. longbeachae isolates. As described above the read sets of 35 previously published L. longbeachae isolates 17 were downloaded and compared with our 54 NZ isolates using the SNP-identification method described above. In total, 89 L. longbeachae isolates from NZ and the UK were investigated for our global analysis. RaxML 63 was used to form a maximum likelihood tree of the isolates based on their SNP data and was visualised using EvolView v2.
Core genome and COG analyses. The eggNOG-mapper 25,64 ) webserver with default parameters (http:// eggnog-mapper. embl. de/) was used to annotate the F1157CHC PGAP-derived amino acid sequences. The Prokka pipeline (v. 1.12 65 ) was used to annotate our draft isolates using default parameters. The Prokka-generated GFF files were analysed with Roary using default parameters, and the comparison script roary_plots.py was used to visualize the output. FastGeP was used with default parameters to perform a whole genome MLST analysis of the 56 isolates, which meant that the generated allele sequences were searched with BLAST + at an identity threshold ≥ 80%. F1157CHC was used as the reference genome for this analysis. SplitsTree (v.4.15.2 66,67 ) was used to convert the FastGeP Nexus file into a Newick file (as a Neighbour-joining tree) for visualization and annotation in iTOL with the inclusion of metadata for region and the sample type.
Complete NZ reference genome, gene prediction and annotation. To generate our own complete NZ reference genome, one isolate (F1157CHC) was further sequenced using the PacBio RSII system (Pacific Biosciences, CA, USA) as previously described 19 . Gene prediction and annotation was performed using the NCBI Prokaryotic Genome Annotation Pipeline (2013).  Total  1993-1997 1998-2002 2003-2007 2008-2012 2013-2015   Northland  -2  ---2   Auckland  ---2  9  11   Waikato  1  -1  --2   Bay of plenty  --1  --1   Rotorua  ----1  1 Manawatu www.nature.com/scientificreports/ Genome architecture. In order to assess the genome architecture, F1157CHC was used as the basis for all analyses in which comparisons were made against a reference. The genome was visualized using Circos software (v. 0.69.3 69 ). Tracks included mapping the annotation prediction from PGAP, as well an overlay of the results of a functional annotation with the eggNOG web annotation server, mapping of both the methylation results and recombinant regions detected with Gubbins, SNP density of the comparative samples as defined by SNIPPY, and finally a visualization of the repeats within the F1157CHC genome using Reputer 70,71 .
The genome was analysed with the following Reputer parameters (number of best hits: 10,000; minimum length: 30 bp; and maximum Hamming distance: 3), and the output parsed through a MySQL database with a custom Perl script to generate the tracks to allow the links between all repeated regions to be visualized on the Circos plot. Of the four possible repeats, only the forward and palindromic repeats were detected. Furthermore, depending on the Hamming distance between the two repeats, the links were coloured to show those with a smaller Hamming distance as a darker colour. In order to assess the overall genome architecture in comparison to other L. longbeachae genomes, the MAUVE plugin within Geneious (v. 9.1.8) was used to visualize the F1157CHC genome against NSW150 and FDAARGOS_201.
Legionella longbeachae methylome. Methylated bases were detected for isolate F1157CHC, using the "RS_Modification_and Motif Analysis" protocol implemented in SMRTAnalysis v2.3.0 using the SMRTbell DNA library described above as input. This pipeline takes advantage of BLASR (v1 72 ) to map sequencing reads to the assembled genome and MotifFinder v1 to identify sequence motifs associated with particular modifications. The resulting files were submitted to REBASE 73 along with our annotated reference genome to identify protein coding genes that may be responsible for the inferred methylation patterns.
The distribution of methylated bases on the reference genome, and with regard to genomic features was analysed using bedtools (v2.25.0 74 ) and the R statistical language (v3.4). We tested for differences in methylation rate between genes of different functional classes using anova, as implemented in R. A complete record of the code used to perform statistical analyses and visualisation of the methylome data is provided in (S4 file).

Data availability
The raw reads of the 54 New Zealand L. longbeachae isolates are available in the NCBI Bioproject database (https:// www. ncbi. nlm. nih. gov/ biopr oject/) under accession number PRJNA417721 and in the Sequence Read Archive (SRA) database (https:// www. ncbi. nlm. gov/ sra) under accession numbers SRX3379702-SRX3379755. Our complete annotated reference genome for isolate F1157CHC is available in genbank genome (https:// www. ncbi. nlm. gov/ genome) under accession numbers NZ_CP020894 (chromosome) and NZ_CP020895 (plasmid). All supporting data, code and protocols have been provided within the article or through Supplementary Data.