Population structure of Environmental and Clinical Legionella pneumophila isolates in Catalonia

Legionella is the causative agent of Legionnaires’ disease (LD). In Spain, Catalonia is the region with the highest incidence of LD cases. The characterisation of clinical and environmental isolates using molecular epidemiology techniques provides epidemiological data for a specific geographic region and makes it possible to carry out phylogenetic and population-based analyses. The aim of this study was to describe and compare environmental and clinical isolates of Legionella pneumophila in Catalonia using sequence-based typing and monoclonal antibody subgrouping. A total of 528 isolates were characterised. For data analysis, the isolates were filtered to reduce redundancies, and 266 isolates (109 clinical and 157 environmental) were finally included. Thirty-two per cent of the clinical isolates were ST23, ST37 and ST1 while 40% of the environmental isolates were ST284 and ST1. Although the index of diversity was higher in clinical than in environmental ST isolates, we observed that clinical STs were similar to those recorded in other regions but that environmental STs were more confined to particular study areas. This observation supports the idea that only certain STs trigger cases or outbreaks in humans. Therefore, comparison of the genomes of clinical and environmental isolates could provide important information about the traits that favour infection or environmental persistence.

ST comparison between groups. The distribution of STs differed between groups. Twenty-nine out of 46 clinical STs appeared only in one group. Two of these 29 STs (ST193 and ST436) were found in both the CA and HA groups; 21 were found only in CA and six only in HA. Most of the environmental STs (44/61) did not appear in the clinical group. According to acquisition, only five STs (ST337, ST813, ST1836, ST1837 and ST2179) were exclusive to HOSP environments, 25 were exclusive to CT environments, one (ST728) was found in both, and the other 14 were found in other environments or in more than one environment.
Phylogenetic analysis using concatenated sequences divided the isolates into 15 groups (G) and 39 individuals (STs that could not be assigned to any phylogenetic group) (Fig. 3B). The biggest group (G11) included seven STs. Seven out of the 39 individuals had more than 23 bp differences and the other 32 had between five and 23 bp differences with regard to each other. Interestingly, ST1 was found as an individual differing in 8 bp in the allele neu from ST7 (G12).
The cluster distribution of STs into the two analyses (allele (CC) or concatenated sequences (G)) presented differences. The STs that formed the different allelic CCs were divided into different concatenated Gs, and some singletons were linked to other STs which belonged to Gs. For example, G6 was made up of only singletons (ST1829, ST1842 and ST1106) with three bp differences between them. CC5 was completely deconstructed, forming G3 with only two STs (ST193 and ST1839) plus one singleton (ST9). The three other STs that constituted CC5 were found as individuals in the concatenated sequence analysis, two having between five and 23 bp differences (ST242 and ST187) and the other one having more than 24 bp differences (ST1326). Similarly, CC3 was divided into three different Gs (G4, G5 and G7) and two individuals had between five and 23 bp differences. G5 comprised two STs from CC3 (ST1883 and ST1119), while G4 consisted of three STs from CC3 (ST728, ST87 and ST828) and one ST from CC10 (ST1827). G7 comprised only one ST from CC3 (ST1831) and two singleton STs (ST173 and ST1686) which differed from ST1831 by only 3 and 4 bp, respectively.

Discussion
The L. pneumophila population has been described in many countries [13][14][15][16][17][18][19][20][21][22][23] . Despite the high incidence of LD in Spain, the L. pneumophila population has been investigated in only two areas, the Comunitat Valenciana and Catalonia 9,13 : clinical and environmental L. pneumophila populations in the Comunitat Valenciana 13 , and only L. pneumophila clinical isolates in Catalonia 9 . In the present study, we compared clinical and environmental isolates of L. pneumophila recorded in Catalonia by SBT and MAb.
First, the isolates identified as L. pneumophila sg 1 were subgrouped by Dresden MAbs, including the virulence-associated MAb 3/1 (or MAb 2 in the Jolly panel 24 ). The proportion of clinical MAb 3/1 positive isolates was higher than that of environmental isolates, as in other studies carried out in the United Kingdom, England-Wales and Canada (clinical: 79.6%, 96.6% and 63.6% vs. environmental: 12.8%, 8.3% and 17.1% respectively) 16,20,21 . Amemura-Maekawa et al. suggested that the MAb 3/1 epitope is easily lost or gained during adaptation to environments when there is no pressure to retain human pathogenicity 14 . Loss of the MAb 3/1 epitope may provide an advantage for fitness, while decreasing the potential to infect humans.
Furthermore, as regards acquisition, CA isolates that expressed the epitope Mab 3/1 were three times more frequent than HA isolates. This phenomenon was already observed in previous studies 20,25,26 , where it was suggested that the distribution of HA isolates is representative of environmental colonization and host susceptibility rather than of virulence traits 26 . Conversely, there was a greater proportion of MAb 3/1 positive isolates in the HA group than in the environmental HOSP group. This trend suggests that although patient susceptibility is important for HA cases, the ability of the isolates to cause infection may also be a risk factor.
For its part, the low number of MAb 3/1 negative isolates and the absence of L. pneumophila non sg 1 in the CA group could be due to an under-diagnosis of LD in the community, which may be favoured by the diagnostic tests currently used. Most of the reported cases of LD in Europe (88.6%) 4 are diagnosed by urine antigen detection, but this diagnostic tool has been reported to be less sensitive to MAb 3/1 negative isolates and non-specific for L. pneumophila non sg 1 27,28 . Furthermore, HA cases present an advantage in that the sample is collected before the treatment of LD and after the appearance of the first symptoms, which helps to obtain isolates from respiratory samples.
The high SBT diversity observed in the isolates analysed was comparable to the clinical and environmental diversity found in other larger countries (UK, US and Canada) 16,20,23 . Given that the clinical isolates came from the environment, we expected them to be less diverse than the environmental isolates, but in fact, corroborating previous findings from the USA and Canada study 15,20 , they had greater diversity. The diversity of environmental SBT in Japan and the US showed that artificial environments present a trend towards a lower IOD 14,15 . The environmental isolates from Catalonia had a high diversity, although all the isolates came from artificial environments. The factors affecting the degree of genetic diversity in certain areas are unknown. Possible explanations include variations in the quality of the water or in the disinfection measures applied due to the differences in environmental regulations. Indeed, even according to origin, the diversity of the environmental isolates in this study continued to be higher than in other countries. The environmental HOSP group comprised a less diversified population and was divided into only 13 STs, with ST284 predominating (29%), followed by ST1 (19%). The diversity of this group of isolates was similar to that of the artificial environmental isolates in Japan and China 14,29 . The diversity of HA was similar to that of the environmental isolates and slightly higher than that of the HOSP isolates. Furthermore, the distribution of the HA isolates mirrored that of the environmental and HOSP isolates, supporting the notion that HA isolate distribution may be representative of environmental colonisation in which the virulence traits of the strains are less relevant than the intrinsic risk factors of inpatients. Despite these similarities, the proportion of MAb 3/1 positive isolates was higher in the HA isolates; this might be explained by the hypothesis of Amemura-Maekawa et al. 14 , which posits that the MAb 3/1 epitope is gained and lost according to the environmental pressure (that is, the need to retain human pathogenicity).
A total of 90 STs were identified, and only 17 were shared between clinical and environmental isolates. This difference in the distribution of STs has been reported in previous studies 16,20 . ST1 was the most frequent, accounting for almost 18% of the isolates. This ST 14,20,25,[30][31][32] is found worldwide, and is the most abundant in the EWGLI database (13%). In our study, clinical isolate ST1 was found in 9.17%, and tended to be more frequent in the HA isolates. In our study and in the 2014 US study 15 , ST1 was found more frequently in environmental than in clinical isolates. The reason for the low concordance between environmental and clinical frequency in our area is unknown, but there are several possible explanations. The first is the unavailability of ST1 clinical isolates in our database due to difficulties in culturing respiratory samples. Second, ST1 may be well adapted to the environment, with a low ability to infect, and may be more likely to trigger LD in hosts with a high level of risk factors such as HA cases.
In the present study, ST23 and ST37 were the most frequently found in the clinical group. In France, ST23 22 was the most frequently found in the clinical group together with ST1 and ST47, and, as in our study, it was rarely found in the environmental group. In the MST analysis ST23 was located in CC12 with ST1012 and ST1574, both of which were only found as clinical STs, with one and three bp differences in the ST23-mompS gene respectively. As other authors have suggested with other sets of STs 15,33 , this ST complex probably has a high capacity for infection, albeit with a reduced persistence in the environment. The MST analysis by concatenated sequences showed a grouping of these three STs in the G11 with four other STs: two STs were singletons in the CC analysis (ST75 only in the clinical group, and ST96 only in the environmental group) and two STs belonged to the CC13 (ST1874 only in the environmental group, ST20 in both clinical and environmental groups).
All the ST37 isolates in this study were characterised as MAb 3/1 positive, subgroup Philadelphia. This ST is related to ST36 and ST35, which were reported to be the cause of a large number of cases of LD in Europe and the US 15 . Harrison et al. suggested that these STs belonged to a ST group with an increased capacity to trigger LD 33 . ST37 was responsible for a large outbreak in Catalonia in 2002 34 , and the largest outbreak described in Murcia, Spain in 2000 35 . In the MST analysis ST37 was located in CC6 together with ST1579, which was only found in clinical isolates, and ST1776, which was only found in environmental isolates. ST37 had seven bp differences compared to ST1776 in the neuA gene and three bp differences compared to ST1579 in the asd gene. Therefore, in the concatenated analysis ST37 was only linked to ST1579, the clinical ST; together they formed the G2. MST analysis by concatenated sequences may be a useful tool for extrapolating information about the variation in genes (the different alleles of each gene) and their role implication. The most frequent environmental STs were ST1, ST284 and ST181, which accounted for 54.3% of the environmental L. pneumophila sg 1 isolates. The most representative ST from the L. pneumophila non sg 1 isolates was ST87 (22.2%). Together with ST42, these three isolates (ST1, ST284 and ST181) were the most common in HA isolates (43%), but were only found in 11% of CA isolates. Moreover, these three STs together with ST87 made up 70% of the HOSP isolates. This relative homology between HA and HOSP indicates that the relationship between the presence of the isolates and patient susceptibility is a possible cause of LD development.
Compared to all the other isolates, the CA group was the most differentiated, although the MST analyses did not show a cluster group that differed from the others. Interestingly, the CA isolates were similar to those from the regions studied in the US and Europe 16,20,23 . Comparing the presence of all the environmental STs (except ST1) in Catalonia with other regions in the EWGLI database 36 , the environmental isolates were found to be more heterogeneous than clinical STs. These results could be due, in part, to the low number of environmental studies and the low rate of ST notification to the EWGLI database.
The classification of isolates by phenons increased the diversity, but few phenons comprised the majority of clinical or environmental isolates. These findings corroborate those of a UK study 16 where 45% of clinical isolates were divided into four STs and 31% of environmental isolates were divided into three STs.
A major limitation of this study was that the isolates included were from epidemiological studies performed in our laboratory. This may have introduced a bias because current diagnostic methods under-diagnose LD cases caused by L. pneumophila non-sg 1 and L. pneumophila sg 1 MAb 3/1 negative isolates; what is more, since the environmental isolates came from case/outbreak investigations (i.e., they were not random either in terms of location or of time), we cannot extrapolate our results to routine environmental samples (i.e., samples from quality and prevention programs) and they may not provide an accurate reflection of the population structure of the area studied. This bias might be increased further by the fact that the majority of clinical isolates were L. pneumophila sg 1, and in the selection of the environmental isolates for the typing process most of the non sg 1 isolates were discarded. Therefore, although some non sg 1 were characterised, their results are not representative of this population; the possibility of other subtypes coexisting in the environment in addition to the ones included cannot be ruled out. Another limitation may be the variations in the surveillance systems and methods used to detect Legionella from hospital to hospital, which means that we cannot be sure that all LD cases were tested for Legionella during the study period. Likewise, the low productivity of sputum cultures made it difficult to obtain more clinical isolates for evaluation. All these limitations were increased by subdividing the isolates into smaller groups; although a trend was found in each subgroup, the differences observed may not represent a true reflection of the population structure.
In addition, although SBT is considered the gold standard, this method is based on the characterisation of only seven genes. The analysis of 2.5 kb out of 3.5 Mb of the genome 37-39 must be interpreted with caution, because it may not present a true reflection of the population structure. For example, in previous studies when unrelated ST1 isolates were typed by other molecular methods, the isolates were divided into different subtypes [40][41][42] .
This study presents a description and comparison of Catalan L. pneumophila environmental and clinical isolates. With regard to MAb subgrouping and ST typing, the clinical CA isolates were the group that differed the most from the others, with a higher proportion of MAb 3/1 positive isolates and fewer shared STs. Their capacity to infect and their adaptability to the environment may be due to the acquisition of some virulent or adaptability traits by horizontal gene transfer from other isolates coexisting in the same environmental source 43,44 . Likewise, some isolates could be better adapted to the environment or more prone to infect humans. For its part, the high similarity between HOSP and HA isolates may be more closely related to the abundance of these isolates in the environment and to patient susceptibility than to the virulence capacity itself.
Future perspectives. Although this study describes the distribution and classification of environmental and clinical STs in Catalonia, further studies are needed using whole genome sequencing (WGS) as a tool to classify the isolates and to identify the factors/genes that enhance the ability to infect humans.
The major drawback of WGS is the use of different reference genomes and bioinformatics pipelines and the lack of consensus regarding the establishment of a minimum number of different pb to determine whether two isolates are identical. As shown in the two phylogenetic approach analyses in the present study, the classification of the isolates varies according to whether allelic (CC) or concatenated sequences (G) are used. Nor is there any consensus regarding the limit of sequence variances needed to accurately interpret the large amount of data obtained using this method. However, the high level of discrimination offered by WGS [45][46][47] makes it a good tool for epidemiological studies. To date, most WGS studies have used a single nucleotide polymorphism (SNP)-based approach and have found the related environmental and clinical isolates to be <20 SNPs apart 48,49 . Currently an extended core genome MLST (cgMLST) scheme with approximately 50-100 core genes 47 is being designed and tested for L. pneumophila. This new approach has advantages over other analyses of WGS data and may establish itself as the new gold standard for population structure analysis.

Methods
Bacterial isolates. A total of 528 isolates of L. pneumophila from Catalonia were collected for the study.
These isolates came from epidemiological studies of LD conducted between 1989 and 2016. Isolates were maintained in glycerol-Brain Heart Infusion (BHI) (Oxoid) at −80 °C. They were seeded and grown on BCYE agar plates incubated at 37 °C during 72 hours, and were evaluated by MAb typing and SBT.
The study was conducted in accordance with the Declaration of Helsinki and Spanish legislation concerning clinical research. The protocols were approved by the local Human Research Ethical Committee (CEIC-HUGTP, Badalona). Since 2001, our laboratory has been the Catalan Reference Laboratory for Legionella typing in LD outbreak/cluster investigations. Because the health authorities referred anonymised clinical isolates for molecular typing to our laboratory, a waiver for informed consent was obtained from the local Human Research Ethical Committee in accordance with Spanish law. Clinical isolates. A total of 159 clinical isolates were collected and characterised (1 isolate/patient). Among the related isolates from the same outbreak (from patients with a probable or confirmed common source of infection), the first isolate from each outbreak was included in the analysis as a representative of the case/outbreak group, and 109 clinical isolates were fully analysed. Clinical isolates were divided according to the source of acquisition into: Hospital-acquired (HA = 32), Community-acquired (CA = 72) and unknown origin (UO = 5) (epidemiological data not available). In a previous study 9 we characterised 95 out of these 109 isolates by SBT and MAb subgrouping.
Environmental isolates. We included 369 environmental L. pneumophila isolates recovered and isolated during epidemiological studies of LD cases between 1989 and 2016.
Environmental isolates related to LD cases were excluded from the analyses, and the isolates which presented the same ST and MAb from the same point analysed were eliminated in order to reduce the redundancy of the results. One hundred and fifty-seven environmental isolates were fully analysed. These isolates were classified according to the origin of the sample: Cooling Tower (CT, n = 86), Hotel (n = 4), Hospital (HOSP, n = 31), Water Distribution System (WDS, n = 12), Other (spas, sprinklers, tanks and pools; n = 8) and Unknown (data not available, n = 16).
MAbs from the Dresden Panel were used to determine the serogroup of L. pneumophila and the phenotypic subgroup of L. pneumophila serogroup 1. The determination of the serogroup is based on the reaction against a specific antibody by indirect immunofluorescence for each serogroup, and the phenotypic subgroup of L. pneumophila sg 1 is based on the reaction against seven MAbs by an indirect immunofluorescence assay, and by dividing sg 1 into nine different phenotypic subgroups according to a flow chart 26 . SBT. Genomic DNA was extracted using the Chelex ™ extraction technique (Bio-Rad Laboratories, CA, USA).
The seven target genes (flaA, pilE, asd, mip, mompS, proA, and neuA) were amplified using the primers and amplification protocol provided by the EWGLI (4.2). DNA sequencing was performed on a 3130xl (3100) ABI Prism genetic analyser (Applied Biosystems) at the Genomics Core Facility at the Germans Trias i Pujol Research Institute. Sequences were analysed using Sequence Scanner software v1.0 (Applied Biosystems). Then, the allele number was assigned using the SBT-EWGLI Quality tool, and the ST profiles were defined with the SBT-EWGLI database.
Data interpretation. The diversity index (IOD) for each variable was calculated using the Hunter Gaston index with the VDICE online tool (http://www.hpa-bioinformatics.org.uk/cgi-bin/DICI/DICI.pl).
Minimum spanning trees (MST) were created by goeBURST implemented in PHYLOViz 50 . STs were represented by circles; the size of a circle indicates the number of isolates of this particular ST. Two approaches were performed: conventional MST using allelic profiles, and concatenated sequences of the alleles. The clonal complexes (CC) were calculated with the allelic profiles and were defined using the criteria of single locus variant. STs that cannot be assigned to any clonal complex are called singletons.
The Groups (G) in the MST using the concatenated sequences were restricted to differences of 4 bp or less between STs. After several analyses (data not shown), we chose a cut-off of 4 bp which resulted in a similar number of Groups (15) to CCs (14). STs that cannot be assigned to any phylogenetic Group are called individuals. Statistical analysis. Statistical analyses were carried out using SPSS statistical package program software (version 19 SPPS, Chicago IL, USA). The results for categorical variables were expressed as absolute and relative frequencies and medians and interquartiles. The Chi Square test was used to compare the proportional distributions among groups. The statistical tests used in the study were two-sided, and a p value of 0.05 or less was considered statistically significant.