Study of a classification algorithm for AIEC identification in geographically distinct E. coli strains.

Adherent-invasive Escherichia coli (AIEC) have been extensively implicated in Crohn’s disease pathogenesis. Currently, AIEC is identified phenotypically, since no molecular marker specific for AIEC exists. An algorithm based on single nucleotide polymorphisms was previously presented as a potential molecular tool to classify AIEC/non-AIEC, with 84% accuracy on a collection of 50 strains isolated in Girona (Spain). Herein, our aim was to determine the accuracy of the tool using AIEC/non-AIEC isolates from different geographical origins and extraintestinal pathogenic E. coli (ExPEC) strains. The accuracy of the tool was significantly reduced (61%) when external AIEC/non-AIEC strains from France, Chile, Mallorca (Spain) and Australia (82 AIEC, 57 non-AIEC and 45 ExPEC strains in total) were included. However, the inclusion of only the ExPEC strains showed that the tool was fairly accurate at differentiating these two close pathotypes (84.6% sensitivity; 79% accuracy). Moreover, the accuracy was still high (81%) for those AIEC/non-AIEC strains isolated from Girona and Mallorca (N = 63), two collections obtained from independent studies but geographically close. Our findings indicate that the presented tool is not universal, since it would only be applicable to strains of similar geographic origin, and demonstrate the need to include strains from different origins to validate such tools.

The involvement of the adherent-invasive Escherichia coli (AIEC) pathotype in Crohn's disease (CD) pathogenesis has been extensively supported, as many researchers have reported higher AIEC prevalence in CD patients than in controls [1][2][3][4][5][6][7][8][9] , and mechanisms of pathogenicity have been linked with CD pathophysiology [10][11][12][13][14][15][16][17] . The ability to adhere to and invade intestinal epithelial cells, as well as to survive and replicate inside macrophages, are key characteristics of AIEC strains 2 . No gene or sequence exclusive to the AIEC pathotype has been identified, and AIEC identification currently remains challenging; the only way to identify an AIEC strain is by assessing bacterial infection in cell culture assays, which are non-standardised and highly time-consuming 2 .
Up to now, six genetic elements (pduC, lpfA, lpfA + gipA, chuA, 29 point mutations and 3 genomic regions) have been suggested as putative AIEC molecular markers 6,21,23,28,29 ; however, they either present low sensitivity or have been studied in a small number of strains. In a previous study conducted in our research group 30 , we designed a classification algorithm based on the identification of the nucleotides present in three Single Nucleotide Polymorphisms (SNPs). This algorithm displayed 82.1% specificity, 86.4% sensitivity and 84.0% accuracy within our Spanish strain collection. Given the high genotypic variability of AIEC, our aim was to validate the previously presented tool in AIEC/non-AIEC strains from distant geographical origins and in ExPEC strains, in order to assess the usefulness of these SNPs as molecular signatures for AIEC screening in external collections.

Results
We tested the validity of the algorithm 30 in additional, geographically distant AIEC/non-AIEC strains and in ExPEC strains.
When all AIEC/non-AIEC strains from Girona, Mallorca, France, Chile and Australia, as well as ExPEC strains, were analysed, 73/98 of the non-AIEC strains were correctly classified but only 39/86 of the AIEC strains were appropriately predicted, resulting in a high probability of obtaining false negatives (54.6%). Therefore, in comparison to the values obtained within our strain collection (82.1% specificity, 86.4% sensitivity and 84.0% accuracy), the global accuracy was significantly reduced (60.9%), with decreased specificity (74.5%) and especially lower sensitivity (45.4%) (Table 1, Fig. 1). In contrast to the previous study 30 , the SNPs that were found to be differentially distributed among our AIEC and non-AIEC strains (E3-E4_4.4 and E5-E6_3.16 = 3.22(2)) showed similar frequencies according to phenotype when all the strains were considered (Table 2). According to the algorithm 30 , strains displaying guanine (G) in SNP E3-E4_4.4 are classified as non-AIEC, as are strains that lack the gene (−) where SNP E3-E4_4.4 is located and display a nucleotide other than G at SNP E5-E6_3.16 = 3.22(2). Indeed, most AIEC strains (54.6%) were incorrectly classified because they met these conditions (Fig. 2). Other possible SNP combinations were considered for all the strains included in the study but none improved the precision of the algorithm.
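The two non-AIEC rules quoted above can be written as a short Python sketch. This is only an illustration of those two conditions, not the full algorithm from the previous study 30 : the function name, the argument names and the use of "-" to encode an absent gene are assumptions made here, and the branches that involve the third SNP (E5-E6_3.12) are not described in this text, so the function returns None for them. It also assumes that any value other than "G" at E5-E6_3.16 = 3.22(2), including a missing call, satisfies the second rule.

```python
from typing import Optional

# Hypothetical encoding: "-" marks a strain lacking the gene that carries the SNP.
GENE_ABSENT = "-"


def non_aiec_by_stated_rules(snp_e3_e4_4_4: str, snp_e5_e6_3_16: str) -> Optional[str]:
    """Apply only the two non-AIEC rules stated in the text.

    Returns "non-AIEC" when one of the rules applies, otherwise None,
    because the remaining branches of the algorithm are not given here.
    """
    # Rule 1: guanine at SNP E3-E4_4.4 -> non-AIEC.
    if snp_e3_e4_4_4 == "G":
        return "non-AIEC"
    # Rule 2: gene carrying E3-E4_4.4 absent AND a nucleotide other than G
    # at SNP E5-E6_3.16 = 3.22(2) -> non-AIEC.
    if snp_e3_e4_4_4 == GENE_ABSENT and snp_e5_e6_3_16 != "G":
        return "non-AIEC"
    # Not covered by the two rules quoted in the text.
    return None


if __name__ == "__main__":
    for calls in [("G", "A"), ("-", "T"), ("-", "G"), ("A", "T")]:
        print(calls, "->", non_aiec_by_stated_rules(*calls))
```

An AIEC strain that satisfies either condition is a false negative under this scheme, which is the misclassification pattern reported for 54.6% of the AIEC strains.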
Despite the global accuracy of the algorithm being much lower when all strains were considered, the method was suitable for geographically close strain collections. Indeed, if only Spanish strains (Girona and Mallorca) (N = 63) were considered, the accuracy of the tool was maintained (80.9%) (Table 1). Specificity was also good (82.3%), meaning there was a low probability of false positives (17.7%) (Fig. 1). Therefore, strains from different laboratory collections, but of similar geographical origin, were suitable for screening by this method.
The inclusion of ExPEC strains (N = 45) revealed that the tool was also useful for distinguishing the ExPEC and AIEC pathotypes, since 84.6% of strains displaying the AIEC phenotype were correctly classified, with a global accuracy of 78.9% (Table 1, Fig. 1).
These results demonstrated that the classification algorithm presented has limited applicability across the full set of E. coli strains assessed. However, this novel molecular tool showed promising results for Spanish AIEC and ExPEC strains.

Discussion
The identification of molecular tools or rapid tests to easily identify the AIEC pathotype would be of great interest to scientists studying the epidemiology of the pathotype, as well as clinicians hoping to detect which patients are colonised by AIEC to apply personalised treatments. Although several studies have been conducted with this aim in mind, there is still no molecular signature specific to AIEC 6,21,23,28,29 . In a previous study we performed comparative genomics of three AIEC/non-AIEC clone pairs and presented a classification algorithm that combines three SNPs, allowing for the classification of phylogenetically and phenotypically diverse E. coli isolates with a high accuracy rate in our strain collection 30 . Since the application of a molecular tool could assist in overcoming the problem of AIEC identification, we further tested the specificity and sensitivity of the tool in additional geographically distant and phylogenetically diverse AIEC strains, as well as ExPEC strains, which share genetic and phenotypic features 3,24-26 .
The tool was found to be accurate enough to distinguish between AIEC and ExPEC strains, since the sensitivity was 84.6% and the accuracy was 78.9%. In this case, we assessed both AIEC/non-AIEC strains from Girona (Spain) and ExPEC strains, the latter being mostly Spanish isolates. These results indicated that for a given geographic origin this algorithm could be applied to differentiate ExPEC from AIEC. So far, most of the studies looking for AIEC biomarkers have not included ExPEC strains in their analysis 6,21,23,28 . Only one study has focused on synonymous and non-synonymous SNPs along the genome of four B2-AIEC strains that could differentiate them from other B2-non-AIEC and B2-ExPEC genomes available in databases. Although that study found 29 SNPs that could separate AIEC from non-AIEC using a bioinformatics approach (which did not include the three SNPs in the presented algorithm), it did not find a signature sequence that distinguishes AIEC from ExPEC 29 . It is not possible to determine whether or not the high accuracy value we reported is due to the similar geographic origin of the ExPEC strains (40 from Spain and 5 from the USA). Thus, the inclusion of other ExPEC strains would be needed to validate the tool further. Unfortunately, the predictive values of the tool decreased considerably (60.9% accuracy) when strains across several geographic regions were considered. AIEC isolates from France, Chile and Australia were poorly discriminated with the SNP algorithm presented, resulting in significantly reduced sensitivity values (32.3%, 0% and 15.4%, respectively). Of note, this algorithm may be suitable for Spanish strains, because the accuracy was still high when two different collections of strains were studied (Girona and Mallorca) (80.9% accuracy). Given that the variable gene content of E. coli differs widely across geographic regions 31 , this variation contributes to the algorithm not being applicable across geographically diverse regions, and its accuracy may vary from one country to another.
In conclusion, the molecular tool that we previously proposed 30 is not universal, since its accuracy was reduced to 60.9% once a larger strain collection from different geographic locations and pathotypes was screened. We suspect it might be a good discrimination tool for a particular geographic location, in this case Spain. However, this observation should be confirmed with the addition of other Spanish strain collections including AIEC, non-AIEC, and other E. coli intestinal and extraintestinal pathotypes. The study of new SNPs that could be useful to distinguish between AIEC/non-AIEC strains from different geographical origins might be time-consuming and unprofitable, and should consider many aspects that make it even more complicated (for example, the moment of strain isolation and the patient's treatment). Therefore, we believe that new approaches (e.g. transcriptomics, metabolomics or epigenetics) should be applied to find a universal AIEC biomarker that could be used as a rapid standardised method for detecting AIEC among E. coli isolates, or maybe just E. coli isolates that have a strong colonising ability. Nonetheless, it is possible that no universal marker exists, in which case it would be interesting to look for a biomarker that encompasses the majority of AIEC strains 32 . In any case, this work highlighted the importance of validating putative molecular markers in a diverse strain collection, in terms of geographic origin and pathotype, in order to assess whether or not they could be used universally.
Table 2. Frequency of particular nucleotide variants in SNP E3-E4_4.4 and E5-E6_3.16 = 3.22(2) with respect to phenotype in two collections of AIEC/non-AIEC strains. Values are given in percentages with respect to the total number of AIEC or non-AIEC strains. *Others include those strains having T, S, K, Y or not having the gene where the SNP is encompassed.

Methods
The SNPs included in the algorithm (E3-E4_4.4, E5-E6_3.16 = 3.22(2) and E5-E6_3.12) were screened by PCR and Sanger sequencing. Primers and PCR conditions are indicated in Table 3. Apart from the strains assessed in the previous study (22 AIEC and 28 non-AIEC, which includes the LF82 strain) 30 , this collection comprised 60 AIEC and 29 non-AIEC strains mainly isolated from CD patients and controls of distinct geographical origins (Spain (Mallorca) 6 , Chile 6 , France and Australia 33 ) (Table S1). Most of these strains were phenotypically characterised in previous studies 6,33 . The adhesion and invasion indices of 25/33 Australian strains were measured in this study as previously described 1,30,34 in order to classify them phenotypically as AIEC or non-AIEC. In addition, 45 strains isolated from patients with extraintestinal diseases were also included; these were previously isolated from American patients with meningitis 35 , and Spanish patients with sepsis 26 or urinary tract infection 36 (Table S1). Phenotypic characterisation of these strains was performed by Martinez-Medina et al. 26 : four strains presented the AIEC phenotype and were considered as such in the analysis, and 41 did not (these were classified as non-AIEC).
Strains studied in this study were previously isolated under the approval of the Ethics
The differences in the distribution of nucleotides present in each polymorphic site between phenotypes were calculated using the χ² test. To establish the usefulness of the algorithm for AIEC identification, the specificity, sensitivity and accuracy values were measured as follows: Sensitivity (%) = (true positives/(true positives + false negatives)) × 100; Specificity (%) = (true negatives/(true negatives + false positives)) × 100; and Accuracy (%) = ((true positives + true negatives)/(total of cases)) × 100. A p-value ≤ 0.05 was considered statistically significant in all cases.
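As a worked check of the three measures defined above, the following minimal Python sketch computes them from a confusion matrix. The function and type names are illustrative assumptions, not part of the original analysis. The counts are the ones given in the Results for the full collection: 39 of 86 AIEC strains correctly predicted (true positives) and 73 of 98 non-AIEC strains correctly classified (true negatives).

```python
from typing import NamedTuple


class Metrics(NamedTuple):
    """Sensitivity, specificity and accuracy, all expressed as percentages."""
    sensitivity: float
    specificity: float
    accuracy: float


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> Metrics:
    """Compute the three measures exactly as defined in the Methods."""
    total = tp + fn + tn + fp
    return Metrics(
        sensitivity=tp / (tp + fn) * 100,
        specificity=tn / (tn + fp) * 100,
        accuracy=(tp + tn) / total * 100,
    )


# Counts for the full collection, taken from the Results:
# 39/86 AIEC and 73/98 non-AIEC strains were correctly classified.
overall = classification_metrics(tp=39, fn=86 - 39, tn=73, fp=98 - 73)

if __name__ == "__main__":
    print(f"sensitivity = {overall.sensitivity:.1f}%")
    print(f"specificity = {overall.specificity:.1f}%")
    print(f"accuracy    = {overall.accuracy:.1f}%")
```

With these counts the sketch agrees with the reported sensitivity (45.4%), specificity (74.5%) and accuracy (60.9%) to within rounding.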