GEFAAR: a generic framework for the analysis of antimicrobial resistance providing statistics and cluster analyses

Easy access to antimicrobial resistance data and meaningful visualization are essential to guide empirical antimicrobial treatment and to promote the rational use of antimicrobial agents. Currently available solutions are commonly externally hosted, centralized systems. However, there is a need for close monitoring by local analysis tools. To fill this gap, we developed GEFAAR, a generic framework for the analysis of antimicrobial resistance data. Following the example of the German Robert Koch Institute (RKI), an interactive web application is provided to determine basic pathogen and resistance statistics. In addition to the RKI’s externally maintained database, our application provides a generic framework to import tabular data and to analyze them safely in a local environment. Moreover, our application offers an intuitive web-based user interface to visualize resistance trend analyses as well as advanced cluster analyses on species or clinic/unit level to generate alerts of potential transmission events.

Our software aims at reducing restrictions on the input format to a minimum. As a first step, the initial upload is performed. A user selects a file and the corresponding field separator. All common separators (comma, semicolon, tab) are supported. In a subsequent step, the user chooses the columns containing the following metadata: species, clinic/unit, specimen and date. GEFAAR places no restrictions on how the columns are named in the original file (see Supplementary Information, Fig. S1). Different date formats are supported and user-definable. If a user selects a date format that is expected to contradict the data, e.g. selecting 'dd.mm.yy' although the input does not contain any '.' in the corresponding column, a note is reported.
GEFAAR assumes that every line in the provided input file corresponds to one isolate. Every column subsequent to the common metadata columns provides information on resistance towards one antimicrobial agent. Sticking to common nomenclature, the information is expected to be coded as follows: 'R' = resistant, 'I' = susceptible with increased exposure, 'S' = susceptible, '-' = not analyzed (according to EUCAST 9).
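The input checks described above can be illustrated with a minimal Python sketch. It is not taken from the GEFAAR code base (which is written in R): the format table DATE_FORMATS and all function names are hypothetical, and only the codes 'R', 'I', 'S' and '-' and the 'dd.mm.yy' example come from the text.

```python
from datetime import datetime

# Result codes named in the text: R, I, S and '-' (not analyzed).
VALID_CODES = {"R", "I", "S", "-"}

# Hypothetical mapping from user-facing format labels to strptime patterns.
DATE_FORMATS = {
    "dd.mm.yy": "%d.%m.%y",
    "dd.mm.yyyy": "%d.%m.%Y",
    "yyyy-mm-dd": "%Y-%m-%d",
}


def date_format_note(values, label):
    """Return a note if the separator of the chosen format never occurs, else None."""
    # The separator is the first non-letter character of the label, e.g. '.'.
    separator = next(ch for ch in label if not ch.isalpha())
    if not any(separator in value for value in values):
        return (f"Note: format '{label}' selected, but no '{separator}' "
                "was found in the date column.")
    return None


def parse_dates(values, label):
    """Parse date strings with the selected format (ValueError on mismatch)."""
    return [datetime.strptime(v, DATE_FORMATS[label]).date() for v in values]


def invalid_codes(row, agent_columns):
    """List the agent columns of one isolate whose value is not R, I, S or '-'."""
    return [col for col in agent_columns if row[col] not in VALID_CODES]
```

The separator check only produces a note and does not block the upload, mirroring the behaviour described in the text; strict parsing is kept as a separate step.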
Our application consists of four main analysis modules: (1) pathogen statistics, (2) resistance statistics, (3) trend analysis, and (4) cluster analyses. The pathogen statistics serve as a basic overview of the available data. For a selected year and clinic/unit (optionally: all), an analysis of detected species vs specimen can be conducted. A tabular output is provided.
Both the resistance statistics and the trend analysis are based on a resistance analysis that GEFAAR performs automatically. In this context, the information on S vs I vs R is processed. For every species and antimicrobial agent, the relative abundance of each category is determined. For the detailed analysis of resistance, 95% confidence intervals are additionally determined, assuming a binomial distribution (n being the number of samples per species and antimicrobial agent, p being the relative abundance of R; method: Clopper-Pearson intervals 10). For a selected year, specimen (optionally: all) and clinic/unit (optionally: all), the resistance statistics generate a tabular and graphical overview of antimicrobial agents vs resistance per species. The trend analysis integrates this information on resistance per antimicrobial agent over all available years for a selected species.
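To make the confidence interval computation concrete, the following Python sketch computes a Clopper-Pearson interval for k resistant isolates out of n tested. It is an illustration only, not the GEFAAR implementation: the bounds are found by bisection on the binomial distribution function so that the example needs only the standard library, and the function names are hypothetical.

```python
import math


def _binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms evaluated in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def _solve(func, target, increasing):
    """Bisection on [0, 1] for func(p) == target, with func monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a proportion k / n."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("require n > 0 and 0 <= k <= n")
    # Lower bound: P(X >= k | p) = alpha / 2, which increases with p.
    lower = 0.0 if k == 0 else _solve(
        lambda p: 1.0 - _binom_cdf(k - 1, n, p), alpha / 2, True)
    # Upper bound: P(X <= k | p) = alpha / 2, which decreases with p.
    upper = 1.0 if k == n else _solve(
        lambda p: _binom_cdf(k, n, p), alpha / 2, False)
    return lower, upper
```

For k = 0 the upper bound has the closed form 1 - (alpha/2)^(1/n), which provides a simple check of the sketch. In R, binom.test() reports the same type of interval.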
In addition, GEFAAR provides the option to execute interactive cluster analyses on one's input data. A set of diverse clustering approaches is available: ordered heatmaps, hierarchical clustering via heatmap, and dimensionality reduction and clustering via Uniform Manifold Approximation and Projection (UMAP) 11. Hierarchical clustering is one of the most common and well-studied clustering approaches. It is robust, provides detailed information on the observations most similar to each other, and is easy to interpret and understand 12. Dimensionality reduction, in contrast, is a more complex approach. Diverse methods are available that transform high-dimensional data to a low-dimensional space, so that visualization by means of 2D plots becomes possible. In GEFAAR, we implemented dimensionality reduction by UMAP. For the analysis of high-dimensional single-cell RNA-sequencing data, for example, UMAP was evaluated as superior to other approaches like principal component analysis (PCA) 13 or t-distributed stochastic neighbor embedding (t-SNE) 11,14,15. With the help of UMAP, a detailed molecular characterization of heterogeneous medulloblastoma could be performed, considering four clinically relevant subgroups 16. Equally, UMAP made it possible to decipher the cellular development of spermatogonia in infertile men 17. It should be noted, however, that as UMAP is a nonlinear dimensionality reduction technique, the axes and exact coordinates in the 2D plots cannot be interpreted as principal components as in PCA.
Information on resistance vs clinic/unit (per species) as well as resistance vs species (per clinic/unit) can be analyzed for a selected year and specimen (optionally: all). For clustering to succeed, the data may only contain a limited number of missing values. For ordered heatmaps, we exclude all antimicrobial agents with information on resistance missing in ≥ 97% of the samples (analysis per species and per clinic/unit). For hierarchical clustering, we first exclude all antimicrobial agents with information on resistance missing in ≥ 70% of the samples. Subsequently, we exclude all samples with information on resistance missing in ≥ 70% of the antimicrobial agents (analysis per species and per clinic/unit). As UMAPs can only be generated on even more complete data, stricter filtration has to be applied: First, all antimicrobial agents with information on resistance missing in ≥ 20% of the samples are excluded. Subsequently, all samples with any missing information on resistance are removed. Based on our experience, there is commonly not enough data left for an analysis per clinic/unit due to the strict filtration required by UMAPs. Therefore, UMAP clustering is only implemented for resistance vs clinic/unit (per species).
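The filtering steps for hierarchical clustering and UMAP can be sketched as follows. This is a minimal Python illustration under the assumption that each isolate is a dictionary mapping agent names to 'R', 'I', 'S' or '-'; the function names are hypothetical and the thresholds are those stated above.

```python
MISSING = "-"  # code for "not analyzed", as defined for the input format


def drop_sparse_agents(isolates, agents, max_missing):
    """Keep agents whose share of missing results is below max_missing."""
    n = len(isolates)
    kept = []
    for agent in agents:
        missing = sum(1 for iso in isolates if iso[agent] == MISSING)
        if n > 0 and missing / n < max_missing:
            kept.append(agent)
    return kept


def drop_sparse_isolates(isolates, agents, max_missing):
    """Keep isolates whose share of missing results over `agents` is below max_missing."""
    if not agents:
        return []
    return [
        iso for iso in isolates
        if sum(iso[a] == MISSING for a in agents) / len(agents) < max_missing
    ]


def filter_for_hierarchical(isolates, agents):
    """Agents missing in >= 70% of samples are dropped first, then sparse samples."""
    agents = drop_sparse_agents(isolates, agents, 0.70)
    return drop_sparse_isolates(isolates, agents, 0.70), agents


def filter_for_umap(isolates, agents):
    """Agents missing in >= 20% of samples are dropped, then any incomplete sample."""
    agents = drop_sparse_agents(isolates, agents, 0.20)
    complete = [iso for iso in isolates
                if all(iso[a] != MISSING for a in agents)]
    return complete, agents
```

The order of the two steps matters: removing sparse agents first means that samples are only judged on the agents that remain in the analysis.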
Clusters are determined using the R package 'NbClust' 18. Altogether, NbClust provides 30 different approaches (referred to as indices in NbClust) for determining the optimum number of clusters. However, a majority vote over all available approaches would result in a considerably increased run-time. To perform hierarchical clustering, we therefore use the fixed configuration: distance = 'euclidean', method = 'ward.D', index = 'duda' 19. To further optimize run-time, a maximum of 5 clusters is considered if < 100 observations are available. Otherwise, a maximum of 10 clusters is considered. In case the algorithm fails to determine an optimum number of clusters, e.g. because no model meets the threshold required by 'duda', the message "no clustering possible" is reported.
To determine a stable clustering for UMAPs, we opted for a trade-off between exploring the agreement of clusters assigned by different approaches and a minimum run-time. The following empirically determined clustering strategy is applied: We choose distance = 'euclidean' and method = 'kmeans'. A minimum of 2 and a maximum of 5 clusters is considered. Clustering is performed using the following approaches: silhouette 20, kl 21, ch 22, scott 23, duda 19 and dunn 24. Every approach reports a quality score for each possible number of clusters (2, 3, 4 and 5). A reliable clustering is assumed to be available if the following criteria are met: (1) At least two approaches out of kl, ch, scott, duda and dunn report the same number of clusters as optimum. (2) The standard deviation over all quality scores assigned by approach kl to the possible numbers of clusters (2, 3, 4 and 5) is ≥ 5. We assume that a superior clustering is characterized by a peak quality score that clearly differs from the other scores assigned. Thus, a high standard deviation is taken as an indicator of a unique clustering. (3) The standard deviation over all quality scores assigned by approach silhouette is ≥ 0.05. The optimum number of clusters is determined by majority vote. Clusters are assigned according to priority: kl > ch > scott > duda. If this approach does not yield a unique clustering result, a corresponding note is displayed.
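The decision rule for UMAP clustering can be sketched in Python as follows. The sketch takes the results of the individual indices as given and only illustrates the three criteria and the vote. It makes two assumptions not stated in the text: the sample standard deviation is used (as R's sd() does), and the priority order kl > ch > scott > duda is read as a tie-breaker among equally frequent cluster numbers. All names are hypothetical.

```python
from collections import Counter
from statistics import stdev

VOTERS = ["kl", "ch", "scott", "duda", "dunn"]   # indices used in criterion (1)
PRIORITY = ["kl", "ch", "scott", "duda"]         # priority order from the text


def choose_cluster_number(best_k, kl_scores, silhouette_scores):
    """Return the chosen number of clusters, or None if no reliable clustering.

    best_k: dict mapping index name -> optimum number of clusters it reports.
    kl_scores, silhouette_scores: quality scores for k = 2, 3, 4 and 5.
    """
    votes = Counter(best_k[name] for name in VOTERS)
    top_count = max(votes.values())
    if top_count < 2:                        # criterion (1): agreement
        return None
    if stdev(kl_scores) < 5:                 # criterion (2): kl scores show a peak
        return None
    if stdev(silhouette_scores) < 0.05:      # criterion (3): silhouette varies
        return None
    winners = {k for k, count in votes.items() if count == top_count}
    if len(winners) == 1:                    # clear majority
        return winners.pop()
    # Assumed tie-break: the highest-priority index whose optimum is a winner.
    for name in PRIORITY:
        if best_k[name] in winners:
            return best_k[name]
    return None
```

Returning None corresponds to the case in which GEFAAR displays a note instead of a clustering.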
GEFAAR is programmed in R. A graphical user interface was developed using R Shiny. Interactive elements have been implemented to enable user-friendly operation. All selection menus are continuously updated based on the user's selection. For example, for a selected specimen, only clinics/units with available data are displayed. Additionally, results of all analyses can be easily exported from within the graphical user interface. The software code, including simulated data, is freely available at https://github.com/sandmanns/gefaar. The R Shiny application can be directly accessed on the public server https://gefaar.uni-muenster.de. The button 'Load demo data' simulates a random set of test data for analysis.

Dataset
In this article, we consider real data from samples collected at the University Hospital Münster (UKM) between 2020 and 2022. The data used in our analysis are routine data, to which we have access based on our daily practice. These data are anonymized. According to federal law, informed consent to process these data is not needed (Gesetz zum Schutz personenbezogener Daten im Gesundheitswesen, Gesundheitsdatenschutzgesetz, GDSG NW, Paragraph 6). The data set's main characteristics are summed up in Table 1 (detailed information available in Supplementary Table S1).
For all three years, a comparable number of samples is available. Of note, focusing on an event-based analysis, duplicate isolates were included if the interval between antimicrobial susceptibility tests was ≥ 7 days, in order to consider changes of antimicrobial resistance over time 25. For all years, data based on the same seven specimens are available: blood culture, deep respiratory secretion, deep swab/tissue, foreign body, punctate/secretion, superficial swab and urine. For reasons of data privacy, all clinics have been renamed.
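The handling of duplicate isolates can be illustrated with a short Python sketch. The text only states the ≥ 7 day interval; the sketch additionally assumes that duplicates are defined per patient and species, that the interval is measured from the most recently included isolate, and that each record carries the hypothetical keys 'patient', 'species' and 'date'. It is not the procedure used to prepare the UKM data.

```python
from datetime import date


def select_isolates(isolates, min_days=7):
    """Keep the first isolate per patient and species, plus later ones tested
    at least `min_days` after the most recently kept isolate."""
    last_kept = {}   # (patient, species) -> date of the last kept isolate
    kept = []
    for iso in sorted(isolates, key=lambda record: record["date"]):
        key = (iso["patient"], iso["species"])
        previous = last_kept.get(key)
        if previous is None or (iso["date"] - previous).days >= min_days:
            kept.append(iso)
            last_kept[key] = iso["date"]
    return kept
```

Measuring the interval from the last included isolate rather than from the very first one allows repeated sampling of the same patient to contribute one event per week at most.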

Results
For the interactive analysis of AMR, we developed the generic framework GEFAAR. It was launched at the UKM on September 1, 2022. Currently, GEFAAR is used for the analysis of 56,852 samples.

Pathogen statistics
The pathogen statistics provide count tables for the number of detected species within a selected year, stratified by the specimens in which they were detected. Integration over all clinics/units vs a specific clinic/unit can be chosen. A cut-off value of ≥ 30 samples, suggested by GLASS 2, is enabled by default. Results showing the top-10 species detected in 2020 vs 2021 vs 2022 over all clinics vs clinic 36 are summed up in Table 2 (screenshots of the interactive output generated with GEFAAR available in Supplementary Information, Figs. S2-S7; exported files containing information on all species are provided as Supplementary Tables S2-S7). It can be observed that Escherichia coli was the most abundant species in samples analyzed at the UKM (2020: 21.4%; 2021: 20.8%; 2022: 18.7%), followed by Staphylococcus aureus (10.0% vs 10.3% vs 10.8%) and Staphylococcus epidermidis (7.6% vs 7.4% vs 7.2%). For clinic 36, E. coli can also be observed as the most abundant species. In second place, however, is Enterococcus faecium (rank 6 over all clinics in 2020, rank 7 in 2021 and 2022).
With respect to specimen, considerable species-dependent differences can be observed, as one would expect. While E. coli is most commonly detected in urine, it is only rarely detected in foreign bodies (e.g. i.v. catheters). However, a slight trend towards an increasing proportion in foreign bodies can be observed (2020: 3.3%; 2021: 3.5%; 2022: 5.8%).
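The count tables of the pathogen statistics can be sketched in Python as follows. The record keys 'year', 'clinic', 'species' and 'specimen' are hypothetical, and the sketch assumes that the ≥ 30 cut-off applies to the total count per species; the text does not specify this detail.

```python
from collections import Counter, defaultdict


def pathogen_statistics(isolates, year, clinic=None, min_count=30):
    """Count species per specimen for one year, optionally for a single clinic.

    Returns rows (species, total, counts per specimen), sorted by decreasing
    total, restricted to species with at least `min_count` isolates.
    """
    table = defaultdict(Counter)
    for iso in isolates:
        if iso["year"] != year:
            continue
        if clinic is not None and iso["clinic"] != clinic:
            continue
        table[iso["species"]][iso["specimen"]] += 1
    rows = []
    for species, by_specimen in table.items():
        total = sum(by_specimen.values())
        if total >= min_count:
            rows.append((species, total, dict(by_specimen)))
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows
```

Passing clinic=None corresponds to the option 'all'; setting min_count=0 disables the cut-off.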

Resistance statistics
For a selected year, specimen (optionally: all), clinic/unit (optionally: all) and species (optionally: all), GEFAAR performs a statistical analysis of resistance. A tabular overview of the antimicrobial agents, the frequency of susceptible (S), susceptible with increased exposure (I) and resistant (R) test results 9, as well as the 95% confidence intervals (CIs) for the resistance rates is generated and provided as 'data sheet antimicrobial agents'. If data on more than one species are available for the selected configuration, information on all species is reported one below the other. In accordance with common practice, evaluation requires ≥ 30 isolates per species 2. In addition, a threshold of 30 is also applied for each antimicrobial agent to ensure validity of the data and a reasonable width of the confidence intervals. To demonstrate the function of GEFAAR, output of the data sheet, providing detailed information on the resistance of E. coli towards antimicrobial agents in 2020 vs 2021 vs 2022 (specimen: urine, clinic/unit: all), is provided in Table 3 (screenshots of the interactive output generated with GEFAAR available in Supplementary Information, Figs. S8-S10; files exported from GEFAAR available as Supplementary Tables S8-S10, sheet 2).
By default, data on antimicrobial agents are sorted by decreasing susceptibility. Ertapenem, meropenem and tigecycline all feature the highest susceptibility rates (100%). The high number of available samples leads to especially narrow confidence intervals for carbapenems (i.e. ertapenem and meropenem).
In addition to the data sheet, a visual summary of the results is generated, focusing on the resistance rates and their 95% CIs ('figures antimicrobial agents'). At a glance, the bar plots allow the identification of antimicrobial agents with the lowest proportion of resistant isolates, including the confidence of this assessment. By accurately selecting the specimen and clinic/unit, a physician can make a decision based on data that exactly match his/her situation. Figures summing up the resistance rates for E. coli (specimen: urine, clinic/unit: all) are available in Fig. 2 (screenshots of the interactive output generated with GEFAAR available in Supplementary Information, Figs. S11-S13; files exported from GEFAAR available as Supplementary Tables S8-S10, sheet 3).

Trend analysis
While all essential information on resistance is already provided by the resistance statistics, manually changing the selected year and re-analyzing the data to explore the development of resistance over time is tedious. Therefore, we additionally implemented a module for trend analysis in GEFAAR. For a selected specimen (optionally: all), clinic/unit (optionally: all) and species (threshold ≥ 30), every antimicrobial agent characterized by sufficient data (≥ 30 samples per year) is analyzed. If one or more years are characterized by insufficient data (< 30 samples), no resistance rate is calculated for the corresponding years. The remaining years, however, are evaluated. The results of a typical trend analysis (specimen: superficial swab, clinic/unit: all, species: S. aureus, antimicrobial agents: erythromycin and moxifloxacin) are provided in Fig. 3 (screenshots of the interactive output generated with GEFAAR available in Supplementary Information, Figs. S14-S15; files exported from GEFAAR available as Supplementary Tables S11-S12).
A point diagram with connected lines shows the development of resistance over time. Confidence intervals are added to the plots, just as for the resistance statistics. For erythromycin (Fig. 3a), a minor decrease in resistance over time can be observed (2020: R = 21.1%; 2021: R = 16.6%; 2022: R = 15.3%). For moxifloxacin, however, the data indicate a considerable increase in resistance (Fig. 3b). In 2020, the estimated resistance rate is R = 25.6% (CI 95% = [20.5-31.2]), while it increased to R = 93.0% (CI 95% = [80.9-98.5]) in the subsequent year. At a glance, visualization by GEFAAR's trend analysis makes it possible to identify this change in resistance rate as a significant increase.
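The per-year evaluation behind the trend analysis can be sketched in Python as follows. The record layout (a 'year' key plus one key per antimicrobial agent) and the function name are hypothetical; the threshold of 30 tested samples per year is taken from the text, and years below the threshold yield None instead of a rate.

```python
def resistance_trend(isolates, agent, min_samples=30):
    """Return {year: share of resistant results, or None if too few samples}."""
    tested = {}      # year -> number of isolates with a result for `agent`
    resistant = {}   # year -> number of those isolates coded 'R'
    for iso in isolates:
        result = iso[agent]
        if result == "-":          # not analyzed: does not count as a sample
            continue
        year = iso["year"]
        tested[year] = tested.get(year, 0) + 1
        if result == "R":
            resistant[year] = resistant.get(year, 0) + 1
    return {
        year: (resistant.get(year, 0) / n if n >= min_samples else None)
        for year, n in sorted(tested.items())
    }
```

Years with insufficient data stay in the result as None, so a plot can show a gap for them while the remaining years are still evaluated.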

Cluster analyses
GEFAAR offers a set of diverse cluster analyses. They allow for detailed evaluation of antimicrobial resistance for a selected year and specimen (optionally: all) to detect and categorize isolates with similar resistance phenotype characteristics. An analysis can be conducted on two levels: (1) per species, and (2) per clinic/unit. All or a user-definable set of species and clinics/units may be evaluated. An analysis per species provides the option to explore the relation between clinics/units and antimicrobial agents. Resistance clusters indicating clonal expansion/outbreaks within one specific or several clinics/units can generally be detected. The following analysis options are available: A heatmap with data ordered by clinic/unit and resistance provides a first overview, identifying clinics/units with increased resistance to one or a combination of several antimicrobial agents. A heatmap with data ordered by clinic/unit and date permits assessment of the development of resistance over time. Thereby, the spread of a species with a specific resistance profile may be detected. Common hierarchical clustering with visualization via heatmap is supported, as is more advanced clustering via dimensionality reduction using UMAP 11. While information on clinics/units, antimicrobial agents and resistance is directly available in clustered heatmaps, it is largely lacking in UMAPs. For the generated UMAPs, GEFAAR provides the option to color clinics/units (to identify clinic-specific resistance profiles at a glance) as well as clusters. Subsequently, additional heatmaps can be generated, providing information on the UMAP clusters as annotation. Heatmaps can be ordered by cluster or clinic/unit. Thereby, details on the resistance profile per cluster and clinic/unit can be further investigated.
To demonstrate the functionalities of GEFAAR, we performed clustering of S. aureus (year: 2021, specimen: superficial swab). Altogether, 747 cases could be evaluated with the selected configuration. A heatmap with data ordered by (1) clinic/unit and (2) date is shown in Fig. 4a; a heatmap with annotated UMAP clusters, ordered by cluster and clinic/unit, is shown in Fig. 4b (heatmap with data ordered by (1) clinic/unit and (2) resistance available in Supplementary Information, Fig. S16; UMAP with colored clinics/units in Fig. S17; UMAP with colored clusters in Fig. S18; heatmap with data ordered by clinics/units and annotated UMAP clusters in Fig. S19; hierarchical clustering could not be conducted; cluster analyses exported from GEFAAR available as Supplementary Data S1).
For the heatmap ordered by clinic/unit and date (Fig. 4a), data on 30 clinics and 30 antimicrobial agents are available. As only lenient filtering for missing data is applied, some antimicrobials are included despite featuring a relatively high level of missing data (95% missing for ciprofloxacin, 94% for moxifloxacin). It can be observed that samples characterized by resistance towards one or more antimicrobial agents are randomly distributed across the different clinics. An accumulation of resistance over the year cannot be observed.
Clustering by dimensionality reduction (UMAP) requires strict filtration of missing values. As a consequence, ciprofloxacin and moxifloxacin had to be excluded from further analysis of S. aureus clusters. Analysis by UMAP shows a clear separation of data (Supplementary Information, Fig. S17). Clustering suggests the presence of four distinct clusters, each of them characterized by a specific resistance profile (see Fig. 4b): Cluster 1 is classical penicillin-susceptible S. aureus. Susceptibility to all relevant antibiotics can be observed. Cluster 4 is typical penicillin-resistant, but oxacillin-susceptible S. aureus, reflecting the marked increase in penicillin resistance in the past century. Isolates in cluster 3 show resistance to penicillin and also to azithromycin, clarithromycin, erythromycin and piperacillin. In most cases, resistance to clindamycin can additionally be observed. While clusters 1, 3, and 4 reflect fairly typical S. aureus that can also be observed in the community, cluster 2 unites diverse isolates with considerably more resistances. Two subclusters can be observed in both the UMAP and the heatmap that can be distinguished as oxacillin-resistant (MRSA) vs oxacillin-susceptible (MSSA). Considering clinics (annotation in second row), no association with any of the four clusters can be observed. Thus, our results indicate that no outbreak, especially of multiresistant S. aureus, has taken place.
To perform an in-depth analysis of the relation between species and resistance, clustering on clinic/unit level is supported. We performed an analysis of clinic 01 (year: 2021, specimen: all). A heatmap with data ordered by species is available in Fig. 5a, hierarchical clustering in Fig. 5b (cluster analyses exported from GEFAAR available as Supplementary Data S2).
For the heatmap ordered by species (Fig. 5a), data on 16 species and 47 antimicrobial agents are displayed. Due to lenient filtration for missing data, species like Mycobacterium avium, characterized by 89% missing data, are included in this general overview. With respect to hierarchical clustering (Fig. 5b), we exclude species and samples with ≥ 70% missing data. As a result, information on only 10 species and 37 antimicrobials remains. The analysis reveals two distinct clusters, characterized by different resistance profiles. However, no major patterns crossing the species boundaries can be observed.

Discussion
In this work, we introduced GEFAAR, a novel, generic approach for assessing AMR in individual hospitals. To the best of our knowledge, GEFAAR is the first application providing not just common pathogen and resistance statistics, but also an easy-to-use interface to perform trend analysis as well as advanced cluster analyses.
It may be argued that a plethora of systems to monitor and analyze AMR already exist. In their systematic review in 2020, Diallo et al. 26 identified 71 surveillance systems. However, these systems are commonly maintained externally. The information they analyze and display differs, in part considerably. Furthermore, the systems are mainly available in developed countries.
Recently, the R package 'AMR' was published to ease working with data on antimicrobial resistance 27. An extensive set of functions is available, e.g. for filtering data, calculating antimicrobial resistance or determining a regression model to predict future AMR. However, the software is primarily a statistical software. Despite the several tutorials provided, advanced programming skills are inevitably required to perform analyses with the R package AMR, including the export of tables or plots exceeding the implemented bar plot option.
We hold the view that a surveillance system works best if it is tailored to local needs and easy to use, which increases acceptance. For this reason, GEFAAR was developed in close collaboration with end-users. Following their requests and suggestions, we implemented an intuitive, user-friendly interface. For the pathogen and resistance statistics, we took our cue from the well-established ARS of the RKI, Germany's public health institute. We developed the configuration panel, the interactive results as well as the Excel export following the RKI example. However, we added further features to this basic interface based on user feedback, e.g. reporting the pathogen statistics for all specimens separately, in addition to total counts. In the same design, we implemented a trend analysis and a set of cluster analyses. Heatmaps allow for visualizing a large amount of information in a clear way. Hierarchical clustering was chosen as a relatively easy and comprehensible way of clustering. Dimensionality reduction and clustering via UMAP was selected as a more advanced clustering approach, providing an option to explore complex patterns of resistance in the high-dimensional data we are dealing with.
While surveillance systems like the German ARS provide database updates only once a year, in GEFAAR, we implemented an upload option. Minimum input format requirements make it possible to analyze a hospital's routine data with respect to AMR. Thereby, GEFAAR provides an easy-to-use option to study AMR, including in small hospitals in rural areas and in developing countries that are often not considered by the common national and international ARS. Furthermore, as GEFAAR allows for the immediate analysis of data, it provides the framework for early detection of emerging AMR clusters so that quick action can be taken. Programming knowledge is not required for any of the analyses to be conducted with GEFAAR. At the UKM, a local server was set up to run our software. Thereby, it can be reached within the hospital's intranet with any web browser as a simple interactive website. No tools have to be installed. Additionally, all data uploaded to GEFAAR for analysis are securely kept within the hospital. As an alternative, GEFAAR can also be run on a local computer, requiring only an installation of R. The software code is freely available at https://github.com/sandmanns/gefaar. In addition, the web application can be directly accessed on the public server https://gefaar.uni-muenster.de. The infection prevention and control (IPC) board of the UKM has recommended the use of GEFAAR to all prescribers.
As future work, we plan to extend the functionalities of GEFAAR. Regarding resistance statistics and trend analysis, options for additionally including data from public databases like the ARS of the RKI will be examined. This would allow a user to better classify the results. Possible bias caused by the selection of samples and tests, leading to an overestimation of resistance compared to the average population, could be investigated. With regard to cluster analyses, we will explore further analyses that, for example, look more closely at the association of clinics/units with resistance clusters. Additionally, alternative clustering approaches and configurations will be further explored, including our algorithm for estimating the optimum number of clusters and the minimum and maximum number of clusters considered.
In conclusion, GEFAAR represents a novel option for the interactive analysis of AMR, providing basic statistics as well as advanced cluster analyses. Due to its generic framework, tabular data can be imported and analyses conducted independently of externally maintained databases. Thereby, GEFAAR provides guidance for empirical antimicrobial therapy and support to detect AMR clusters within or beyond clinics/units if other platforms (e.g. whole genome sequencing) are not available.

Data availability
GEFAAR, including simulated data, is freely available at https://github.com/sandmanns/gefaar. Results of all analyses conducted with GEFAAR during this study are included in this published article and its Supplementary Information files. The dataset analysed during the current study is available from the corresponding author on reasonable request.

Figure 3. Trend analysis showing the development of resistance of Staphylococcus aureus (specimen: superficial swab, clinic/unit: all). (a) For erythromycin, a minor decrease in resistant isolates can be observed. (b) For moxifloxacin, the data indicate a significant increase in resistance.

Figure 4. Exported GEFAAR cluster analyses, evaluating data on Staphylococcus aureus (year: 2021, specimen: superficial swab). Every column represents one sample, every row one antimicrobial agent. Colors indicate blue: susceptible, yellow: susceptible with increased exposure, red: resistant, grey: no data available. (a) Heatmap ordered by (1) clinic/unit and (2) date. (b) Heatmap showing data ordered by UMAP clusters and clinic/unit (annotated clusters in top row).

Table 1. Overview of the real dataset analyzed with GEFAAR. Samples were collected at the University Hospital Münster between 2020 and 2022.

Table 3. Resistance statistics: data sheet antimicrobial agents for Escherichia coli comparing 2020 vs 2021 vs 2022 (specimen: urine, clinic/unit: all). N: number of samples; S%: rate of susceptibility; I%: rate of susceptibility with increased exposure; R%: rate of resistance; 95% CI R: 95% confidence intervals for rate of resistance.