A European-wide dataset to uncover adaptive traits of Listeria monocytogenes to diverse ecological niches

Listeria monocytogenes (Lm) is a ubiquitous bacterium that causes listeriosis, a serious foodborne illness. Along the nature-to-human transmission route, Lm can prosper in various ecological niches. Soil and decaying organic matter are its primary reservoirs. Certain clonal complexes (CCs) are over-represented in food production and represent a challenge to food safety. To gain new understanding of Lm adaptation mechanisms in food, the genetic background of strains found in animals and the environment should be investigated in comparison to that of food strains. Twenty-one partners, including food, environment, veterinary and public health laboratories, constructed a dataset of 1484 genomes originating from Lm strains collected in 19 European countries. This dataset encompasses a large number of CCs occurring worldwide, covers many diverse habitats and is balanced between ecological compartments and geographic regions. The dataset presented here will help to improve our understanding of Lm ecology and should aid in the surveillance of Lm. It also provides a basis for the discovery of the genetic traits underlying Lm adaptation to different ecological niches.
Measurement(s): whole genome sequencing
Technology Type(s): Illumina Sequencing
Factor Type(s): Multi-locus sequence types • Geographic location • Animal associated environment isolates • Food product and food production environment isolates
Sample Characteristic - Organism: Listeria monocytogenes
Sample Characteristic - Environment: Farm • Ruminant • Agricultural soil • Wild animals • food processing building • dairy food product • meat or meat product (from mammal) (us cfr) • chicken meat food product • fish food product • vegetable or vegetable product (us cfr)
Sample Characteristic - Location: Europe


Background & Summary
Listeria monocytogenes (Lm) is a facultative intracellular pathogen responsible for listeriosis, a serious disease affecting both humans and animals. Lm is a ubiquitous bacterium that is found in various ecological niches, including the natural and farm environments 1,2 . In particular, soil is a primary ecological niche of Lm and may thus be important in its transmission from the natural/farm environment to food and the food-processing environment (FPE) 1,2 . Farm animals, in particular ruminants, are an additional important reservoir for Lm and contribute to contamination of the farm environment through fecal shedding 3,4 . In addition, Lm can persist for a long time in soil and the farm environment. Increasing amounts of information are also available on the prevalence of Lm in wildlife, showing that various animal species (e.g., deer, wild boars, bears, foxes, monkeys, rodents, hedgehogs, snails, slugs and birds) can act as vehicles for this pathogen [5][6][7][8][9][10][11] . These findings point to an ecological role of wildlife as a reservoir of Lm and to its potential importance in the Lm infection cycle.
Lm is a genetically heterogeneous species divided into four phylogenetic lineages, of which lineages I and II are the most frequently encountered. Multilocus sequence typing (MLST) classifies Lm into clonal complexes (CCs) and sequence types (STs), which are systematically used to describe its population structure [12][13][14] . Certain epidemiological clones account for the majority of outbreaks and sporadic cases in humans 15 and animals 16 , worldwide 13,17 . The CCs that are commonly found in food and FPE, such as the most common CC9 and CC121, but also CC1, CC2, CC4, CC5, CC6, CC8 and CC37 18 , pose a serious challenge to the food industry 15,18,19 . Moreover, they can persist in FPE for several years [20][21][22][23][24] . Remarkably, CC9 and CC121 are rarely reported in animals or natural/farm environments 18,25 .
In order to improve surveillance and the management of health risks associated with Lm, a deeper understanding of the genetic make-up of strains adapted to food and FPE is required. As part of the Horizon 2020 "One Health" European Joint Programme, the 3-year research project "LISTADAPT" (Adaptive traits of Listeria monocytogenes to its diverse ecological niches; https://onehealthejp.eu/jrp-listadapt/) aimed to identify the genetic mutations and mobile genetic elements underlying the adaptation of Lm to different ecological niches. With this objective in mind, strains were collected from (i) the farm environment and animals and (ii) the natural environment and wild animals, to study their genetic make-up and to compare it with that of strains isolated from food products and FPE. This work was made possible by the LISTADAPT consortium, which included (i) seven national reference laboratories (NRLs) for surveillance of Lm in food, animals and the environment (AT, CZ, DK, FR, IT, NO and SE) and (ii) three research laboratories at INRAE (the French National Research Institute for Agriculture, Food and Environment). Out of the seven NRLs, two are also national public health laboratories (AT and CZ) that are in charge of the surveillance of clinical strains isolated in outbreaks and sporadic cases. In addition, 14 institutes from 12 countries participated as external partners providing isolates.
In this data descriptor, we present a dataset of 1484 high-quality draft genomes originating from Lm strains isolated in 19 European countries within the framework of the LISTADAPT project. The constructed dataset covers a wide genetic diversity of Lm, since it includes about 79 different CCs and singleton STs, including the most prevalent CCs in Europe 15 and worldwide 13,17 . The strains were collected along the whole chain, from the natural environment (wild animals and natural environment) and primary production (farm environment and farm animals with or without listeriosis symptoms) through to FPE and food products.
The constructed dataset provides a better understanding of the Lm transmission routes from the farm/natural environment to food and FPE and improves our understanding of its ecology. The dataset may also help to assess the importance of animal and food strains for human infection. Moreover, it can be used by the scientific community (i) to improve our understanding of the Lm population structure and the Lm evolutionary history, (ii) to facilitate the detection of the emerging Lm clones and (iii) to identify genetic traits related to the adaptation of Lm to particular ecological niches (ecophysiology). Such genetic traits could be used in the development of molecular assays for screening of food/FPE, animal and soil reservoirs.

Methods
Construction of the LISTADAPT dataset (n = 1484). In order to build a dataset suitable for investigating the adaptive traits of Lm to diverse ecological niches, we gathered a curated collection of Lm draft genomes. Strains isolated over the period 2010-2020 were preferred, regardless of their origin of isolation. We considered two geographic levels: (i) the 27 EU countries plus Norway and Switzerland, which are heterogeneous in size, population, climate, ecology and economic activities, and (ii) four European regions (South-West, Central-South, Eastern and Northern), delimited along country borders and roughly equal in surface area, without consideration of other criteria. We selected strains so that they were distributed evenly among these four European regions. The strains were gathered from already available strain collections and extensive sampling campaigns (Fig. 1). The LISTADAPT dataset was divided into two main ecological compartments: (i) the C1 compartment, which included strains from animals and the natural/farm environment (n = 756), and (ii) the C2 compartment, which included strains from food (n = 728) (Table 1).
Strains selected from the initial collection of the LISTADAPT consortium. At the beginning of the LISTADAPT project, the consortium had access to a collection of about 8000 food and animal Lm strains obtained from collaborative projects or national surveillance. Most of these strains were isolated from food, whereas the remainder were isolated from animals (C1 compartment: animal and environmental strains) with a substantial under-representation of certain animal species. This compartment mainly included strains isolated from animals showing listeriosis-related symptoms. Few strains were available from asymptomatic animals, soil and the agricultural environment, originating from three European countries (France, Italy and the Czech Republic).
Animal and environmental strains included in the collection during the LISTADAPT project (n = 756). We collected isolates from animals showing listeriosis-associated symptoms, asymptomatic animals, soil and the environment, in a large number of countries across Europe. These strains were isolated between 1978 and 2019. Regarding environmental niches, the consortium selected strains from continental environments remote from cities, large rivers, estuaries and the marine environment, to avoid selecting human or food strains released into the environment. Detailed strain information is provided in Figshare File 1 26 . However, the six strains described by Szymczak et al. 27 (Table 2) were isolated from parks on city outskirts in Poland, distant from the city center. Similarly, the 47 strains from birds (mainly seagulls) (Hellström et al.) 10 were isolated from localities on the outskirts of Helsinki, Finland (Table 2).
Strains obtained from existing microbial collections (n = 648). To increase the size and representativeness of the Lm genome dataset, the LISTADAPT consortium performed an extensive review of all recent collections of published and unpublished Lm strains and then contacted the researchers in charge of these collections. Finally, 14 external partners (food and veterinary laboratories and research institutes, all dealing with Lm hazards in Europe) collaborated with the LISTADAPT consortium (Tables 2 and 3).
The initial collection included more strains from animals with listeriosis-associated clinical symptoms than without symptoms. In order to reduce the number of strains originating from animals with listeriosis while maintaining maximum diversity of the dataset, we adopted an original method to select the strains based on metadata (e.g., type of sample, geographic sampling location, time of isolation, animal species and molecular typing data such as PFGE profiles). This method relies on Gower's coefficient (GC), which is a dissimilarity measure: the "distance" between two units is the sum of all the variable-specific distances (associated with metadata categories). The GC metric enables the combination of numeric and categorical data and allows weights to be applied to each variable, effectively altering the importance of each metadata category (e.g., geographical region as a more important category than year of isolation). The three steps are: (i) calculating the dissimilarity matrix based on Gower's distance, (ii) clustering the dissimilarity matrix with hierarchical clustering (agglomerative, bottom-up approach) and (iii) assessing the clusters with the "Silhouette" method. The silhouette plot displays a measure of how close each point in one cluster is to the points in the neighboring clusters. An R script available at https://github.com/lguillier/LISTADAPT/tree/master/metadata2assocation was used to perform the selection of strains based on this method. This script takes as input a Comma Separated Values (CSV) file that includes strain ID and metadata information, then outputs a CSV file of selected strains.
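The three steps above can be illustrated with a minimal Python sketch. It is not the R script used in the project: the metadata fields, the weights, the average-linkage rule and the choice of one medoid per cluster as the selected strain are all assumptions made for illustration. The distance is the weighted Gower form, normalized by the sum of the weights.

```python
from itertools import combinations

# Hypothetical metadata records; field names and values are illustrative only.
records = [
    {"id": "s1", "region": "Northern", "host": "cattle", "year": 2012},
    {"id": "s2", "region": "Northern", "host": "cattle", "year": 2013},
    {"id": "s3", "region": "Eastern", "host": "sheep", "year": 2018},
    {"id": "s4", "region": "Eastern", "host": "sheep", "year": 2019},
]
WEIGHTS = {"region": 2.0, "host": 1.0, "year": 1.0}  # assumed: region counts double
NUMERIC = {"year"}


def gower(a, b, weights, ranges):
    """Weighted Gower dissimilarity between two metadata records."""
    total = 0.0
    for key, w in weights.items():
        if key in ranges:  # numeric variable: range-scaled absolute difference
            d = abs(a[key] - b[key]) / ranges[key] if ranges[key] else 0.0
        else:  # categorical variable: 0 if equal, 1 otherwise
            d = 0.0 if a[key] == b[key] else 1.0
        total += w * d
    return total / sum(weights.values())


def distance_matrix(recs, weights, numeric):
    """Step (i): full pairwise dissimilarity matrix."""
    ranges = {k: max(r[k] for r in recs) - min(r[k] for r in recs) for k in numeric}
    n = len(recs)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = gower(recs[i], recs[j], weights, ranges)
    return dist


def agglomerate(dist, k):
    """Step (ii): bottom-up average-linkage clustering, stopped at k clusters."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[x] += clusters.pop(y)  # y > x, so index x stays valid
    return clusters


def mean_silhouette(dist, clusters):
    """Step (iii): mean silhouette width (requires at least two clusters)."""
    scores = []
    for c in clusters:
        for i in c:
            if len(c) == 1:
                scores.append(0.0)  # convention for singleton clusters
                continue
            a = sum(dist[i][j] for j in c if j != i) / (len(c) - 1)
            b = min(sum(dist[i][j] for j in o) / len(o) for o in clusters if o is not c)
            scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)


def medoid(dist, cluster):
    """Member with the smallest total distance to the rest of its cluster."""
    return min(cluster, key=lambda i: sum(dist[i][j] for j in cluster))


dist = distance_matrix(records, WEIGHTS, NUMERIC)
# Keep the number of clusters with the best mean silhouette width.
best_k = max(range(2, len(records)),
             key=lambda k: mean_silhouette(dist, agglomerate(dist, k)))
clusters = agglomerate(dist, best_k)
selected = [records[medoid(dist, c)]["id"] for c in clusters]
print(best_k, selected)
```

Raising the weight of a variable (here the region) makes strains that differ in that variable less likely to fall into the same cluster, which is how the importance of each metadata category can be tuned.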
In the present study, we constructed a large dataset comprising 301 animal and environmental Lm strains from six European countries and published collections (Table 2), as well as 347 animal and environmental Lm strains from 12 European countries that were obtained from non-published collections (Table 3).
Strains collected from sampling campaigns (n = 108). Soil, farm and wild animal samples were collected in nine European countries (Table 4). For the collection of soil samples, the LISTADAPT project members raised awareness and organised crowd-sampling campaigns. All the soil samples were collected from agricultural or wild areas according to a common procedure provided to the samplers, based on the existing recommendations reported in the literature 2,28-30 . The integration of feedback from samplers enabled continuous improvement of the sampling protocol. The sampling campaigns were conducted in 17 areas in seven EU member states, Norway and Switzerland (Figs. 1 and 2, Table 4), namely AT, CH, CZ, FR, IT, NO, SE, SI and SK, resulting in the isolation of 58 Lm strains. Across the 1752 available sampling records, the overall prevalence was 3%. This confirms the low prevalence of Lm in soil reported in the literature (below 1% and up to 6% depending on soil type) 2,29 . Soil strains from AT, FR, SI and SE were isolated by employing a two-step specific enrichment: the first enrichment was performed with modified Listeria Enrichment Broth for 24 h at 30 °C, followed by enrichment in University of Vermont Medium (UVM) enrichment broth for 48 h at 30 °C. Detection of Listeria spp. and Lm was then achieved by specific SYBR Green real-time PCR targeting prs2 and inlA genes, respectively. The samples positive for the presence of Listeria spp. and/or Lm were spread on RAPID'L.Mono agar plates (BioRAD, France). After 24 h incubation at 37 °C, colonies characteristic of Lm and other Listeria species were picked, purified and stored at -80 °C in Tryptone Soya Broth supplemented with 25% (v/v) glycerol. Strains from CH, CZ, IT and SK were isolated with the EN ISO 11290-1:2017 protocol (Horizontal method for the detection and enumeration of Lm and of Listeria spp.).
Regarding the farm and wild animal subcompartments, 50 Lm strains were isolated from sampling campaigns. Three campaigns targeting shelled gastropods sampled in IT, SK and CH resulted in the isolation of six strains (Figs. 1 and 2, Table 4). Sampling campaigns were also carried out for wild deer and reindeer feces in Southern Norway, and for cattle, roe deer, wild boar, wolf, bear and fox feces in the Abruzzo and Molise regions of Italy (Fig. 1, Table 4). Of the 2577 samples collected from vertebrates during the campaigns conducted in IT and NO, 41 isolates were detected, giving an overall prevalence of 1.6%.
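The prevalence figures quoted for the sampling campaigns can be reproduced with simple arithmetic. The sketch below assumes one isolate per positive sample, so that 58 of 1752 soil records and 41 of 2577 vertebrate samples count as positive. The Wilson 95% interval is an addition made here for illustration and is not reported in the source.

```python
from math import sqrt


def prevalence(positives, samples, z=1.96):
    """Return the point prevalence and a Wilson score interval (default 95%)."""
    p = positives / samples
    denom = 1 + z ** 2 / samples
    centre = (p + z ** 2 / (2 * samples)) / denom
    half = z * sqrt(p * (1 - p) / samples + z ** 2 / (4 * samples ** 2)) / denom
    return p, centre - half, centre + half


# Counts taken from the text; "positive sample = one isolate" is an assumption.
soil = prevalence(58, 1752)
vertebrates = prevalence(41, 2577)

for name, (p, low, high) in (("soil", soil), ("vertebrates", vertebrates)):
    print(f"{name}: {100 * p:.1f}% (Wilson interval {100 * low:.1f}-{100 * high:.1f}%)")
```

Under this assumption the soil prevalence is about 3.3%, which the text rounds to 3%, and the vertebrate prevalence is about 1.6%.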
Food strains included in the collection during the LISTADAPT project (n = 728). The food strains (C2 compartment) were classified according to the five main categories of risk food matrices for Lm defined by the European Food Safety Authority (EFSA) 31 : dairy products (n = 119), fish and fishery products (n = 165), meat products (n = 246), vegetables and fruits (n = 95), and composite dishes (food products combining several food categories) (n = 103). Six NRL project partners (AGES, ANSES, DTU, IZSAM, SLV and VRI) were instructed to target a maximum of 30 strains per food category from their strain collections, preferring strains isolated in the last 10 years. This time period was extended for the under-represented categories (vegetables and fruits); the final dataset included strains originating from the 2002-2020 period. We excluded raw materials from the selection based on the assumption that they could be contaminated by strains originating from farms or animals. The 728 strains from the C2 compartment were isolated along the food chain, from food processing plants to food retail, in several EU countries (Table 1). Detailed strain information is provided in Figshare File 1 26 .
Complete LISTADAPT dataset (n = 1484). The final LISTADAPT strain dataset that we constructed in collaboration with external partners was balanced with regard to the two main compartments: C1 (animals/environment, n = 756) and C2 (food/FPE, n = 728) (Table 1). The geographic distribution covered 19 of the 27 EU countries plus Norway and Switzerland (Figs. 1 and 2).
Although the C1 compartment (n = 756) covered a 41-year period (1978-2019), most of the strains (75%) had been isolated since 2010. This panel covered all successive years between 2009 and 2019 in at least three European regions (Fig. 1c). Between 2008 and 2019, except for the year 2013, the C1 compartment covered all successive years in the following three categories of subcompartments: farm animals, wild animals and natural/farm environment (Fig. 1d).
Although the C2 compartment (n = 728) covered an 18-year period (2002-2020), most of the strains (78%) were recent, i.e. isolated between 2013 and 2019 (Fig. 1b). This panel covered all successive years between 2013 and 2019, as well as the five major categories in at least three European regions (Fig. 2b).
Overall, the 1484 strains clustered into 137 MLST STs, which belonged to 54 CCs and 25 singleton STs (Fig. 3). For 22 strains, the allele profile was unknown (novel ST) or incomplete; when six out of seven MLST alleles were present, a CC was assigned when possible.
Standard strain nomenclature. In order to facilitate data sharing between partners, we adopted a standard nomenclature for strain identification (ID). This nomenclature was used as metadata codification to allow for fast identification of the geographic region and detailed isolation source of the strains (e.g., wild animal, food product or farm environment). In more detail, the LISTADAPT code has between 10 and 15 characters; the first two letters (level 1) correspond to the country code (ISO 3166-1 alpha-2 code), which is followed by a code detailing the origin of the strain and the sample type (levels 2 to 4, depending on the nature of the source). Briefly, level 2 details the type of sample (e.g., animal species, environment and food categories) and level 3 details the nature of the sample (e.g., type of animal sample, type of food and nature of environmental sample). Level 4 gives additional information about the sample (e.g., type of preparation for food and health status of the animals). The code ends with a sequential number for each country, generated when the strain was added to the collection. For example, the strain DE-RDE-CP-13 was isolated in Germany (DE) from a roe deer (RDE) as a clinical strain (CP), and it was the 13th strain from Germany included in the dataset. Supplementary Table S2 provides a detailed overview of the employed LISTADAPT codification.
Whole Genome Sequencing and genomes data analysis. The next generation sequencing (NGS) paired-end reads (2 × 150 bp) were generated during the project with Illumina platforms. Four LISTADAPT partners (AGES, IZSAM, ANSES and DTU) performed most of the sequencing. Figshare File 1 26 lists the sequencing technology and the center which performed the library preparation and produced the sequences.
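As an illustration of how such IDs could be handled programmatically, the following sketch splits a code into its levels. It assumes that the fields are hyphen-separated, as in the example DE-RDE-CP-13; the class and function names are hypothetical, and the actual code lists are those of Supplementary Table S2, which are not reproduced here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StrainCode:
    country: str             # level 1: ISO 3166-1 alpha-2 country code
    levels: Tuple[str, ...]  # levels 2 to 4: origin and sample type codes
    number: int              # sequential number within the country


def parse_strain_id(strain_id: str) -> StrainCode:
    """Split a LISTADAPT-style ID, assuming hyphen-separated fields."""
    parts = strain_id.strip().split("-")
    # One country field, one to three level fields, one sequential number.
    if not 3 <= len(parts) <= 5:
        raise ValueError(f"unexpected number of fields in {strain_id!r}")
    country, *levels, number = parts
    if len(country) != 2 or not (country.isalpha() and country.isupper()):
        raise ValueError(f"invalid country code in {strain_id!r}")
    if not number.isdigit():
        raise ValueError(f"invalid sequential number in {strain_id!r}")
    return StrainCode(country, tuple(levels), int(number))


code = parse_strain_id("DE-RDE-CP-13")
print(code.country, code.levels, code.number)
```

The sketch only checks the overall shape of the ID; validating each level against the official code lists would require the content of Supplementary Table S2.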
The genomes were all de novo assembled and annotated with a harmonized in-house workflow named ARTwork (Assembly of reads and typing workflow) 32 used in the ANSES Laboratory for Food Safety. In addition to de novo assembly, the ARTwork pipeline also performs genome annotation using Prokka 33 . This whole genome sequencing (WGS) workflow has been described in detail in previous publications 32, [34][35][36] , including the integrated bioinformatics tools and their corresponding versions, enabling repeatability and comparability of the results.
Table 4. List of 108 animal and environment strains from sampling campaigns. CS, Clinical Symptoms. The reported clinical symptoms included rhombencephalitis, abortion, septicemia and mastitis/subclinical mastitis. The type of clinical samples included cerebellum/brain tissue, aborted fetus, fetal membrane, liver, internal organs, feces and milk. NS, No listeriosis-associated Symptoms.
Quality control of WGS data. Poor-quality reads or assemblies as well as contaminations can significantly affect gene prediction and cluster analyses 38,39 . Different WGS metrics and quality criteria were thus employed in the ARTwork pipeline to ensure high-quality WGS data. Reads with an estimated depth of coverage < 30 × (as estimated by BBmap 40 ) as well as contigs and scaffolds with a length of < 200 bp were excluded (n = 22). Draft genomes with a total length outside the range of 2.7-3.3 Mb and with a total number of scaffolds > 200 (n = 46) were also excluded. In addition, inter- and intra-species contamination of reads was determined using the recently developed ConFindr software (v0.5.1) 41 . As recently demonstrated, inter- and intra-species contamination of up to 10 single nucleotide variants (SNVs), as assessed by ConFindr in the conserved core genes, does not significantly impact cluster analysis 39 .
We therefore excluded all genomes whose number of contaminating SNVs exceeded this cut-off (n = 12), as well as genomes affected by various read- or assembly-related errors (n = 34).
The employed WGS metrics and quality criteria of the complete LISTADAPT genome dataset are reported in Figshare File 1 26 . In total, 114 sequenced genomes were of unsatisfactory quality after quality control and were thus excluded from the final dataset. After quality control of NGS and WGS data, the final LISTADAPT dataset included 1484 genomes.
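The quality criteria of this section can be summarized as a simple filter. The sketch below is not part of the ARTwork pipeline: the dictionary keys are hypothetical, each criterion is treated as an independent reason for rejection, and genomes with more than 10 contaminating SNVs reported by ConFindr are rejected, which is our reading of the cut-off. It also checks that the exclusion counts reported in the text add up to the 114 excluded genomes.

```python
def qc_failures(genome):
    """Return the list of quality criteria that a draft genome fails."""
    reasons = []
    if genome["depth"] < 30:
        reasons.append("depth of coverage < 30x")
    if not 2.7e6 <= genome["total_length"] <= 3.3e6:
        reasons.append("total length outside 2.7-3.3 Mb")
    if genome["n_scaffolds"] > 200:
        reasons.append("more than 200 scaffolds")
    if genome["confindr_snvs"] > 10:
        reasons.append("more than 10 contaminating SNVs")
    return reasons


# Hypothetical records with illustrative values only.
good = {"depth": 85, "total_length": 2_950_000, "n_scaffolds": 24, "confindr_snvs": 0}
bad = {"depth": 22, "total_length": 3_450_000, "n_scaffolds": 310, "confindr_snvs": 14}

print(qc_failures(good))  # empty list: the genome passes
print(qc_failures(bad))   # all four criteria are reported

# Exclusion counts reported in the text, by reason.
excluded = {
    "low coverage or short contigs": 22,
    "genome length / scaffold number": 46,
    "contamination": 12,
    "read or assembly errors": 34,
}
total_excluded = sum(excluded.values())
print(total_excluded)
```

Returning the list of failed criteria rather than a single pass/fail flag makes it easy to report why a given genome was discarded.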
Metadata and WGS data sharing. All metadata and WGS data collected herein were centralized and processed with standardized criteria for common nomenclature and NGS/WGS quality control before sharing between project partners. Reads normalized to 100 × coverage, draft assemblies (contigs and scaffolds) and annotated genomes (Genome Feature Format, GFF, and GenBank format, GBK) were also centralized in a MongoDB database located at ANSES (Maisons-Alfort Laboratory for Food Safety), providing quickly available, ready-to-use data.
Raw (non-normalized) reads for all the Lm strains sequenced in the LISTADAPT collection (n = 1484) were submitted to the NCBI Sequence Read Archive (SRA) for sharing with the LISTADAPT project's partners. Raw (non-normalized) reads for 67 Lm food strains obtained from previous publications 19,42 were submitted to the NCBI SRA database and were linked to their existing accession numbers in Figshare File 1 26 .

Data records
All high-quality WGS data from this data descriptor are available for download at the SRA/ENA public repository, including the sequences already available at the beginning of this study 43 .

Technical Validation
Redundant strains. The LISTADAPT dataset was analyzed by core-genome MLST (cgMLST) analysis, using BioNumerics (Table 5) according to a fixed cgMLST scheme consisting of the 1748 loci of Moura et al. 45 . All strains with genomes presenting fewer than 7 allele differences (AD), isolated in the same year, and sharing the same source of isolation and the same geographic location (same region or country) were considered as redundant. When the geographic location was not available, the provider was used instead. Although the year of isolation was unknown for four strains, they were marked as redundant because of their similar cgMLST profiles (<7 AD). Among the 1484 strains, 157 were identified as redundant. These strains were maintained in the dataset and marked accordingly (Figshare File 1 26 ).
Consistency analysis. The present study includes 648 strains from existing collections and 108 strains isolated in the framework of this study. The strains from historical collections were provided by 19 different laboratories. The management of large strain collections may lead to storage issues such as the presence of two strains in the same tube. Furthermore, the sequencing of the strains involved several handling steps that may lead to human error. For 380 of the 648 strains provided by partners, historical typing data were available. We established links between these typing data and the sequences obtained. These typing data were either conventional serotyping data, molecular serotyping data or MLST data obtained by individual allele sequencing or by mapping from PFGE. For conventional serotypes, the correspondence with the MLST type obtained from WGS was established following Ragon et al. 12 . The correspondence with molecular serotyping was established based on the Hyden et al. 46 mapping system using the SeqSphere software (Table 5). For the strains isolated in Belgium (Table 3), the correspondence with PFGE was applied by our partner, based on the methodology described in Félix et al. 18 . For the strains isolated in Finland (Tables 2 and 3), the correspondence with PFGE was applied by our partners according to their in-house mapping methodology. The observed discordances were investigated with the partners. The strains concerned were re-sequenced if needed and discarded when the discordance remained unresolved. All results are reported in Figshare File 1 26 .
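The redundancy criteria described under "Redundant strains" can be sketched as a pairwise test. This is an illustration rather than the BioNumerics procedure: the field names are hypothetical, loci missing in either profile are ignored when counting allele differences, and an unknown year is treated as compatible, in line with the four strains of unknown year that were marked as redundant on the basis of cgMLST similarity.

```python
def allele_differences(profile_a, profile_b):
    """Count loci called in both cgMLST profiles that carry different alleles."""
    shared = profile_a.keys() & profile_b.keys()
    return sum(
        1
        for locus in shared
        if profile_a[locus] is not None
        and profile_b[locus] is not None
        and profile_a[locus] != profile_b[locus]
    )


def is_redundant(s1, s2, max_ad=7):
    """Fewer than max_ad allele differences plus same year, source and location."""
    years_match = s1["year"] is None or s2["year"] is None or s1["year"] == s2["year"]
    same_context = (
        years_match
        and s1["source"] == s2["source"]
        and s1["location"] == s2["location"]  # or the provider when location is missing
    )
    return same_context and allele_differences(s1["profile"], s2["profile"]) < max_ad


# Toy profiles over 1748 hypothetical loci; allele numbers are arbitrary.
base = {f"locus{i}": 1 for i in range(1748)}
close = dict(base, locus0=2, locus1=2, locus2=2)                # 3 differences
distant = dict(base, **{f"locus{i}": 2 for i in range(8)})      # 8 differences

a = {"year": 2015, "source": "cattle feces", "location": "FR", "profile": base}
b = {"year": 2015, "source": "cattle feces", "location": "FR", "profile": close}
c = {"year": 2015, "source": "cattle feces", "location": "FR", "profile": distant}
d = {"year": 2015, "source": "soil", "location": "FR", "profile": close}

print(is_redundant(a, b), is_redundant(a, c), is_redundant(a, d))
```

In this toy example only the first pair is flagged, because the second pair exceeds the allele-difference threshold and the third pair differs in isolation source.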

Code availability
The ARTwork pipeline, described in the WGS quality control section, is publicly available at https://github.com/afelten-Anses/ARtWORK. The employed bioinformatics tools and their versions are specified in Table 5.

Acknowledgements
This work was supported by the One Health European Joint Programme, European Union's Horizon 2020 research and innovation programme (Grant Agreement No 773830). The authors would like to acknowledge Karol Romero and Karine Capitaine from ANSES, Laboratory for Food Safety, SEL Unit, for their technical help in the strain culture management. The authors also thank Thomas Berger, Jan-Erik Ingenhoff and René Imhof from Agroscope for collecting strains from soil and wild animals in Switzerland and for providing the associated genomes.

Author contributions
All authors read and approved the final manuscript. Benjamin Félix contributed substantially to the study design, soil sampling, acquisition of strains and the corresponding genomes, data analysis as well as writing and editing of the manuscript. Yann Sévellec contributed extensively to the acquisition of strains and the corresponding genomes, data analysis, quality control of the WGS dataset as well as writing and editing the manuscript. Federica Palma contributed to the acquisition of WGS data, data analysis and quality control of the WGS dataset and to writing and revision of the manuscript. Arnaud Felten, Nicolas Radomski and Ludovic Mallet performed the assembly centralized in a meaningful database. Pierre Emmanuel Douarre contributed substantially to the technical validation of the dataset. Pascal Piveteau participated in the study design and collected strains from natural environment in France. Pascal Piveteau and Eliette Ascensio performed the detection of Lm for part of the soil samples collected by the partners. Christophe Soumet, Arnaud Bridier and Michel Hébraud contributed to the study design and the manuscript revision. Marina Torresi and Francesco Pomilio collected strains from natural environment and provided strains and genomes from animals and food in Italy. Cesare Cammà and Adriano di Pasquale were responsible for sequencing part of the strains. Taran Skjerdal collected strains from natural environment and provided strains from wild animals and food in Norway. Ariane Pietzka and Werner Ruppitsch performed soil sampling and provided WGS data of strains isolated from food in Austria. Renáta Karpíšková collected and provided strains from soil, ruminants and food isolated in the Czech Republic. Tereza Gelbíčová contributed to the selection and characterization of strains, DNA extraction in the Czech Republic and partly to data sequencing. Monica Ricao collected strains from soil and provided strains from wild animals and food isolated in Sweden. 
Bojan Papić collected strains from soil, provided strains from ruminants and natural environment in Slovenia and partly contributed to data sequencing. Bart Wullings provided strains and genomes from ruminants previously isolated from the Netherlands. Hana Bulawova provided strains from wild animals and ruminants isolated in Czech Republic. Hanna Castro, Miia Lindström and Hannu Korkeala collected and provided strains from wild and farm animals from Finland. Žanete Šteingolde and Toomas Kramarenko provided strains from farm animals isolated from Latvia and Estonia. Lenka Cabanova provided strains from ruminants and wild animals isolated from Slovakia. Barbara Szymczak provided strains from soil isolated from Poland. Verena Oswaldi provided strains from farm animals isolated from Germany. Manfred Gareis provided strains from wild animals isolated from Germany. Anne-Mette Seyfarth provided genomes from food strains isolated