Monitoring the West Nile virus outbreaks in Italy using open access data

This paper introduces a comprehensive dataset on West Nile virus outbreaks that have occurred in Italy from September 2012 to November 2022. We have digitized bulletins published by the Italian National Institute of Health to demonstrate the potential utilization of this data for the research community. Our aim is to establish a centralized open access repository that facilitates analysis and monitoring of the disease. We have collected and curated data on the type of infected host, along with additional information whenever available, including the type of infection, age, and geographic details at different levels of spatial aggregation. By combining our data with other sources of information such as weather data, it becomes possible to assess potential relationships between West Nile virus outbreaks and environmental factors. We strongly believe in supporting public oversight of government epidemic management, and we emphasize that open data play a crucial role in generating reliable results by enabling greater transparency.


Background & Summary
West Nile virus (WNV) is a single-stranded, positive-sense RNA virus belonging to the genus Flavivirus of the family Flaviviridae Rossi et al. (2010), and was first discovered in a Ugandan woman in 1937 Smithburn et al. (1940). These viruses are known as arboviruses (arthropod-borne viruses) and are typically transmitted through the bites of ticks and mosquitoes Holbrook (2017). Before 1990, only sporadic cases and mild outbreaks occurred, except in Israel and France. After 1990, several outbreaks with neurological complications and deaths were observed in Algeria, Morocco, Tunisia, Italy, France, Romania, Israel and Russia. In the summer of 1999, a cluster appeared in New York, and its genomic sequence demonstrated the Israeli origin of the strain Rossi et al. (2010). It is unknown how the virus crossed the Atlantic Ocean, after which it spread further to Canada, the USA, Mexico, the South American Caribbean area, Venezuela, Chile, and Argentina. South Africa and the Western Hemisphere are other infection cluster zones. In Italy, WNV was first detected in Toscana in 1998 Mencattelli et al. (2021). The regions of Emilia-Romagna and Veneto, which surround the Po River delta, were particularly affected. Since then, WNV has been detected every year in the country. To address this ongoing concern, an integrated surveillance plan for arboviruses was initiated in the Northern Italy regions in 2008 and subsequently extended to cover the entire country Calzolari et al. (2020, 2022). Various external factors may contribute to the spread of the virus, including climate change Semenza and Suk (2018), urbanization, ease of travel, and globalization Thomas et al. (2014). Temperature anomalies, in particular, have been found to influence WNV transmission in Europe. They can alter the geographic range of vectors, the aerial

Methods
The main data sources for this study are the bulletins published in PDF format by the Italian National Institute of Health (ISS), in collaboration with Office V of the Ministry of Health's General Directorate for Preventive Healthcare and the Research Centre for Exotic Diseases (Centro studi malattie esotiche, CESME) of the Istituto Zooprofilattico Sperimentale dell'Abruzzo e del Molise "Giuseppe Caporale" (IZS Teramo). The surveillance initiative started in 2012 following the enactment of DGPRE 0012922-P-12/06/2012, and initially focused on monitoring neuro-invasive infections in humans. It was subsequently expanded in 2017 to include animals, specifically mosquitoes, birds, and equids. Therefore, the available data include information on WNV infections in humans from June 2012 and in animals from August 2017.
The data production process, from digitization to the release of WNVDB, is composed of four main steps, summarized in Figure 1. In the collection of bulletins step, we downloaded a total of 163 bulletins (up to November 2022) from the EpiCentro website, accessible at https://www.epicentro.iss.it/westnile/bollettino. After the download, in the classification of information step, we organized these bulletins according to the corresponding surveillance category. A standard bulletin typically consists of the following sections: (i) a textual description of human cases categorized by region and infection type, (ii) a section reporting human cases by province and infection type, including a table presenting neuro-invasive cases at the provincial level categorized by age group, (iii) a section providing information on verified outbreaks in equids at the regional and provincial levels in tabular form, (iv) separate sections describing cases in target species and wild birds, and (v) a final section reporting the number of mosquitoes caught and tested positive for the virus at the regional and provincial levels. The most challenging sections were (i) and (ii), which involved extracting information from unstructured textual data. The remaining sections were also not trivial, but were facilitated by the use of an automatic tool called Tabula (https://tabula.technology/), which enabled the extraction and conversion of tables from the PDF files directly into data frames.
Afterwards, a data pre-processing step was required to ensure the coherence and consistency of our dataset. Specifically, (i) a standardized procedure was adopted to encode geographic information following the ISTAT nomenclature, also including longitude and latitude, and (ii) weekly cases were derived by calculating the first differences of the officially reported cumulative cases ("casi_totali") in the bulletins. Here, two main issues had to be addressed. First, the computation of the first differences yielded some negative counts for the weekly cases ("nuovi_casi") of WNV: cumulative counts of epidemiological indicators are known to be often affected by data quality issues, mainly due to delays in reporting and/or measurement errors Jona Lasinio et al. (2022). Therefore, "nuovi_casi" was set to Not Available (NA) whenever the current week's value of "casi_totali" was smaller than the value of "casi_totali" in the previous week. Second, not all the information about the surveillance was available across years and for the different types of host. Indeed, missing values were sometimes present in different variables: for example, the name of the province was not always indicated for neuro-invasive human infections, but reported as "Not indicated"; the age class was also sometimes missing. In all these cases, we decided to leave the record in the database, as a higher level of aggregation (e.g. regional) could still be possible.
To check the validity of our procedure, we repeatedly selected a random sample of 100 observations from the outbreak records and manually verified that the information matched the corresponding PDFs. In the case of a mismatch, the procedure was re-run until convergence. Finally, the resulting dataset was released on a GitHub repository. The whole protocol was repeated for all bulletins, separately for each year, with the computational runtime varying from 30 minutes for early bulletins with limited data to several hours for more recent bulletins that contained additional details such as infection type and/or host information.
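The first-difference step described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the column names "casi_totali" and "nuovi_casi" come from the text, while the grouping column "regione", the toy numbers, and the treatment of the first week (taken equal to the cumulative value) are assumptions.

```python
import numpy as np
import pandas as pd


def weekly_from_cumulative(df, group_col="regione",
                           cum_col="casi_totali", new_col="nuovi_casi"):
    """Derive weekly cases as first differences of cumulative cases.

    Rows are assumed to be sorted by week within each group.
    """
    out = df.copy()
    diffs = out.groupby(group_col)[cum_col].diff()
    # Assumption: the first bulletin of each group has no predecessor,
    # so its weekly count is taken to be the cumulative count itself.
    diffs = diffs.fillna(out[cum_col])
    # A drop in the cumulative total gives a negative difference: mark it as NA.
    out[new_col] = diffs.where(diffs >= 0)
    return out


# Toy cumulative series (made-up numbers) with one downward revision.
demo = pd.DataFrame({
    "regione": ["Veneto"] * 5,
    "casi_totali": [0, 2, 5, 4, 6],
})
result = weekly_from_cumulative(demo)
print(result)
```

Keeping the NA instead of clipping to zero preserves the fact that the weekly value is unknown, which matters when the series is later modelled.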

Data Records
The database consists of one folder for each year, named from the first to the last available surveillance year. Each folder contains: (i) a subfolder called bollettini that includes all the PDF bulletins published in that year; (ii) a subfolder called dati-andamento-nazionale that describes the national trend of WNV cases. In addition, depending on data availability, there are up to 4 subfolders called dati-sorveglianza-*, where "*" stands for humans, equids, birds, and mosquitoes, each of which describes WNV cases per host at the regional and provincial levels. Figure 2 shows a schematic overview of the database structure, while a more detailed description of the folders is reported below.
Figure 2: Schematic structure of our WNVDB.

Bulletins (bollettini)
This folder includes the original PDF files with the data about veterinary and epidemiological surveillance of WNV as they are published by official sources. Each bulletin has been named "WNV_News_yyyy_#.pdf", where yyyy is the year in which the bulletin was published and # is a sequential identification number.
Trend at the national level (dati-andamento-nazionale)
This folder contains data aggregated at the national level for WNV weekly and cumulative cases. Such data are organized into .csv files, called "wn-ita-andamento-nazionale-yyyy.csv" (where yyyy is the year of monitoring), whose structure is reported in Table 1. In particular, each file has 5 columns and a number of rows equal to $T \times n_{host}$, where $T$ is the number of monitoring weeks and $n_{host}$ is the number of distinct hosts (0=humans, 1=equids, 2=target birds, 3=wild birds, 4=mosquitoes) for which data are available.
Table 1: Structure of the file "wn-ita-sorveglianza-nazionale-yyyy.csv" within the folder dati-andamento-nazionale.

Human surveillance data (dati-sorveglianza-umana)
This folder contains data about WNV infections in humans, at both the regional and the provincial level, organized into two distinct .csv files, namely "wn-ita-regioni-sorveglianza-umana-yyyy.csv" and "wn-ita-province-sorveglianza-umana-yyyy.csv" (where yyyy represents the year of monitoring). The former has a total of 9 columns and each row identifies the weekly number of cases by region and type of infection (i.e., neuro-invasive, fever, blood donor); the latter has a total of 13 columns and each row identifies the weekly number of WNV infections by province, age class (i.e., ≤ 14, 15-44, 45-64, 65-74, ≥ 75) and type of infection. The structure of the two csv files is reported in Table 2 and Table 3, respectively. Note that, from 2013 to 2017, only neuro-invasive cases were reported, while two more types of infection were added in the 2022 surveillance, namely symptomatic and asymptomatic. Finally, although in principle it is possible to aggregate data to the regional level starting from the data at the province level, we decided to keep the two datasets separate, in compliance with the structure of the bulletins published by ISS.
Table 3: Structure of the file "wn-ita-province-sorveglianza-umana-yyyy.csv" within the folder dati-sorveglianza-umana.
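The aggregation from the province to the regional level mentioned above can be sketched with a group-by. The column names "settimana", "regione", "provincia" and "tipo_infezione" are hypothetical (the actual headers are those of Tables 2 and 3), and the rows are made-up toy values.

```python
import pandas as pd

# Toy province-level records (made-up values, hypothetical column names).
province = pd.DataFrame({
    "settimana": [1, 1, 1, 2],
    "regione": ["Veneto", "Veneto", "Emilia-Romagna", "Veneto"],
    "provincia": ["Padova", "Venezia", "Modena", "Padova"],
    "tipo_infezione": ["neuroinvasive"] * 4,
    "nuovi_casi": [3, 2, 1, 4],
})

# Sum weekly cases over the provinces of each region.
# min_count=1 keeps a group as NA when all of its weekly values are NA.
regional = (
    province
    .groupby(["settimana", "regione", "tipo_infezione"], as_index=False)["nuovi_casi"]
    .sum(min_count=1)
)
print(regional)
```

Records whose province is reported as "Not indicated" still carry a region, so they contribute to the regional totals in such an aggregation.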

Mosquito surveillance data (dati-sorveglianza-entomologica)
This folder contains the weekly cases of WNV infections in mosquitoes at the province level. It includes the file "wn-ita-sorveglianza-entomologica-yyyy.csv" (where yyyy is the year of monitoring), the description of which is given in Table 4.
Table 4: Structure of the file "wn-ita-sorveglianza-entomologica-yyyy.csv" within the folder dati-sorveglianza-entomologica.

Equids surveillance data (dati-sorveglianza-equidi)
This folder contains weekly cases of WNV infections in equids at the province level. Unlike the surveillance of other hosts, the number of animals in the outbreak and the number that died are also reported here. The structure of the csv file included in this folder, namely "wn-ita-sorveglianza-equidi-yyyy.csv" (where yyyy is the year of monitoring), is described in Table 5.
Table 5: Structure of the file "wn-ita-sorveglianza-equidi-yyyy.csv" within the folder dati-sorveglianza-equidi.

Birds surveillance data (dati-sorveglianza-uccelli)
This folder contains weekly cases of WNV infections in birds at the province level. For monitoring purposes, cases in birds are reported separately for target and wild species. ISS monitors these two types of birds for the following reasons: wild species can act as reservoirs and amplifiers of viral infection, while target species are sedentary, so they allow viral circulation to be determined in specific areas of the regions. The folder includes two different csv files, one for each type of bird: (i) the csv file "wn-ita-regioni-sorveglianza-uccelli-bersaglio-yyyy.csv" (where yyyy is the year of monitoring) has data about three target species, i.e., Magpie (Pica pica), Gray crow (Corvus corone cornix), and Jay (Garrulus glandarius); (ii) the csv file "wn-ita-regioni-sorveglianza-uccelli-selvatici-yyyy.csv" (where yyyy is the year of monitoring) reports WNV infections in wild birds. Note that WNV has been found in up to 52 different wild bird species across years, but this information was not reported for the 2021 monitoring. The structure of both csv files is reported in Table 6.
Table 6: Structure of the files "wn-ita-sorveglianza-uccelli-bersaglio-yyyy.csv" and "wn-ita-sorveglianza-uccelli-selvatici-yyyy.csv" within the folder dati-sorveglianza-uccelli.
At the time of publication, WNVDB contains more than 7,000 records. To facilitate access to all the available data, we created a single dataset named "latest-wnv.csv", whose structure is reported in Table 7. The file has 12 columns and each row identifies the weekly cases of WNV infections by type of host at the province level. The idea was to provide researchers and practitioners with a compact version of the different surveillance data (excluding specific aspects, such as age and type of infection for human surveillance or species for bird surveillance) that allows a quick comparison of cases across host types and eases data visualization, so that the areas where the virus is more likely to spread can be promptly identified and prevention policies made more effective.

Technical Validation
In this section, we show how WNVDB could be exploited to gain further understanding of the spread dynamics of West Nile virus. First, we use basic summary statistics to describe the available data by type of susceptible host.
For interested readers, we mention that WNVDB data at the province level can be easily joined with environmental factors obtained from the World Weather Online database, accessible at https://www.worldweatheronline.com/. This demonstrates the interoperability of our dataset with other official data sources and shows how to assess possible relationships between the spread dynamics and the environmental conditions. In particular, the literature suggests that the mean maximum temperature in degrees Celsius (°C), the mean total precipitation in millimeters (mm) and the mean wind speed (km/h) affect WNV transmission in different contexts Min and Xue (1996); Reisen et al. (2004); Landesman et al. (2007); Paz and Albersheim (2008); Kilpatrick et al. (2008); Paz and Semenza (2013).
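The join with environmental factors mentioned above could look like the sketch below. It assumes that both tables have already been brought to a common province-by-week key; all column names ("provincia", "settimana", "temp_max_c", "precip_mm", "wind_kmh") and all values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy province-level weekly cases (made-up values).
cases = pd.DataFrame({
    "provincia": ["Padova", "Padova", "Modena"],
    "settimana": [30, 31, 30],
    "nuovi_casi": [3, 5, 1],
})

# Toy weekly weather summaries; Modena is deliberately missing.
weather = pd.DataFrame({
    "provincia": ["Padova", "Padova"],
    "settimana": [30, 31],
    "temp_max_c": [31.2, 33.0],
    "precip_mm": [4.0, 0.5],
    "wind_kmh": [9.1, 7.4],
})

# A left join keeps every case record; unmatched weather fields become NA.
# validate="one_to_one" raises if the key is duplicated on either side.
merged = cases.merge(
    weather,
    on=["provincia", "settimana"],
    how="left",
    validate="one_to_one",
)
print(merged)
```

A left join is used so that no surveillance record is lost when the weather source has no matching observation.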
Overview of WNV infections in the Italian territory over the period 2012-2022
From the beginning of systematic monitoring on September 27, 2012 to the most recent update on November 1, 2022, Italy recorded a total of 1576 WNV infections, with a yearly average of 143 cases. However, as shown in Figure 3a, all outbreaks were mild except those of 2018 and 2022, for which a total of 581 and 599 cases were recorded, respectively. Excluding these two years from the computation yields a yearly average of only 44 cases, corresponding to < 1 case per million residents. More than 40% of human infections were registered in Veneto (713), especially in the provinces of Padova (299) and Venezia (150), and about 25% in Emilia-Romagna (407), with about half of them coming from the provinces of Modena (118) and Bologna (97). These data highlight the importance of controlling for other possibly interrelated factors, such as vector infections and the environmental conditions of the infected areas. Figure 3b shows the distribution by age class for each year. People over 75 are the most affected by symptomatic infections, as they are likely to have other age-related comorbidities. Regarding the type of infection, symptomatic infections (WN-s) amount to 1417, excluding blood donors, who are all asymptomatic, but including both febrile (WN-f) and neuro-invasive (WN-n) cases; in particular, WN-n cases are prevalent almost every year, accounting for up to 87.5% of the total WN-s in 2014, except for 2018 (which is also the year with the largest absolute number of WN-s).
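The summary figures quoted above can be checked with a few lines of arithmetic. The counts are taken from the text; the only assumption is that the yearly average is computed over the 11 surveillance years from 2012 to 2022.

```python
# Counts reported in the text.
total_cases = 1576
cases_2018, cases_2022 = 581, 599
veneto, emilia_romagna = 713, 407

n_years = 2022 - 2012 + 1  # assumption: 11 surveillance years, 2012 to 2022

mean_all = total_cases / n_years
mean_mild = (total_cases - cases_2018 - cases_2022) / (n_years - 2)
share_veneto = veneto / total_cases
share_emilia_romagna = emilia_romagna / total_cases

print(f"yearly average, all years: {mean_all:.1f}")
print(f"yearly average, excluding 2018 and 2022: {mean_mild:.1f}")
print(f"share of cases in Veneto: {share_veneto:.1%}")
print(f"share of cases in Emilia-Romagna: {share_emilia_romagna:.1%}")
```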
Data about animal infections are only available for the last six years (2017-2022), after Italy introduced specific prevention legislation at both the national and regional level. In particular, Figure 4a shows that mosquito infections mostly hit three regions of Northern Italy (Emilia-Romagna, Veneto and Lombardia), with some exceptions for Friuli-Venezia Giulia in 2020 and Piemonte in 2022. Indeed, ≈ 50% of WN mosquito infections are observed in Emilia-Romagna each year, ≈ 25% in Veneto and ≈ 8% in Lombardia, while the remaining 17% is spread among Friuli-Venezia Giulia, Piemonte and Sardegna (only 4 cases in 2018 and 1 case in 2022).
Concerning equids, the majority of the infections were detected in Veneto and Lombardia, where horse breeding is more widespread (see Figure 4b). A substantial decrease of WN cases in horses can be observed in Veneto after 2019, perhaps reflecting the effect of a wide prevention campaign carried out by the region to raise awareness among horse owners. In Lombardia, the highest numbers of horse infections are instead observed in 2019 and 2020.
Finally, Emilia-Romagna detected the largest number of bird infections in almost all years (see Figure 4c), mirroring the pattern of mosquito infections. This higher detection rate, compared with other regions, may be due to better adherence to the bird sampling parameters set by the regional arbovirus prevention law, and to higher awareness among the regional authorities. In 2022 the detection frequency is quite similar for Veneto and Emilia-Romagna, reflecting the higher detection rate both in mosquitoes and in human cases.
Richards' growth Generalized Linear Model
Cumulative incidence of severe epidemic outbreaks typically exhibits an S-shaped trend: the onset of the outbreak precedes an exponential growth phase whose acceleration usually softens after the implementation of one or more prevention policies (e.g. lockdown, vaccination campaign, etc.), eventually reaching an asymptote that can either be constant if the virus is eradicated or time-varying if the virus becomes endemic.
From another perspective, the first differences of such cumulative incidence indicators generally exhibit a bell-shaped behaviour (wave), still highlighting the different phases of the epidemic growth pattern. Borrowing a well-known tool from the biological literature, we here use the Richards' curve Richards (1959) to model weekly cases of WNV at the regional level for the two most severe outbreaks, those of 2018 and 2022 Riccardo et al. (2022). Nevertheless, we argue that this model specification can also be used to model smaller epidemic outbreaks in different regions. In particular, this curve is specified by 5 parameters able to characterize most of the features occurring in epidemiological data, such as the final epidemic outbreak size, the infection growth rate, the peak position (i.e. when the curve growth speed slows down) and the slopes of the ascending and descending phases of the outbreak Tjørve and Tjørve (2010).
For each year and for a given spatial aggregation level, let us denote the weekly time series of cumulative West Nile cases by $\{y^c_t\}_{t=1}^{T}$. We model the cumulative counts independently for each year and geographical unit by assuming that their expected value follows the extended Richards' curve Mingione et al. (2022); Richards (1959), which can be defined as:
$$\mathbb{E}[Y^c_t] = \lambda_\gamma(t) = b + \frac{r}{\left(1 + 10^{h(p - t)}\right)^{s}}, \tag{1}$$
where $\gamma^\top = [b, r, h, p, s]$ is the vector of 5 parameters: $b \in \mathbb{R}^+$ is a lower asymptote, $r > 0$ the distance between the upper and the lower asymptote, $h$ the hill (growth rate), $p \in (0, T)$ the inflection point, and $s \in \mathbb{R}$ the asymmetry parameter. Note that each cumulative count can be written as:
$$Y^c_t = \sum_{\tau=0}^{t} Y_\tau,$$
where $Y_0 = 0$ without loss of generality. In other words, no matter the time scale, cumulative counts at time $t$ are obtained by summing all the new counts from the beginning of the monitoring period up to $t$. Looking at this relationship from the opposite side, new counts at each time $t$ are the difference between the cumulative counts at times $t$ and $t - 1$. Exploiting the linearity of the expected value, from (1) we can easily derive the expected value of new counts at each time point as:
$$\mathbb{E}[Y_t] = \mathbb{E}[Y^c_t] - \mathbb{E}[Y^c_{t-1}] = \lambda_\gamma(t) - \lambda_\gamma(t - 1). \tag{2}$$
We do so to allow for the direct modelling of new counts instead of cumulative ones.
The estimates of the Richards' parameters are summarized in Table 8. Looking at the results of the model for the 2018 outbreak, the goodness-of-fit is satisfactory for both regions, with $R^2 \approx 0.91$ and $R^2 \approx 0.61$ for Emilia-Romagna and Veneto, respectively (see left panels of Figure 5). As is clear from Figure 5c, the management and collection of the data in Veneto are very heterogeneous, and this affects the uncertainty surrounding our estimates. Data issues are well known and widely discussed in the literature Jona Lasinio et al. (2022). However, the Richards' curve is able to capture the average outbreak behavior. In detail, the estimated final epidemic size of the 2018 outbreaks is equal to 0.4417 ($\mathrm{CI}_{0.95}$ = [0.4396, 0.444]) and 0.9104 ($\mathrm{CI}_{0.95}$ = [0.9103, 0.9105]) infections per 1000 residents in Emilia-Romagna and Veneto, respectively, with Veneto facing a higher-risk situation. This is not the only difference between the two regions, as the spread of the epidemic follows a smoother behavior in Veneto than in Emilia-Romagna (see Figure 5a), with the parameter governing the infection growth rate being $\hat{h} = 0.8082$ ($\mathrm{CI}_{0.95}$ = [0.8081, 0.8083]) in Emilia-Romagna and $\hat{h} = 0.4084$ ($\mathrm{CI}_{0.95}$ = [0.4084, 0.4085]) in Veneto. At the same time, however, the decreasing phase is also smoother and slower in Veneto than in Emilia-Romagna, leading to a larger overall outbreak size. The idea that Veneto is more affected than all other Italian regions is confirmed by the analysis of the 2022 outbreak. The estimated final epidemic size of the more recent 2022 outbreaks is equal to 0.30 ($\mathrm{CI}_{0.95}$ = [0.019, 0.87]) and 0.5842 ($\mathrm{CI}_{0.95}$ = [0.5842, 0.5842]) infections per 1000 residents in Emilia-Romagna and Veneto, respectively. The goodness-of-fit is still acceptable and consistent across outbreaks, with $R^2 \approx 0.56$ and $R^2 \approx 0.84$ for Emilia-Romagna and Veneto, respectively. As clearly shown in Figure 5b, the size of the outbreak in Emilia-Romagna is much smaller, and small changes in the number of cases could lead to a more uncertain estimate of the overall outbreak size. Allowing for the different sizes, the characteristics of the region-specific outbreaks are rather similar to those from 2018. To show that the Richards' curve can be used not only to describe the outbreak but also to obtain short-term forecasts, we re-estimate the outbreak behavior leaving out the last three observations and then check whether they fall within our forecast intervals.
For both regions, we are able to forecast short-term events with reasonably small uncertainty. This result can be used in future outbreaks to monitor their evolution and to plan timely interventions, if required. Note that the same approach is also valid for the spread of WNV among animals, which further supports the usefulness of our proposal.
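To make the relationship between cumulative and new counts concrete, the sketch below evaluates a Richards-type mean curve and its first differences. It assumes the five-parameter form $b + r / (1 + 10^{h(p - t)})^{s}$, which matches the parameter descriptions given above but should be checked against the cited references; the parameter values are purely illustrative and are not the estimates of Table 8.

```python
import numpy as np


def richards(t, b, r, h, p, s):
    """Assumed extended Richards curve: expected cumulative counts at time t."""
    t = np.asarray(t, dtype=float)
    return b + r / (1.0 + 10.0 ** (h * (p - t))) ** s


def expected_new_counts(t, b, r, h, p, s):
    """Expected new counts: first difference of the cumulative curve."""
    t = np.asarray(t, dtype=float)
    return richards(t, b, r, h, p, s) - richards(t - 1.0, b, r, h, p, s)


weeks = np.arange(1, 25)
# Illustrative values only (hypothetical, not fitted to WNVDB).
params = dict(b=0.0, r=100.0, h=0.3, p=12.0, s=1.0)

cumulative = richards(weeks, **params)
new_counts = expected_new_counts(weeks, **params)

# The weekly means telescope back to the cumulative curve.
print(new_counts.sum(), cumulative[-1] - richards(0, **params))
```

Under this assumed form the weekly means add up to the growth of the cumulative curve over the monitoring period, and the constant $b$ cancels in the differences.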

Usage Notes
Our dataset is designed to promote a rapid, objective, and consistent epidemiological reading of the available data. For instance, it allows users to monitor the epidemiological trends of West Nile outbreaks in Italian regions and provinces with informative graphical outputs, and to derive insights through predictive modeling. To facilitate other research work and ease the use of WNVDB, all data are stored in a GitHub repository accessible at https://github.com/fbranda/west-nile and released under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, allowing users to share, copy and redistribute the material in any medium or format and to adapt, remix, transform and build upon it for any purpose. At the same link, metadata and supplementary materials are provided to better understand the dataset. As previously mentioned, this dataset was initially compiled with all data up to 2022, but will be continuously updated (and possibly adapted) as soon as new bulletins are published. Usually, the yearly monitoring period for West Nile cases ranges from the end of May to November. Finally, following a collaboration with the Agency for Digital Italy (AgID) to aggregate the database into a single portal, the files of interest can be previewed and downloaded at the link https://dati.gov.it/view-dataset/dataset?id=32a9ef72-ec68-4f47-8df9-ca65c2a0a125.
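As a usage illustration, the helper below builds the location of a national-trend file inside a local copy of the repository, following the folder layout described in the Data Records section. The function name is hypothetical, and it assumes that each yearly folder is named with the four-digit year and that the repository has been cloned to a local directory.

```python
from pathlib import Path


def national_trend_path(repo_root, year):
    """Path of the national-trend csv for a given year in a local clone.

    Assumption: yearly folders are named with the four-digit year.
    """
    return (
        Path(repo_root)
        / str(year)
        / "dati-andamento-nazionale"
        / f"wn-ita-andamento-nazionale-{year}.csv"
    )


path = national_trend_path("west-nile", 2018)
print(path.as_posix())
```

If the file exists at that location, it can then be read with any csv reader, for example `pandas.read_csv(path)`.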
Our goal is to reach as many people as possible, from the individual citizen to the professional, to evolve the current Italian open data portal into a system that provides tools and services developed and shared with the community to extend its potential.

Figure 1: Schematic overview of the key passages to build the WNVDB open-access database.

Figure 3: Monthly time-series of WNV human infections (a) and yearly distribution by age-class (b) in the Italian territory.
Figure 4 reports the yearly distribution of West Nile cases by region for each animal host: (a) mosquitoes, (b) equids, (c) birds.
It is indeed not trivial to specify a suitable probability distribution for the cumulative counts that ensures the natural constraint they must respect, i.e. $Y^c_t \geq Y^c_\tau$, $\forall\, \tau \leq t$. On the other hand, by assuming stochastic independence, conditionally on $\lambda_\gamma(t)$, between the new counts at time $t$ and all the previously observed values, i.e. $Y_t \mid \gamma \perp Y^c_\tau$, $\forall\, \tau < t$, we can easily write the likelihood under any suitable probability distribution for count data, such as the Poisson or the Negative Binomial. Embedding the Richards' specification in the expression of the expected value of the observed counts defines an extended GLM framework that is flexible enough to model several growth curves, such as that of WNV cases. Parameters are estimated by numerical maximization of the log-likelihood, as closed-form solutions are not available. Further insights on the model implementation and estimation are provided in the seminal paper by Alaimo Di Loro et al. (2021), together with additional details on how to compute the forecasts and the related uncertainty.
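The numerical maximization of the log-likelihood can be sketched as below for the Poisson case. This is not the implementation of Alaimo Di Loro et al. (2021): it assumes the Richards form $b + r / (1 + 10^{h(p - t)})^{s}$ for the cumulative mean, in which case $b$ cancels in the first differences, so only $(r, h, p, s)$ are estimated. The data are synthetic, the parameter values are illustrative, and the Nelder-Mead optimizer is an arbitrary choice.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln


def expected_new_counts(t, r, h, p, s):
    """First differences of r / (1 + 10**(h*(p - t)))**s (assumed Richards form)."""
    def growth(u):
        return r / (1.0 + 10.0 ** (h * (p - u))) ** s
    return growth(t) - growth(t - 1.0)


def poisson_nll(theta, t, y):
    """Negative Poisson log-likelihood of weekly counts y, theta = (r, h, p, s)."""
    with np.errstate(all="ignore"):
        mu = expected_new_counts(t, *theta)
    if not np.all(np.isfinite(mu)) or np.any(mu <= 0):
        return np.inf  # reject parameter values that give an invalid mean
    return float(np.sum(mu - y * np.log(mu) + gammaln(y + 1.0)))


weeks = np.arange(1.0, 25.0)
true_theta = (100.0, 0.3, 12.0, 1.0)  # illustrative values, not estimates
rng = np.random.default_rng(0)
y = rng.poisson(expected_new_counts(weeks, *true_theta))  # synthetic counts

start = np.array([80.0, 0.2, 10.0, 1.0])
fit = minimize(poisson_nll, start, args=(weeks, y), method="Nelder-Mead")
print(fit.x, fit.fun)
```

Replacing the Poisson term with a Negative Binomial one would only change the body of `poisson_nll`; the mean specification stays the same.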

Table 7: Structure of the file "latest-wnv.csv".
The summary statistics used in the Technical Validation provide a general picture of the virus in the Italian territory during the whole study period. Particular attention is devoted to the yearly distribution of WNV weekly cases by age class (in humans) and region (in animals) to evaluate possible trends in the spread patterns. Then, we model the yearly WNV outbreaks in 2018 and 2022 for the two most affected Italian regions, i.e. Veneto and Emilia-Romagna, using the Richards' curve Richards (1959). This curve, also known as the Generalized Logistic Function, is very flexible and makes it possible to characterize the main features of an epidemic outbreak and to provide short-term forecasts of WNV evolution. This proposal has already proved successful in modelling other epidemic outbreaks, such as those of SARS, Dengue, Zika, Monkeypox and Ebola Zhou and Yan (2003); Hsieh et al. (2004); Hsieh and Ma (2009); Chowell et al. (2016, 2017); Mingione et al. (2023).