Agrimonia: a dataset on livestock, meteorology and air quality in the Lombardy region, Italy

The air in the Lombardy region, Italy, is among the most polluted in Europe because of limited air circulation and high emission levels. There is broad scientific consensus that the agricultural sector has a significant impact on air quality. To support studies quantifying the role of the agricultural and livestock sectors in Lombardy's air quality, this paper presents a harmonised dataset containing daily values of air quality, weather, emissions, livestock, and land and soil use for the Lombardy region in the years 2016–2021. The daily scale is obtained by averaging hourly data and interpolating other variables. The pollutant data come from the European Environment Agency and the Lombardy Regional Environment Protection Agency, weather and emissions data from the European Copernicus programme, livestock data from the Italian zootechnical registry, and land and soil use data from the CORINE Land Cover project. The resulting dataset is designed to be used as is by those using air quality data for research.


Background & Summary
Air pollutants may be categorised as primary or secondary. Primary pollutants are directly emitted to the atmosphere, whereas secondary pollutants are formed in the atmosphere from precursor gases through chemical reactions and microphysical processes. One of the key precursor gases for secondary particulate matter (PM) is ammonia (NH3). This holds true for both large PM with aerodynamic diameter less than 10 µm (PM10) and fine PM with aerodynamic diameter less than 2.5 µm (PM2.5). There is broad scientific consensus that livestock and fertilisers are responsible for ammonia emissions [1,2]. In Europe, around 90% of ammonia emissions originate from the agricultural sector [3], while in the Italian Lombardy region, shown in Figure 1, up to 97% of ammonia emissions are linked to the agricultural sector [4]. According to the Lombardy Regional Environment Protection Agency [5], ammonia is responsible for 60% of PM10 concentrations in Lombardy under specific conditions. This paper presents an open access spatiotemporal dataset, named the Agrimonia dataset [6], which includes several environmental variables that have been harmonised at the same spatial and temporal resolution.
The dataset has been developed within the AgrImOnIA project framework (Agriculture Impact On Italian Air, https://agrimonia.net), which aims at assessing the role of the livestock sector in the air quality of the Lombardy region. The Agrimonia dataset gathers emissions data (including ammonia), agricultural information and air pollution data in a common table. In general, handling spatiotemporal data from several sources is a challenge faced by many research fields and represents an interdisciplinary topic.
The Agrimonia dataset could be useful for other researchers, for example, for comparing urban air pollution and rural air quality [7,8]. Other uses of the dataset include the study of different livestock management techniques and organic products [9], or epidemiological studies that aim to assess the impact of agricultural emissions on the mortality attributable to air pollution [10]. Additionally, the land use and land cover variables included in the Agrimonia dataset provide indices of the degree of urbanisation [11], ecosystem conservation [12] and natural resource exploitation, allowing the degree of local sustainable development to be assessed [13].
The rest of this paper is organised as follows. In the Methods Section, we describe the data sources and the transformations applied to harmonise the dataset, as well as the methodologies used to impute missing data and handle negative values. In the Data records Section, we describe our dataset and the various associated metadata files. Finally, the quality of the dataset and the methods adopted are discussed in the Technical validation Section.

Methods
The Agrimonia dataset includes satellite data, model output and in situ measurements with different spatial and temporal resolutions from national and international agencies. Therefore, to combine the different datasets, a processing step is necessary. The remainder of this section describes the data sources and the harmonisation process applied to the different input data to make them homogeneous in time, with a daily resolution, and in space, at the air quality station level.

Source data description
The data presented are related to five dimensions: air quality (AQ), weather and climate (WE), pollutants' emissions (EM), livestock (LI) and land and soil characteristics (LA). Because geostatistical methods can use information from the neighbouring territory [14] to improve the overall predictive capability close to the borders, we also take into account an area around the Lombardy region by applying a 0.3° buffer to the regional borders, as shown in Figure 2. The neighbouring area intersects several regions. The various data sources used to create the Agrimonia dataset are summarised in Table 1 and described in the following subsections, which detail spatiotemporal resolution and availability.

Figure 2: Administrative Lombardy region (blue boundaries) and augmented region (pink boundaries). A 0.3° buffer is applied to the regional boundaries to create the neighbouring area. The measurement stations of the dataset are displayed as cyan circles for the stations in Lombardy and green squares for the stations in the neighbouring area. The resulting network comprises 141 stations for a total of 540 sensors. The station named 'Corte de Cortesi' (marked as a red circle) is used as a reference throughout the paper.

[5,23], but not formally validated under the same EEA protocols. In this work, data from ARPA are collected using the ARPALData package, written in the R language and available on CRAN (version 1.2.3) (https://cran.r-project.org/web/packages/ARPALData/index.html, accessed on 5 February 2022). The data used for the areas neighbouring Lombardy come from the open access service of the EEA (https://www.eea.europa.eu/themes/air, accessed on 5 February 2022). To give an overview of the AQ data used, Table 2 summarises the pollutants selected, their sources and the number of sensors available for each pollutant.
Each AQ station may measure a different subset of pollutants. For each AQ station, pollutants, spatial location, altitude, station type and other information are available in the station registry metadata file named 'Metadata monitoring network registry.csv' provided with the Agrimonia dataset [6]. EEA and ARPA classify air quality stations according to the land use and emission contexts [16]: urban (U), suburban (S) and rural (R) for the former, and background (B), traffic (T) and industrial (I) for the latter. In particular, the S = 141 stations of the augmented Lombardy region are classified as 42 (UB), 3 (UI), 36 (UT), 25 (SB), 4 (SI), 1 (ST), 18 (RB) and 2 (RI).

Weather
Meteorological data are obtained from the Copernicus Climate Change Service (https://climate.copernicus.eu/, accessed on 27 April 2022) through the ERA5 datasets, which contain the numerical model output computed by the European Centre for Medium-Range Weather Forecasts (ECMWF). ERA5 is the fifth generation ECMWF reanalysis of the global climate for the past decades. The reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of atmospheric science. The ERA5 datasets used here are ERA5-Single level [17] and ERA5-Land [18]. An overview of all ERA5 subdatasets can be found in the official ERA5 data documentation (https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation, accessed on 27 April 2022). ERA5-Single level provides hourly estimates for a large number of atmospheric and land-surface quantities on a regular latitude/longitude grid of 0.25° × 0.25°. ERA5-Land provides near-surface variables over several decades at an enhanced resolution, on a regular grid of 0.1° × 0.1°. To give an overview of the WE variables selected from the ERA5 datasets, Table 3 summarises the WE variables, their sources, descriptions and units.
Relative humidity is useful for studying air quality, so we complete the weather variables by calculating it. Using the temperature (T) and the dew point temperature (T_dew), we compute the relative humidity (RH) with the August-Roche-Magnus approximation formula [24].
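As an illustration of this computation, the following Python sketch evaluates the August-Roche-Magnus approximation as a ratio of saturation vapour pressures. The coefficients 17.625 and 243.04 °C are the commonly quoted Magnus values and are an assumption here, since the constants actually used are not listed in this text; the function name is hypothetical, and both inputs are assumed to be already converted to degrees Celsius.

```python
import math

# Commonly quoted Magnus coefficients; ASSUMED, not taken from the paper.
A = 17.625
B = 243.04  # degrees Celsius


def relative_humidity(temp_c: float, dew_point_c: float) -> float:
    """Relative humidity in percent from temperature and dew point (deg C)."""
    # Vapour pressure at the dew point over saturation vapour pressure at T;
    # the common pre-factor of the Magnus formula cancels in the ratio.
    at_dew_point = math.exp(A * dew_point_c / (B + dew_point_c))
    at_temperature = math.exp(A * temp_c / (B + temp_c))
    return 100.0 * at_dew_point / at_temperature


rh_saturated = relative_humidity(20.0, 20.0)   # dew point equals temperature
rh_dry = relative_humidity(25.0, 15.0)         # dew point 10 degrees below
print(round(rh_saturated, 1), round(rh_dry, 1))
```

When the dew point equals the air temperature the ratio is one, so the function returns 100%; a lower dew point gives a value below 100%.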

Emissions
The Copernicus Atmosphere Monitoring Service (CAMS), implemented by the ECMWF, is one of the most recent global databases covering anthropogenic source emissions. CAMS datasets are compiled emission inventories for many atmospheric compounds developed for the years 2000–2022 [25,19]. These inventories are based on a combination of existing datasets and new information, describing anthropogenic emissions from fossil fuel use on land, natural emissions from vegetation, soil and more. The anthropogenic emissions on land are further separated into specific activity sectors (e.g. traffic, agriculture). Pollutant emissions data are provided by the CAMS-anthropogenic emissions dataset (https://permalink.aeris-data.fr/CAMS-GLOB-ANT, accessed on 27 April 2022), which contains monthly global anthropogenic and natural emissions from 36 sources on a regular grid. The anthropogenic sources are divided into 20 sectors (including agriculture and livestock) with a spatial resolution of 0.1° × 0.1°. Table 4 summarises the emission variables selected, their origins, descriptions and units.

Livestock
Slurry and manure have the highest concentration of nitrogen (N) in a soluble form, which is a precursor of ammonia gases. The animal categories most responsible for N emissions are swine and bovines [26]; for this reason, only the data related to these species are included in the Agrimonia dataset. Information about livestock is obtained from the Italian National Data Bank of the Zootechnical Registry (BDN) [20]. The BDN dataset is derived from the livestock census and includes information on the animal population of zootechnical interest present in Italy, its distribution in the territory and its characteristics.

Table 3: WE variables selected from ERA5 datasets.
February 2022), translated as Direzione Generale della Sanità Animale e dei Farmaci Veterinari of the Italian Ministry of Health and represents the official source of data on livestock, both for the control authorities and users. The BDN dataset is accessible through the "statistics" section of the BDN portal (https://www.vetinfo.it/j6_statistiche/index.html#/, accessed on 15 February 2022). The BDN data are updated every six months and aggregated at the municipality level. From the BDN, we take the municipal numbers of swine and bovines for the augmented Lombardy region. Figure 3 shows the number of swine and bovines in the augmented Lombardy region, both of which are particularly high in the south-eastern areas.

Land
Land cover, land use and soil use are considered important factors for assessing the impact of agriculture on PM [1,27]. The land data are retrieved from different sources. For the land cover variables (only the high and low vegetation indices) we use the ERA5-Land [18] dataset already introduced in the Weather Section. The land use variables are provided by the Corine Land Cover (CLC) [21] dataset, while the soil use variables are given by the Lombardy Region Agriculture Information System (SIARL) [22] dataset. Table 5 summarises the land cover, land use and soil use variables selected from the ERA5-Land, CLC and SIARL datasets, respectively. In the metadata files provided with the Agrimonia dataset [6], the class labels for the CLC and SIARL datasets are available in the files named 'Metadata LA CORINE labels.csv' and 'Metadata LA SIARL labels.csv', respectively.

Data harmonisation and processing
The previous section described the different spatial and temporal resolutions of the input data; here we introduce the methods used to harmonise the data before merging them into the Agrimonia dataset. We also consider missing data imputation and some variable transformations. In the metadata file named 'Metadata Agrimonia.csv', provided along with the Agrimonia dataset [6], columns 9–14 summarise the original spatial and temporal resolutions and the transformations converting the variables to the same spatial and temporal resolution, i.e., daily quantities at the station level.

Air quality
The AQ data, from ARPA and EEA networks, have some issues related to missing measurements; that is, they have both short and prolonged periods with stations turned off for maintenance, instrument calibration or other reasons.Furthermore, measurements could be taken at irregular intervals, or a sampling policy change could take place.Measurements not validated by the environmental agencies and negative values are considered as missing values ('NaN').
As reported in Table 2, daily, bi-hourly and hourly AQ measurements are all present in the network. Since some time series are hybrid (i.e. their time resolution changes within the series), we have considered the distribution of the time gaps between consecutive measurements. A time series with constant temporal resolution (e.g. daily) has a unimodal gap frequency distribution. Conversely, a time series with a hybrid time resolution results in a bimodal gap frequency distribution.
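The gap-distribution check can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to build the dataset: the function names and the `min_share` threshold (used to ignore rare gaps caused by isolated outages) are hypothetical, and "bimodal" is simplified to "more than one gap length with a sizeable share".

```python
from collections import Counter
from datetime import datetime, timedelta


def gap_distribution(timestamps):
    """Relative frequency of each gap between consecutive timestamps."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    counts = Counter(gaps)
    total = sum(counts.values())
    return {gap: n / total for gap, n in counts.items()}


def is_hybrid(timestamps, min_share=0.05):
    """Flag a series whose gap distribution has more than one sizeable mode.

    min_share is a HYPOTHETICAL threshold: gap lengths rarer than this are
    treated as isolated outages, not as a second sampling regime.
    """
    shares = gap_distribution(timestamps).values()
    return sum(share >= min_share for share in shares) > 1


# Synthetic example: ten days of hourly data followed by bi-hourly data.
start = datetime(2021, 1, 1)
hourly = [start + timedelta(hours=h) for h in range(240)]
bi_hourly = [hourly[-1] + timedelta(hours=2 * (k + 1)) for k in range(120)]

print(is_hybrid(hourly))              # constant resolution
print(is_hybrid(hourly + bi_hourly))  # two sampling regimes
```

In the synthetic example, the purely hourly series has a single gap length, whereas the concatenated series shows two gap lengths (one and two hours) with comparable shares.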
Since several missing values within a day could bias the daily average, we implemented the following algorithm. For each hourly and bi-hourly time series, missing values are imputed using a state-space model [29] and the related Kalman smoother [30], which provides an estimate of the missing data and their uncertainty. Next, hourly and bi-hourly time series are averaged over each day. Days with a gap larger than six hours are set to missing. The Kalman smoother uncertainty associated with the hourly estimates is propagated to the daily average, thus providing daily uncertainties due to the missing data imputation. Details on this approach are discussed in the Technical validation Section. The resulting imputation uncertainty is reported in the metadata file 'Metadata AQ imputation uncertainty.csv' provided along with the Agrimonia dataset [6].
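The algorithm above can be sketched as follows. This is a simplified illustration under several assumptions that are not confirmed by the text: the state-space model is taken to be the scalar form x_t = α x_{t-1} + b u_t, y_t = x_t + ε_t with unit-variance noises; α and b are passed in as fixed, hypothetical values rather than estimated by maximum likelihood; a "gap" is read as the longest run of consecutive missing hours; and observed hours are assigned zero imputation variance, so that days without missing hours have zero uncertainty.

```python
import numpy as np


def kalman_smooth(y, alpha, b, r=1.0):
    """Smoothed state mean and variance; NaN observations are skipped."""
    y = np.asarray(y, dtype=float)
    n = y.size
    xp, pp = np.empty(n), np.empty(n)    # one-step predictions
    xf, pf = np.empty(n), np.empty(n)    # filtered estimates
    x_prev, p_prev = np.nanmean(y), 1e4  # vague initial state (assumption)
    for t in range(n):
        xp[t] = alpha * x_prev
        pp[t] = alpha**2 * p_prev + b**2
        if np.isnan(y[t]):               # missing hour: no update step
            xf[t], pf[t] = xp[t], pp[t]
        else:
            gain = pp[t] / (pp[t] + r)
            xf[t] = xp[t] + gain * (y[t] - xp[t])
            pf[t] = (1.0 - gain) * pp[t]
        x_prev, p_prev = xf[t], pf[t]
    xs, ps = xf.copy(), pf.copy()        # Rauch-Tung-Striebel backward pass
    for t in range(n - 2, -1, -1):
        j = alpha * pf[t] / pp[t + 1]
        xs[t] = xf[t] + j * (xs[t + 1] - xp[t + 1])
        ps[t] = pf[t] + j**2 * (ps[t + 1] - pp[t + 1])
    return xs, ps


def longest_gap(missing):
    """Length of the longest run of consecutive missing hours."""
    longest = run = 0
    for flag in missing:
        run = run + 1 if flag else 0
        longest = max(longest, run)
    return longest


def daily_mean_with_uncertainty(y, alpha, b, max_gap=6):
    """Daily means and imputation variances from an hourly series."""
    y = np.asarray(y, dtype=float)
    xs, ps = kalman_smooth(y, alpha, b)
    missing = np.isnan(y)
    filled = np.where(missing, xs, y)    # impute only the missing hours
    var = np.where(missing, ps, 0.0)     # observed hours: zero variance
    n_days = y.size // 24
    means = np.full(n_days, np.nan)
    variances = np.full(n_days, np.nan)
    for d in range(n_days):
        day = slice(24 * d, 24 * (d + 1))
        if longest_gap(missing[day]) > max_gap:
            continue                     # day is set to missing
        means[d] = filled[day].mean()
        variances[d] = var[day].sum() / 24**2
    return means, variances


# Synthetic three-day hourly series with a short and a long gap.
rng = np.random.default_rng(0)
hours = np.arange(72)
y = 30 + 5 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, hours.size)
y[30:33] = np.nan   # three-hour gap on day 2: imputed
y[50:60] = np.nan   # ten-hour gap on day 3: day discarded
means, variances = daily_mean_with_uncertainty(y, alpha=0.99, b=2.0)
print(means, variances)
```

In this example the first day has no missing hours and therefore zero imputation variance, the second day is averaged over observed and imputed hours with a positive variance, and the third day is returned as 'NaN' in both outputs.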

Weather
This section describes in detail the harmonisation process for the WE variables (see Table 7). The WE data come from the ERA5 datasets and are hourly reanalysis estimates on a regular grid. Note that each ERA5 value refers to the grid cell average, while the coordinates refer to the centre of the cell. Because the data stem from a model, missing and negative values are not an issue. Some variables need to be preprocessed to be more informative. The preprocessing steps applied to the original variables are as follows: the wind speed is calculated as the Euclidean norm of the wind vector with u- and v-components; the wind direction is discretised using the classical 8-wind rose: North (N), North-east (NE), East (E), South-east (SE), South (S), South-west (SW), West (W), North-west (NW); the temperature is converted from kelvin to degrees Celsius.
The transformation of the weather variables into daily time series consists of two stages. The first creates an hourly weather time series for each AQ monitoring station, while the second computes daily time series from the hourly ones. The first stage is necessary because the AQ stations are misaligned with respect to the weather grid, as shown in Figure 4a. To associate a weather time series with each AQ station, we use the inverse distance weighted (IDW) interpolation algorithm [31]. The IDW algorithm is based on the Euclidean distance between the station location and the centres of the grid cells. For each station, we consider the four nearest grid cells. The IDW power parameter, which controls how the weight of a cell value on the interpolated value decreases with the distance of the cell from the station, is set to one. After the hourly time series has been created for each station using the IDW approach, we convert the temporal resolution from hourly to daily using different aggregation functions according to the variable type [32]. Table 7 lists the weather variables in the Agrimonia dataset, while Figure 4b shows an example of a time series obtained using the IDW approach.
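A minimal Python sketch of the wind preprocessing and of the IDW step is given below. It is an illustration only: the function names are hypothetical, the wind direction is assumed to follow the meteorological convention (the direction the wind blows from), and distances are computed directly on longitude/latitude in degrees, which is one possible reading of "Euclidean distance".

```python
import numpy as np

SECTORS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def wind_speed(u, v):
    """Euclidean norm of the wind vector (u, v)."""
    return np.hypot(u, v)


def wind_sector(u, v):
    """8-wind rose label of the direction the wind blows FROM (assumed)."""
    degrees = np.degrees(np.arctan2(-u, -v)) % 360.0
    return SECTORS[int(((degrees + 22.5) % 360.0) // 45.0)]


def idw_nearest(station_xy, cell_xy, cell_values, power=1.0, k=4):
    """IDW value at a station from its k nearest grid-cell centres."""
    dist = np.linalg.norm(cell_xy - station_xy, axis=1)
    nearest = np.argsort(dist)[:k]
    if dist[nearest[0]] == 0.0:       # station exactly on a cell centre
        return cell_values[nearest[0]]
    weights = 1.0 / dist[nearest] ** power
    return np.sum(weights * cell_values[nearest]) / np.sum(weights)


# Synthetic 3 x 3 grid with 0.25 degree spacing and arbitrary cell values.
lons, lats = np.meshgrid([9.0, 9.25, 9.5], [45.0, 45.25, 45.5])
cell_xy = np.column_stack([lons.ravel(), lats.ravel()])
cell_values = np.array([0, 1, 2, 10, 11, 12, 20, 21, 22], dtype=float)

# A hypothetical station equidistant from the four south-western cells.
station = np.array([9.125, 45.125])
print(idw_nearest(station, cell_xy, cell_values))
print(wind_speed(3.0, 4.0), wind_sector(0.0, -5.0))
```

Because the example station is equidistant from its four nearest cell centres, the weights are equal and the interpolated value is the plain average of the four cell values; with a power parameter of one, closer cells otherwise receive proportionally larger weights.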

Emissions
This section describes the harmonisation process for the EM variables summarised in Table 8. The EM data come from the CAMS datasets, with a monthly temporal resolution and on a regular grid. As for the WE variables, we apply a two-step transformation to create daily emission time series. In the first step, we use the same IDW approach described for the WE variables to compute monthly emission values at each monitoring station. In the second step, we interpolate the monthly series to the daily temporal resolution. To avoid oscillations, overshoots, edge effects and negative values, we use piecewise cubic Hermite interpolating polynomials (PCHIP) [33]. As discussed in more detail in the Technical validation Section, this method interpolates the data smoothly while retaining the data's shape and monotonicity. In summary, each EM variable is transformed spatially with the IDW function and temporally, from monthly to daily resolution, with the PCHIP interpolation function. More detail on the transformation process can be found in the metadata file named 'Metadata Agrimonia.csv' available with the Agrimonia dataset [6].
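The monthly-to-daily step can be illustrated with SciPy's `PchipInterpolator`. The values below are synthetic, and placing each monthly value on the first day of its month is an assumption made for this sketch; the text does not state where the monthly nodes are located in time.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Hypothetical monthly emission values at one station (arbitrary units),
# placed on the first day of January..June of a non-leap year (assumption).
month_days = np.array([0, 31, 59, 90, 120, 151])
monthly_values = np.array([1.0, 1.2, 3.0, 3.1, 0.2, 0.1])

# Evaluate the shape-preserving interpolant on every day in the range.
daily_days = np.arange(month_days[0], month_days[-1] + 1)
daily_values = PchipInterpolator(month_days, monthly_values)(daily_days)

print(daily_values.min(), daily_values.max())
```

Since PCHIP is monotone between neighbouring nodes, the daily values stay within the range of the monthly values, so a non-negative monthly series cannot produce negative daily values.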

Livestock
In this section, we describe the harmonisation process for the LI variables summarised in Table 9. The data related to the livestock sector are retrieved from the BDN dataset, which provides the number of bovines and swine aggregated at the municipality level. The BDN dataset is updated every six months, in June and in December. As a result, for each municipality, a time series of 12 values is available. Each AQ station is associated with the time series of the municipality to which it belongs; see Figures 5a and 5b for swine and bovines, respectively. Owing to the particular shape of its municipality, the station named 'Vallelaghi T1191A' lies within a municipality whose centroid is outside the augmented domain considered; therefore, the value of the closest municipality within the domain is taken.
As for the CAMS data, PCHIP interpolation is used to increase the temporal resolution from biannual to daily (for more details, see the Technical validation Section). Once the interpolation function has been chosen, it can be evaluated over the entire time horizon, in particular at every day. To reduce edge effects, we use the value for 31 December 2015 as the starting value of the time series. Subsequently, the municipal animal density is calculated by dividing the animal count by the area of the station's municipality (expressed in km²). In this way, we obtain the daily time series of swine and bovine density for each monitoring station. See Table 9 for a summary of the livestock variables in the Agrimonia dataset.
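The livestock steps can be sketched in the same way. All numbers below are hypothetical, and dating the census updates to 30 June and 31 December is an assumption (the text only states that updates occur in June and December); the 31 December 2015 node plays the role of the extra starting value used to reduce edge effects.

```python
from datetime import date

import numpy as np
from scipy.interpolate import PchipInterpolator

# Hypothetical head counts for one municipality at assumed census dates.
census_dates = [date(2015, 12, 31), date(2016, 6, 30), date(2016, 12, 31),
                date(2017, 6, 30), date(2017, 12, 31)]
head_counts = np.array([1200.0, 1250.0, 1100.0, 1100.0, 1300.0])
area_km2 = 25.0  # hypothetical municipal area

# Interpolate the counts on every day of 2016-2017, then convert to density.
nodes = np.array([d.toordinal() for d in census_dates], dtype=float)
days = np.arange(date(2016, 1, 1).toordinal(),
                 date(2017, 12, 31).toordinal() + 1)
daily_density = PchipInterpolator(nodes, head_counts)(days) / area_km2

print(len(daily_density), daily_density[0], daily_density[-1])
```

Starting the nodes one day before the first day of the study period means that every daily value is interpolated rather than extrapolated.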

Land
This section describes in detail the harmonisation process for the LA variables, which are summarised in Table 10. The data related to land cover, land use and soil use are given by the ERA5-Land, CLC and SIARL datasets, respectively, describing the land and soil over time. Since the high and low vegetation indices from ERA5-Land have a daily resolution on a regular spatial grid, we use the same IDW approach described for the WE variables to create the daily time series associated with each AQ station. Information about land use is relatively stable over time. For the CLC dataset, we take the 2018 data and keep them constant for the period from 2016 to 2021. For the SIARL dataset, the values are annual until 2019 (for more detail on land use and land cover, see the Technical validation Section). The CLC provides categorical data in polygons, while SIARL does so on a regular grid. In both cases, each AQ station is associated with the polygon or the cell to which it belongs. In the metadata files provided along with the Agrimonia dataset [6], the class labels for the CLC and SIARL datasets are available in the files named 'Metadata LA CORINE labels.csv' and 'Metadata LA SIARL labels.csv', respectively. In this way, we obtain daily piecewise constant functions for land cover and soil use associated with each AQ station.

Table 9: LI variables in the Agrimonia dataset with daily temporal resolution. Information on the number of swine and bovines is expressed as a density with respect to the municipal area: number/km². More detail on the transformation process can be found in the metadata file named 'Metadata Agrimonia.csv' available with the Agrimonia dataset [6].
LI: Municipal density of swine related to AQ stations
LI bovine: Municipal density of bovines related to AQ stations

Data records
The output dataset has been built by joining the daily time series related to the air quality (AQ), weather (WE), emission (EM), livestock (LI) and land (LA) variables discussed in the previous sections, all referring to the same AQ monitoring stations in the Lombardy region augmented by the 0.3° buffer, depicted in Figure 2.
The dataset and the metadata files are available on the Zenodo repository [6] as follows:
• Agrimonia Dataset.csv: this is the Agrimonia output dataset joining the daily time series, at station locations, related to the AQ (see Table 6), WE (see Table 7), EM (see Table 8), LI (see Table 9) and LA (see Table 10) variables. To simplify access to the variables in the Agrimonia dataset, each variable name starts with the dimension of the variable.
• Metadata Agrimonia.csv: this is the main Agrimonia metadata file and provides further information on the sources used, variables imported, transformations applied and Agrimonia variables.
• Metadata monitoring network registry.csv: it contains details about the AQ monitoring stations including station type, NUTS3 code, environment type, altitude, monitored pollutants and others.Each row represents a single sensor.
• Metadata AQ imputation uncertainty.csv: it contains the estimate of the daily uncertainty due to missing data imputation for the AQ time series.In particular, for each AQ variable, days without missing hours have zero uncertainty, days with one or more imputed hours have a number resulting from the propagation of the uncertainty in the daily averaging, and days with a 'NaN' in the concentrations have a 'NaN' also in the uncertainty.
• Metadata LA CORINE labels.csv: it contains labels and descriptions associated with the CLC land variables (column CORINE code).
• Metadata LA SIARL labels.csv: it contains labels and descriptions associated with the SIARL land variables (column SIARL code).
The Agrimonia dataset is consistent with the following reference systems: World Geodetic System 1984 (WGS84) [34] for georeferencing, Coordinated Universal Time (UTC) for time referencing and the International System of Units (SI) for the metric system, except for the temperature, which is expressed in degrees Celsius (°C). The values in the dataset are represented in scientific notation with 4 significant digits, here considered sufficient for statistical modelling purposes. The coordinates (latitude, longitude) have a fixed point representation with 9 significant digits for the correct identification of the stations' locations.
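The dimensions stated for the main dataset file (141 stations, daily records from 1 January 2016 to 31 December 2021, and the column blocks header, AQ, WE, EM, LI and LA) can be cross-checked with a few lines of Python. The column names in the last part are hypothetical, and the underscore in the prefixes is an assumption; the snippet only illustrates how the dimension prefix can be used to select a block of variables.

```python
from datetime import date

N_STATIONS = 141
first_day, last_day = date(2016, 1, 1), date(2021, 12, 31)

# Number of daily records per station (both end dates included).
n_days = (last_day - first_day).days + 1
n_rows = N_STATIONS * n_days

# Column blocks as described for the dataset file.
blocks = {"header": 5, "AQ": 7, "WE": 16, "EM": 7, "LI": 2, "LA": 4}
n_columns = sum(blocks.values())
print(n_days, n_rows, n_columns)

# HYPOTHETICAL column names, used only to show prefix-based selection.
columns = ["IDStations", "Time", "AQ_pm10", "WE_temp_2m", "LI_bovine"]
aq_columns = [name for name in columns if name.startswith("AQ_")]
print(aq_columns)
```

The day count includes the two leap years 2016 and 2020, and the block sizes add up to the stated total number of columns.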

Technical validation

Validation of AQ imputation methods
The imputation of missing values in the hourly and bi-hourly time series is performed using the State-Space Model (SSM) [29] and the related Kalman smoother [30]. For any hour $t \in \{1, \ldots, 24T\}$, where $T$ is the number of days, let $x_t$ be the scalar state describing the dynamics of the underlying "true" AQ concentrations and let $y_t$ be the scalar observation of the hourly AQ series. Moreover, let $u_t$ and $\varepsilon_t$ be Gaussian white noises with unit variance representing the innovation and the measurement error, respectively, with $u_t$ and $\varepsilon_t$ uncorrelated. In the SSM used here for missing-data imputation, the $\alpha$ and $b$ parameters describe the dynamics and the additive error structure on the state $x_t$, respectively. Both $\alpha$ and $b$ are estimated for each hourly time series by numerical optimisation of the likelihood function, with initial values set to one. Assuming that the hourly state errors $x_t - \hat{x}_t$ are uncorrelated, we propagate the imputation uncertainty given by the smoother through the mean of the generic day $d$: the variance $\sigma_d^2$ of the daily mean is the sum of the smoothed-state variances of the hours in $d$, divided by $24^2$.
A validation experiment concerns the station called 'Bergamo Via Meucci'. ARPA Lombardy provides the data at both hourly and daily resolution for this station. The hourly time series is longer but has several missing values. We therefore verify the performance of the imputation process by comparing the daily data obtained through the Kalman smoother with the daily data provided by the agency. Figure 6a shows the two daily time series: the blue line depicts our method and the orange crosses are the ARPA Lombardy data. Figure 6b shows the imputation uncertainty of the daily average concentrations computed from the imputed hourly time series. The time series obtained by our imputation method is very close to the daily one, with a Root Mean Square Error (RMSE) equal to 0.1710.
Another important experiment concerns the stations that sample pollutants bi-hourly. In this case, if the sampling frequency is regular during the day, we do not expect bias problems in the associated daily time series, although we still have 12 missing values for each day, one every other hour. This situation occurs, for example, at the station called 'Vigevano Via Valletta', where the data on PM2.5 concentrations are available with a bi-hourly frequency in the second half of 2021. Figure 7a shows the daily mean (blue dotted line), computed without considering missing values, compared to the daily mean computed with our approach.
Because the CLC data are available only for 2018 in the study period (2016–2021), in this work we assume the land use to be constant. This seems appropriate given that the overall area used by various sectors (such as agriculture and infrastructure) changes slowly over time. This assumption is also consistent with the fact that the AQ network stations have constant station types. Soil use, instead, changes faster because the type of cultivation can be rotated for greater yield. As an example, Figure 9 shows the soil use provided by the SIARL dataset for the station named 'Corte de Cortesi'. The figure shows that the type of cultivation near the station has changed over time.

Usage notes
This paper presents an open access spatiotemporal dataset, named the Agrimonia dataset [6], which brings together ammonia (NH3) emissions, agricultural information, air pollution and meteorology at a common spatiotemporal resolution. The dataset is ready to be used 'as is' by those using air quality data for research. The Agrimonia dataset and metadata can be accessed through Zenodo with DOI: https://doi.org/10.5281/zenodo.6620530. In the same repository, metadata and supplementary materials are provided to better understand the dataset. As previously mentioned, this dataset was initially compiled for the AgrImOnIA project and will be updated when new variables become available.

Figure 1: The localisation of the Lombardy region in Northern Italy surrounded by the Alps.

Figure 3: Number of swine (a) and bovines (b) in the augmented Lombardy region and neighbouring area, aggregated at the municipal level on 31 December 2021.

Figure 4: (a) The irregularly located AQ monitoring stations (cyan circles) and the weather grid centres (red '+' symbols). (b) Daily 2 m temperature time series (WE temp 2m) from 2016 to 2021 for the monitoring station named 'Corte de Cortesi' (see red point in Figure 4a).

Figure 5: Swine (a) and bovine (b) density over the augmented Lombardy region on 31 December 2021. The AQ stations (cyan circles) are spread irregularly over the studied area. Each station is associated with the centroid of the municipality to which it belongs (red star). Consequently, stations in the same municipality share the same municipality centroid and have the same livestock time series.

Figure 8: Piecewise cubic spline, PCHIP and Makima interpolation methods applied to swine time series for the monitoring station named 'Corte de Cortesi'.

Figure 9: Piecewise constant function for soil use provided by the SIARL dataset for the station named 'Corte de Cortesi'. Note that the SIARL dataset covers data up to 2019 only. The class labels for the SIARL dataset are available in the file named 'Metadata LA SIARL labels.csv' available with the Agrimonia dataset [6].

Table 1: Sources of the Agrimonia dataset.
The AQ data are pollutant concentrations [µg/m³] sampled at S = 141 ground-level monitoring stations, irregularly located over the augmented Lombardy region, as shown in Figure 2. For the Lombardy area, AQ data are retrieved from the environment protection agency of the Lombardy region (ARPA Lombardy, hereinafter ARPA), while outside Lombardy, AQ data are provided by the European Environment Agency (EEA). Most of the concentrations listed in Table 2, namely PM2.5, PM10, NO2, NOx, CO and SO2, come from the open data system of ARPA (https://www.arpalombardia.it/Pages/Aria/qualita-aria.aspx, accessed on 5 February 2022) and are validated under EEA protocols. Instead, data about NH3 come from experimental monitoring campaigns, implemented by ARPA according to laboratory best practices

Table 2: AQ pollutants concentrations [µg/m³] data sources, descriptions, sampling frequency and number of available sensors.
-Single level. ERA5-Single level and ERA5-Land datasets can be downloaded through the Climate Data Store portal (https://cds.climate.copernicus.eu/cdsapp#!/home, accessed on 27 April 2022) on a
The BDN dataset is managed by the Directorate General for Animal Health and Veterinary Medicines (https://www.salute.gov.it/portale/temi/p2_5.jsp?lingua=italiano&area=sanitaAnimale&menu=tracciabilita, accessed on 15
Type of precipitation on the Earth's surface. Values of precipitation type are: no precipitation (0), rain

Table 4: EM variables selected from the CAMS-anthropogenic emissions dataset.
Waste burning: Emissions of NH3 originating from agriculture waste burning [kg/(m² s)]
The CLC dataset, handled by the Copernicus Land Monitoring Service (CLMS) (https://www.copernicus.eu/en/copernicus-services/land, accessed on 27 April 2022), is available only for the year 2018 and consists of an inventory of land use in 44 classes with a minimum mapping unit of 25 hectares. The classes are organised in the hierarchical 3-level CLC

Table 5: LA variables selected from the ERA5, CLC and SIARL datasets.
High vegetation index: One-half of the total green leaf area per unit horizontal ground surface area for high vegetation type [m²/m²]

Table 6 summarises the AQ variables in the Agrimonia dataset, providing name, spatial and temporal transformations and time resolution.

Table 6: AQ pollutants concentrations [µg/m³] in the Agrimonia dataset. All AQ variables included in the Agrimonia dataset are harmonised to the daily time resolution. More details on the transformation process can be found in the metadata file named 'Metadata Agrimonia.csv' available with the Agrimonia dataset [6].

Table 8: Pollutant EM variables [mg/m³] present in the Agrimonia dataset with daily temporal resolution.

Table 10: LA variables in the Agrimonia dataset with daily time resolution. More detail on the transformation process and the labels for categorical variables can be found in the metadata files named 'Metadata Agrimonia.csv', 'Metadata LA CORINE labels.csv' and 'Metadata LA SIARL labels.csv', provided with the Agrimonia dataset [6].
For example, the names of the variables related to the AQ dimension start with 'AQ '. Missing data are denoted by the 'NaN' value. The Agrimonia dataset has S = 141 monitoring stations and T = 2192 days between 1 January 2016 and 31 December 2021. The dataset comprises S × T rows, each of which is uniquely identified by the pair of station code and date in "YYYY-MM-DD" format. It contains 41 columns in the following blocks: header (station code, latitude, longitude, date and altitude), AQ (7 columns), WE (16 columns), EM (7 columns), LI (2 columns) and LA (4 columns). This file is also made available in the .mat and .Rdata formats for MATLAB and R software, respectively.
$\sigma_d^2 = \frac{1}{24^2} \sum_{t \in d} \mathrm{Var}(x_t \mid y_1, \ldots, y_{24T})$, where $\mathrm{Var}(x_t \mid y_1, \ldots, y_{24T})$ is the variance of the smoothed state $x_t$ for the hour $t$ considered in the daily mean.