Regulation-based probabilistic substance quality index and automated geo-spatial modeling for water quality assessment

Natural environments are recognized as complex heterogeneous structures thus requiring numerous multi-scale observations to yield a comprehensive description. To monitor the current state and identify negative impacts of human activity, fast and precise instruments are in urgent need. This work provides an automated approach to the assessment of spatial variability of water quality using guideline values on the example of 1526 water samples comprising 21 parameters at 448 unique locations across the New Moscow region (Russia). We apply multi-task Gaussian process regression (GPR) to model the measured water properties across the territory, considering not only the spatial but inter-parameter correlations. GPR is enhanced with a Spectral Mixture Kernel to facilitate a hyper-parameter selection and optimization. We use a 5-fold cross-validation scheme along with \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R^2$$\end{document}R2-score to validate the results and select the best model for simultaneous prediction of water properties across the area. Finally, we develop a novel Probabilistic Substance Quality Index (PSQI) that combines probabilistic model predictions with the regulatory standards on the example of the epidemiological rules and hygienic regulations established in Russia. Moreover, we provide an interactive map of experimental results at 100 m2 resolution. The proposed approach contributes significantly to the development of flexible tools in environment quality monitoring, being scalable to different standard systems, number of observation points, and region of interest. It has a strong potential for adaption to environmental and policy changes and non-unified assessment conditions, and may be integrated into support-decision systems for the rapid estimation of water quality spatial distribution.

and showing the ways to decrease handcrafting (i.e., through the automated kernel structure selection) and increase computational efficiency are of paramount interest and practical importance. It should be noted, that there are other ML tools potentially capable of obtaining similar results, e.g. Support Vector Regression 53 , neural networks [54][55][56] .These models may give reasonable results, however black-box approaches are usually reported to be difficult in interpretation. At the same time, GP benefits over the above mentioned approaches due to the results of GP applications are usually easier interpreted. Still, the comparison between different modeling frameworks to solve the multi-task problems might be the promising direction in advancing assessment approaches.
This paper presents a part of the project aimed to implement the ML techniques to the environmental monitoring issues. Considering the above-mentioned developments of the community, the objective of this research is to show an automated Geographical Information Systems (GIS) approach for the freshwater assessment and spatial modeling applied to the existing sample network based on the data including 1526 samples obtained from 448 unique points across the New Moscow region described by 21 parameters and spatial coordinates 57 .
The detected concentrations of parameters used in the modeling vary significantly across the territory. Some of them, e.g. Cl, NO 3 , PO 4 ions, as well as metal ions, Fe, Mn, Ni, may exceed established permissible limits 2-8 times while their content in other locations may be equal to 0. Our proposed modeling workflow is based on the multi-task Gaussian process regression (GPR) featured by the automatic kernel structure selection and hyper-parameter optimization. An important advantage of the developed approach is that it enables predicting the spatial distribution of all of the measured properties in one consistent procedure. Particularly, it considers both spatial and inter-parameter dependencies and allows to assess not only the precise values of water properties but also their probabilistic ranges as well as enables the accuracy control with the minimal user efforts. This modeling framework is supplied with a limit-driven assessment system: one can easily check whether the selected parameter is in the permissible range (considering the diapason, not only the upper limit) in the selected location. Considering the convenience of the joint concise characteristic to describe the overall system state, we propose a probabilistic substance quality index (PSQI). It incorporates both observations and the established regulations, denoting the probability that all of the characteristics lie in permissible diapasons. The presented approach is devised to meet the requirements to enhance reproducibility and fairness of the assessment. It is open to scaling to different standard systems, set of points of observation, region of interest and has a strong potential for adaption to environmental and policy changes and non-unified conditions of assessment. Therefore, it ensures direct integration into support-decision systems.

Modeling tools and water quality index
In the following section the key concepts and stages of the proposed assessment approach are described in detail. Firstly, the theory behind the Gaussian process regression is explained and the modeling objectives are formulated. In order to deal with one of the key difficulties of the natural systems quality assessment-representative consideration of the overall complexity in spatial modeling-the concept of multi-task Gaussian process regression is given. It is supplied with an explanation of the automatization procedure (hyper-parameter selection) and details for reproducibility: data pre-processing, model validation, and technical requirements to handle calculations. Finally, the definition of the proposed Probabilistic Substance Quality Index is given. The general scheme of the workflow is presented in Fig. 1.  www.nature.com/scientificreports/ Gaussian process regression. In order to perform geo-spatial modeling of multiple water properties from the collected dataset, we refer to the Gaussian process regression (GPR) framework 49 , known as kriging in geostatistics. Mean µ(·) and covariance (or kernel) k(·, ·) functions completely determine a Gaussian process: where E is a mathematical expectation and x ∈ R d is a vector of d input parameters, which are 2D coordinates in our case (for instance, represented in the Mercator projection). As an example, consider a simple GP model: where ǫ ∼ N(0, σ 2 ) accounts for noise in measurements, hence, helping to avoid model over-fitting. Given the training samples where N denotes the number of available measurements and (·) ⊺ denotes a transpose, the predictive distribution at arbitrary point x * can be found as where K x = k x (X, X) = k(x i , x j ), i, j = 1, . . . , N and k x * = k x (X, x * ) are spatial covariance matrices between all of the training points and between training points and the single prediction point, respectively; µ(X) = µ(x i ), i = 1, . . . , N is the mean vector-function evaluated at the training points; and I is an identity matrix. A choice of the mean and kernel functions depends on the assumptions about the model and the particular application. An example of a kernel function is a widely used Gaussian kernel, which corresponds to Gaussian variogram in kriging. The kernel hyper-parameters are usually optimized using Maximum Likelihood Estimation (MLE) 58 or its variations. Figure 2 illustrates an example of GPR using the Gaussian kernel and the constant mean over the sigmoid function with noisy measurements. Predictive variance increases notably at the points with missing measurements. Moreover, outside of the interpolation region, a predictive mean fails to capture the behavior of the underlying model due to the structure of its mean and kernel functions.
There are several issues that should be addressed to perform efficient modeling: • Basic approach allows modeling only a single output or multiple independent outputs, whereas our aim is to capture both geo-spatial and inter-feature dependencies at once. • Naive GPR computational requirements increase cubically with the dataset size, as it requires matrix inversion. • GPR model requires selection of multiple hyper-parameters, e.g., kernel and mean functions.
Multi-task Gaussian process regression. Let's consider a more complex model than in the Eq. (2): where y is a vector of M measured properties, ǫ ∼ N(0, D) with D being an M × M diagonal noise matrix. In order to capture both inter-feature and geo-spatial dependencies in covariance function construction, we refer to multi-task approach 59 : www.nature.com/scientificreports/ where K f is an M × M inter-feature covariance matrix and �·, ·� denotes a scalar product. Then, given the training data X = (x 1 , . . . , x N ) ⊺ ∈ R N×d , Y = y 1 , . . . , y N ⊺ ∈ R N×M , the predictive distribution at the unobserved point x * is found as where ⊗ denotes a Kronecker product, y and µ are flattened N · M-dimensional vectors obtained from N × M matrices y and µ(X) , respectively. As a "side-effect" of this approach, after the model is built and hyper-parameters are optimized, we can analyze the dependencies between modeled properties using matrix K f .
Decreasing computational complexity. One of the main disadvantages of the proposed multi-task GPR approach is that it induces a lot of additional calculations. In the naive case it becomes O(N 3 M 3 ) instead of O(N 3 ) , and to alleviate it, we turn to GPU computations. We perform modeling using Python programming language, GPy-Torch library 60 based on the PyTorch framework 61 and NVIDIA Tesla K80 GPU. However, to further improve performance and to easily account for the possibly correlated components, we parametrize covariance matrix K f using low-rank approximation as follows: where B is an M × r matrix, v is an M-dimensional positive vector and r is the supposed rank of K f .
Hyper-parameter selection. In order to simplify the hyper-parameter selection, we refer to Spectral Mixture Kernel 62 , which can be represented as: where Q is the number of components in the mixture, w q is a weight of the q th component, v Hence, we are able to model arbitrary stationary kernels and control the complexity with a number of components Q in the mixture. Depending on the size and structure of the input data, the number of model parameters may require certain tuning. The main advantage of this approach is that it does not require any handcrafting of the potentially effective kernels, but instead, enables automatic hyper-parameter selection and optimization. Since the kernel assumes stationarity, we use a quadratic polynomial in two variables as the mean function to eliminate potential trend in data: where c 0 , c 1 , c 2 , c 12 , c 11 , c 22 are vectors of size M also being optimized during the model training. Optimization of the hyper-parameters is performed with MLE approach by solving the following maximization problem numerically: where θ denotes all hyper-parameters of the model. Data normalization. Spatial coordinates first were converted from EPSG:4326 (latitude, longitude) format to EPSG:32637 (UTM zone 37N) and, then, scaled down to [0, 1] range using the min-max normalization.
The scaling of the measured properties required a more complex approach. All of the properties are limited from below by zero and from above with some reasonable values, e.g., concentration can not be more than 100%, and, as another example, pH values can lie only in [0, 14] range. Unfortunately, a direct GPR does not consider such limits on outputs as the predictive distribution is normal and has infinite support. To incorporate such bounds we follow a warping approach 63,64 , which allows to map measurements from bounded space to unbounded and apply GPR directly. First, let M-size vectors b L and b U denote bounds of parameters dictated by regulatory documents from the Table 1. For every modeled parameter without the explicit value domain (all, except for pH in our case) we calculate their maximum values across all of the measurements y max = max 1≤i≤N y i . Then, we choose the maximum between the obtained values and b U and multiply it by 10 (to certainly avoid out-of-bounds problem), thus, defining upper limits as 10 · max{y max , b U } . The boundaries for pH are set as [0, 14] and the obtained limits are used for the min-max scaling, mapping all of the parameters to [0, 1] region. Furthermore, to map the scaled parameters to an unbounded space we use the inverse cumulative-distribution function (ICDF) Validation. To validate and compare the trained models we apply a standard cross-validation scheme with 5 random splits, with 80% and 20% of a train and test data, respectively. For each split we (i) perform the model fitting on training data, (ii) obtain predictions for each modeled property for the test data points and (iii) calculate R 2 -score (or the coefficient of determination). This quality metric shows the proportion of the observed variation explained by the variation in the input data using the model. It is equal to one (1) if a predictive error is zero (0), zero (0) if a predictive error equals the test data variance, and negative if it is larger. The particular choice of the metric based on the Mean Square Error (MSE) is justified by our model selection. The predictive mean of GPR is tightly connected with a solution of Kernel Ridge Regression 65 , which involves the minimization of the exact MSE of the training data with additional regularization. As the model is intrinsically trained to minimize the Euclidean error, it is natural to use the MSE-based metrics for its evaluation. To select the best model we have to disregard the poorly modeled properties. Thus, we average R 2 -scores over all splits and remove all of the properties for which the maximum, calculated over every model, yielded a negative value. Then, we average the scores over all kept properties and splits and select the model with the highest value.
Probabilistic substance quality index. To evaluate the water quality we propose a new technique that takes advantage of GPR and allows to incorporate regulatory standards into assessment procedure directly. First, we note that the output of the GPR model is not just a vector of predicted values of properties, but a probabilistic distribution. Namely, at any location it gives a multi-dimensional normal distribution z ∼p(z | x * ) = N μ(x * ),�(x * ) described by the mean vector and covariance matrix from the group of Eqs. (6). Second, we appeal to the fact that the water quality is considered high if concentrations of different elements are located in admissible safe bounds, defined by governmental standards. Taking into account the above mentioned, we propose a measure coined Probabilistic Substance Quality Index (PSQI), which depicts the probability that all the measured properties will be within the admissible bounds. Therefore, it seems natural to integrate the probability density function p(z | x * ) over these bounds. Unfortunately, due to the "curse of dimensionality", an increase in the number of properties leads to a drastic decrease of the integral value and overall interpretability. Thus, we utilize the marginalization approach instead and define PSQI as follows: where w i denotes the importance of each individual property in water quality and b L , b U are admissible bounds for parameters from Table 1. By its construction, it is normalized to [0,1] interval, where zero value corresponds to zero possibility that properties are within the admissible bounds (bad quality) and one corresponds to the opposite (good quality). Since the integral in Eq. (11) does not have an analytical solution for the arbitrary bounds, we use SciPy library 66 to perform numerical integration of multivariate Gaussian probability density. Unfortunately, PSQI alone does not allow us to distinguish between the following cases of the index values being small: (a) predictive mean lies outside of the bounds and predictive variance is small (i.e., water is bad and we are certain about it); and (b) predictive mean lies inside of the bounds and predictive variance is large (i.e., water is good, but we are uncertain about it). To tackle this issue, we propose an additional confidence metric: Confidence is calculated similarly to PSQI, although with an important difference-predictive distribution is centered within the bounds. This way, the confidence value is high, if the standard deviation is much smaller than the bounds and low, otherwise. One can note that PSQI and confidence values are interconnected, e.g., if the predictive distribution is already centered within admissible bounds, then, they match. Figure 3 shows PSQI and confidence calculations for a single modeled parameter ( M = 1).
The selection of weights w i can be governed by the hazardousness of deficiency or excess of individual properties. Unfortunately, it may be not very straightforward to choose particular values of the weights with such an approach. Instead, we use the model training results to determine the "importance" of model properties for PSQI (11) PSQI conf www.nature.com/scientificreports/ computation. During training, we apply five-fold cross-validation and select the best model based on average R 2 -scores for each modeled property. It is worth noting that R 2 -score computed in transformed and original measurement space is different due to the non-linear nature of transformation (see in "Data normalization" section). Thus, we perform the model selection based on R 2 computed in the original state space. For some of the properties the model can perform poorly and yield negative R 2 -score values, which implies that the simple mean has a better predicting capacity than the model. In this case, the properties with non-positive R 2 -scores of their predictive means and variances are replaced with respective dataset means and variances, and all their inter-parameter correlations are considered zero. Further, to deal with the discrepancy of prediction accuracy among different properties, we propose to compute weights from Eqs. (11) and (12) using R 2 -scores and softmax function: where R 2 i denotes R 2 -score obtained for i th property during training. This way, we incorporate modeling accuracy into PSQI computations directly and reduce potential over-or under-estimations of the index. Moreover, it is still possible to incorporate ad-hoc importance of particular properties with an additional re-weighting.

Results
The source code, data, and results can be found in our repository 67 with available interactive visualization via kepler.gl platform.
To avoid serious over-fitting during the training phase, we bounded the corresponding length-scales of the mixture components within [0.1, 100] interval. We trained several models with different number of mixtures Q (from 1 to 5) in the spatial kernel (see Eq. 8) and different ranks r (3, 5, 7, 10, 15) of the inter-feature covariance matrix K f (see Eq. 7). To select the best model, we considered prediction accuracy only for 20 components, except for Cr as it yielded very poor results for every model. Figure 4 shows the average R 2 -score over every kept component and split for each model for different values of Q and r. It can be seen, that increasing model complexity does not necessarily lead to model improvements, as it can bring about over-fitting. Moreover, it typically causes time-performance degradation. The best model corresponds to a single mixture component and rank-10 covariance matrix. Training of the best model took 6.8 s, whereas evaluation over test data took 0.6 s. Figure 5a illustrates per-parameter performance with average, min, and max values of R 2 -score over splits for different properties obtained using the best selected model. We can see that some of the properties were predicted very poorly, such as Cr, Fe, Ni, Cu, NH 4 , NO 2 with the best average score of 0.635 for SO 4 . One of the reasons for the low accuracy may be a high level of noise in the data. It is accounted in the model and estimated during the training phase as an additive Gaussian component with the covariance matrix D from Eq. (4). Large noise values tend to correspond to a poor fit of the model given the training data for a particular water property. Unfortunately, it is not possible to illustrate the noise values in the original state space straightforwardly. The reason is that the modeling is done in normalized parameter state space (see in "Data normalization" section), therefore, in the original space normal noise is not obliged to be normal anymore. We illustrate it with Fig. 5b in a more comprehensive way, dividing the inter-quartile range for each noise component by the (13)  www.nature.com/scientificreports/ respective normalized normative range. For water properties without established restrictions, i.e., K, HCO 3 , Ca, relative noise is zero, and for poorly modeled properties relative noise is very high. As a "side-effect" of the model construction we obtained the optimized K f covariance matrix describing interconnections of water properties, for which, corresponding correlation matrix is shown in Fig. 6. General characteristics such as Alkalinity, Hardness and Mineralization rates are highly correlated with HCO 3 , Ca, Mg, correlation coefficients ( ρ ) lay in diapason from 0. 63

PSQI.
In order to calculate the PSQI values, we have used admissible bounds reflected in local regulations of the Russian Federation (see Table 1). Some of the parameters do not have any regulatory restrictions (Ca, K and HCO 3 ), thus, we excluded them from the calculation of PSQI to avoid overestimation. As could be noted earlier from Fig. 5, the prediction quality of our model differs across the properties and for some of them even yield negative R 2 -scores. Calculation of PSQI at a single point comprises two steps: (a) evaluation of the predictive distribution; (b) computations from Eqs. (11) and (12) (see in "Probabilistic substance quality index" section). We chose the best performing model during the validation stage, fixed all of its hyper-parameters, and used the whole dataset to make predictions at locations of interest. They were uniformly selected across the New Moscow region with 100 m 2 resolution, giving 151 447 points. The evaluation of the predictive distribution took 5.8 hours with approximately 130 ms per point. The subsequent computation of PSQI took only 95 seconds, which can be considered negligible. Figure 7 shows the spatial distribution of PSQI values obtained from predicted distributions, where outlined points denote collected samples. To additionally validate that PSQI indeed corresponds to the fraction of measured parameters being in admissible bounds, we, (i) evaluated PSQI at each sampling location, (ii) calculated such fraction directly for each measurement, and finally (iii) computed Pearson correlation coefficient between them, resulting in the reasonably large value of 0.68. The spatial distribution of PSQI confidence appeared to be of no practical interest, due to its very small scatter from 0.935 to 0.956 with 55% points yielding values less than 0.936.

Discussion
A typical modeling task consists of several pre-processing steps, one of which may be dimensionality reduction used to disregard the non-informative features or find the most "independent" combinations of the features to build the model for. It may help to decrease the computational complexity, but requires a careful dataset pre-processing and leads to information losses. Multi-task GPR allows to perform simultaneous geospatial modeling and capture the inter-feature dependencies while being able to control complexity using parametrization techniques. However, the computational overhead may reach as much as O (N 3 M 3 ) , thus, requiring further considerations of performance improvements 50,51,[68][69][70][71] .
One of key issues for geo-spatial modeling is the model construction process itself. On the one hand, manual selection of the kernel function based on domain knowledge allows to adapt to different areas of application. On the other hand, it limits automatization and scalability of the modeling significantly and causes some difficulty for integrating it into support-decision systems. Spectral Mixture Kernel facilitates the task of model construction, giving the ability to approximate arbitrary stationary kernels and control the accuracy with the number of mixtures. Since the number of mixtures is discrete, it can not be effectively optimized using MLE. In this case www.nature.com/scientificreports/ multiple models (for the different number of mixtures) can be trained and compared using the Cross-Validation technique to pick the best overall solution. However, the increase in the model complexity may lead not only to accuracy improvements, but on the contrary to the over-parametrization and degradation of both computational speed and the quality of predictions. Therefore, effective model construction still requires thorough consideration of both domain knowledge and the dataset structure. Freshwater characteristics usually depend on various factors such as the intensity of geological and hydrogeological settings due to dissolution processes and ion exchanges; seasonal fluctuations and climate change in global, being also affected by anthropogenic loads 72 . The modeled properties show reasonable correlations (Fig. 6), adequate for the natural freshwater resources and explainable for those influenced by urbanization and agricultural activity widespread across the territory of sample net. Ions of HCO 3 , Ca, Mg, SO 4 , Na, Mg, being major macro constituents, are always presented in groundwater, their concentrations depend mostly on the mineral composition of rocks, although surrounding lands as well 73 . The presence of nitrogen forms in the ground and surface water can accompany the agricultural and landfill sites, relating to the migration of products of fertilizers, pesticides or domestic sewage decay [74][75][76][77] . The observed correlations between NH 4 , NO 2 , NO 3 are ranged according to the stages of oxidation transformation of NH 4 . Additionally, substantial correlations www.nature.com/scientificreports/ between Cr, Fe, Cu, Ni, Mn, K, as well as between PO 4 and K also often are explained by the migration of fertilizers' degradation residuals across the landscape and into water sources. Apart from that, high intercorrelation between trace elements in the water samples can be linked to the objects with a high pollution potential, such as landfills, transport systems, industry. Among them, Ni demonstrates a remarkable negative correlation of -0.52 with pH, which can be explained by the reduced migration ability in alkaline conditions and can be supported by the noticeable correlation of -0.48 between pH and Alkalinity. Although more sophisticated variations of WQI have been recently proposed (e.g. based on multi-criteria decision analysis (MCDA) 28 , entropy-weighted indices 30 , modified by principal component analysis for optimized parameter selection 29 ), most of the existing WQI solutions have limitations, e.g. low independence of expertise and, as a consequence, site and case specificity as well as low robustness. In general, estimation of WQI includes the following steps: scaling of the selected parameters if they have different dimensions; selection of the most important parameters according to some rule, including an a priori knowledge; determining the relative weight of each parameter; calculating the sub-index of each parameter from the relative weight, and, finally, summarizing the results and determining the quality rating scale [78][79][80] . However, subjective judgments may cause certain confusion, e.g., in the determination of parameters importance, in the choice of the weight values for each parameter, in comparing the sum with expert-opinion-based ranges, as well as different limits scales of parameters.
As compared to the methodologies discussed above the proposed PSQI is directly linked to the established water quality guideline standards for each characteristic itself. PSQI does not rely on the subjective judgments about parameters' importance neither on the structure of the input data. Additionally, PSQI covers admissible limits not only as single values, e.g., when properties must be lower than specific upper bound, but allows to consider an optimal range, such as for pH or alkalinity. This is an important improvement of the currently used techniques and it makes the proposed solution applicable for the assessment of other environmental media, e.g. in the case of soils favorable content of macro and micro-nutrients or physical properties are expressed as optimum ranges 81,82 .

Materials
Study area. The data was collected in the 2017-2018 years. A detailed description of the territory and the sample net is provided in Shadrin et al 83 . In a nutshell, the sample net is mostly located across the New Moscow region, Russia. According to the official statistics and research reports New Moscow is characterized by the rapid rise of both urban areas and density during the last 10 years 84,85 . The New Moscow is located close to the bottom boundary of southern taiga, in the Central European part of Russia (55 • N, 37 • E) and extends over 1480 km 2 in area. The mean annual temperature of this region is about 3-4 • C. The territory includes all of the common  www.nature.com/scientificreports/ types of land-uses, including urban fabric, forests, and green urban areas, arable lands, industrial cites. The predominant types of natural vegetation are coniferous and broad-leaved forests, while agricultural lands include pastures and arable land mostly growing feed crops and cereals 86 . The mean temperature in the coldest month of the year (January) ranges between -9.5 • C and -11.5 • C, while in the warmest month, July, mean temperatures are between +17 • C and +18.5 • C. The average annual precipitation is approximately 400-500 mm, with around two-thirds by rainfall and the rest by snow, according to recent observations from weather stations' net across the territory (available at https:// www. ncdc. noaa. gov/ cdo-web/ datat ools/ finds tation). The territory has a plain topographic relief, and the bedrock consists of glacial and fluvioglacial loams and sands with the inclusion of sandy alluvial deposits.
Dataset description. The analytical samples were collected from different sources of freshwater, namely: wells, rivers, and springs from the territories of private households. Some of the sample points included the replicated measurements, which have been taken into account at both modeling and results' interpretation stages. Overall, the dataset includes 1569 samples at 460 unique points, each sample consists of longitude, latitude, and list of chemical compounds content and properties, commonly used for water quality assessment. The normative ranges for measured properties are given according to the Russian regulation documents-SanPiN (Sanitary Rules and Norms), number 1.2.3685-21 being in force at the time of the manuscript preparation. These values were given as an example for study support and can be changed according to any other guideline source. Although being measured, Hg, Cd, Co, Pb were eliminated from further work as having very low variability across the dataset. The data points sampled too far from the main research area were removed. Finally, for further modeling, we used 1526 data vectors of 21 properties, namely Alkalinity, Ca, Cl, Cr, Cu, Fe, HCO 3 , Hardness, K, Mg, Mineralization, Mn, NH 4 , NO 2 , NO 3 , Na, Ni, PO 4 SO 4 , Zn, pH (see Table 1).

Code availability
The source code, data, and results are provided in our publicly accessible repository 67 with available interactive visualization via kepler.gl platform and step-by-step instructions.