Using isometric log-ratio in compositional data analysis for developing a groundwater pollution index

This study introduces a novel groundwater pollution index (GPI) formulated through compositional data analysis (CoDa) and robust principal component analysis (RPCA) to enhance groundwater quality assessment. Using groundwater quality monitoring data from sites impacted by the 2010–2011 foot-and-mouth disease outbreak in South Korea, CoDa uncovers critical hydrochemical differences between leachate-influenced and background groundwater. The GPI was developed by selecting key subcompositional parts (NH4+-N, Cl-, and NO3--N) using RPCA, performing the isometric log-ratio (ILR) transformation, and normalizing the results to environmental standards, thereby providing a more precise assessment of pollution. Validated against government criteria, the GPI has shown its potential as an alternative assessment tool, with its reliability confirmed by receiver operating characteristic curve analysis. This study highlights the essential role of CoDa, especially the ILR transformation, in overcoming the limitations of traditional statistical methods that often neglect the relative nature of hydrochemical data. Our results emphasize the utility of the GPI in advancing groundwater quality monitoring and management by addressing a methodological gap in the quantitative assessment of groundwater pollution.


Groundwater quality monitoring data
The burial pit sites (n = 30), from which our groundwater quality monitoring data were sourced, are regionally distributed across Korea. They were strategically selected to cover a broad geographic area and to provide a comprehensive dataset reflecting the diverse hydrogeological conditions of the region. This study was based on groundwater quality monitoring data collected from a monitoring program conducted to investigate leachate leakage from livestock carcass burial pits formed during the 2010–2011 foot-and-mouth disease (FMD) outbreak in South Korea. This monitoring program was carried out by the National Institute of Environmental Research (NIER) of South Korea on a quarterly basis throughout 2012 for 270 burial pit sites. The monitoring confirmed leachate leakage in 29.6% of the burial pits according to the government (Ministry of Environment: ME) guideline for environmental management of carcass burial pits, which sets environmental criteria for three water quality parameters: electrical conductivity (EC > 800 µS/cm), NH4+-N (> 10 mg/L), and Cl- (> 100 mg/L) 37,38. Details about the burial sites established following the FMD epidemic and the subsequent monitoring programs are described in previous studies 37,39.
For our analysis, we used only the groundwater quality monitoring data collected from 30 burial pit sites, involving multiple parameters: 3 in-situ measurements (pH, EC, ORP), 10 hydrochemical parameters (DO, BOD, COD, Total N, NH4+-N, Cl-, Ca2+, Na+, TP and PO43-), and 2 microbial parameters (TB and TC). The monitoring data include a total of 420 analytical results (samples) comprising two types of groundwater samples, representing livestock carcass leachate and nearby groundwater compositions, respectively. The leachate samples were obtained from the perforated drainpipes (leachate wells: LW) installed at the top of burial pits, while the groundwater samples were collected from monitoring wells (MW) installed 10 m downgradient from burial pits. The nonparametric Mann-Whitney U test was used to evaluate the differences in groundwater pollution indices as well as water quality parameters between the two groups (LW and MW). The water quality analyses and measurements were conducted at NIER following the standard methods for drinking water in South Korea 40; a detailed description of the methods (including QA/QC) can be found in the original publication of NIER (2012) 41.
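The group comparison described above rests on the Mann-Whitney U statistic. As a minimal sketch (not the authors' R workflow), the statistic can be computed by counting pairwise wins; the concentrations below are hypothetical values chosen only for illustration, and the function name is an assumption of this example.

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: number of (a, b) pairs with a > b.

    Ties contribute one half, following the usual convention.
    """
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical NH4+-N concentrations (mg/L); not taken from the monitoring data.
lw_nh4 = [35.0, 120.0, 8.5, 60.0]      # leachate wells
mw_nh4 = [0.1, 0.4, 2.0, 9.0, 0.05]    # monitoring wells

u_lw = mann_whitney_u(lw_nh4, mw_nh4)
u_mw = mann_whitney_u(mw_nh4, lw_nh4)
print(u_lw, u_mw)
```

The two statistics always sum to the number of pairs (here 4 × 5 = 20); a value of U close to that maximum for LW indicates that leachate concentrations are almost always the larger of each pair. A p-value would additionally require the null distribution of U, which is omitted here.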
www.nature.com/scientificreports/
Note that the groundwater quality monitoring data were selected because they clearly demonstrate the relative compositional changes between leachate-influenced and background groundwater. This distinction is crucial for validating the applicability of the ILR in developing a robust GPI within the framework of multivariate CoDa. To quantify groundwater contamination (i.e., leachate leakage from burial pits) using the proposed GPI, an assessment was conducted across the entire dataset. This involved evaluating the differences in contamination indices between the two groundwater groups (Leachate Wells, LW, and Monitoring Wells, MW) and examining the discriminatory ability of the contamination index.

Log-ratio transformation
The groundwater quality monitoring data cover compositional variables (i.e., hydrochemical parameters), except for some physicochemical parameters and total coliforms. As mentioned above, compositional data are defined as vectors of positive real numbers in which the components (or parts) carry only relative information about some whole 28. Since compositional data are constrained to a sample space called the simplex, standard statistical analyses relying on Euclidean geometry may yield spurious results 28,30,31. Consequently, compositional data analysis (CoDa) involves transforming the data from the simplex to the Euclidean space before performing statistical analyses; three types of log-ratio transformations have been suggested: the Additive Log-Ratio (ALR) and the Centered Log-Ratio (CLR) transformations 28, and more recently, the Isometric Log-Ratio (ILR) transformation 32.
The ILR transformation maps the D parts (D-dimensional variables) of the composition in the simplex (S^D) into D-1 ILR-coordinates (called balances) in the Euclidean space (R^(D-1)), allowing standard statistical analysis techniques to be applied while preserving the relative information between parts 32,34. The ILR-coordinates are defined using an orthonormal basis, which is created through a process called sequential binary partition (SBP) 30,42. This procedure divides the parts of a full composition (or subcomposition) into binary non-overlapping groups in a sequential and hierarchical manner until every group contains only a single part. Given a composition of D parts, the ILR-coordinate (z_i) between two non-overlapping groups can be defined for each of the D-1 SBP steps as follows:

$$z_i = \sqrt{\frac{r s}{r + s}}\,\ln\frac{g(c_{+})}{g(c_{-})}, \qquad i = 1, \ldots, D-1 \tag{1}$$

where g(c+) represents the geometric mean of the r variables in the numerator of the balance, and g(c-) represents the geometric mean of the s variables in the denominator. The ILR-coordinates, especially in high dimensions, are not readily interpretable due to the lack of a direct one-to-one relationship between raw and transformed parts. This necessitates expert knowledge (e.g., geochemical stoichiometry) to construct informative balances 36.

On the other hand, the CLR transformation preserves the D parts of the composition and is useful for examining the relative variation of each part with respect to the whole composition. The CLR from S^D to R^D is defined as

$$\mathrm{clr}(x) = \left[\ln\frac{x_1}{g(x)}, \ldots, \ln\frac{x_D}{g(x)}\right] \tag{2}$$

where g(x) is the geometric mean of the composition x. However, the CLR-transformed coordinates are subcompositionally incoherent, since they depend on which parts are included in the geometric mean used as the common divisor. Additionally, their covariance and correlation matrices are singular due to the inherent constraint that all coordinates sum to zero 28,32. Thus, the ILR transformation is highly recommended prior to multivariate compositional data analysis 21,22. Here, we employed the CLR-coordinates to represent the results (loadings and scores) from robust PCA (RPCA) built on ILR-transformed data and back-transformed to CLR space. This facilitates the development of more informative ILR balances as a groundwater pollution index for delineating leachate-induced groundwater pollution.
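To make the two transformations concrete, the following is a minimal sketch in plain Python of the CLR and of a single ILR balance between two groups of parts. The function names and the three-part sample are assumptions made for illustration; the study itself used the robCompositions package in R.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def clr(x):
    """Centered log-ratio: ln(x_i / g(x)) for every part of x."""
    g = geometric_mean(x)
    return [math.log(v / g) for v in x]


def balance(x, num_idx, den_idx):
    """ILR balance between two non-overlapping groups of parts.

    num_idx and den_idx are index lists for the numerator (r parts)
    and denominator (s parts) of the balance.
    """
    r, s = len(num_idx), len(den_idx)
    g_num = geometric_mean([x[i] for i in num_idx])
    g_den = geometric_mean([x[i] for i in den_idx])
    return math.sqrt(r * s / (r + s)) * math.log(g_num / g_den)


# Hypothetical three-part sample in mg/L (illustrative values only).
sample = [12.0, 45.0, 3.0]

clr_coords = clr(sample)
z = balance(sample, [0], [1, 2])

# Log-ratios carry only relative information: rescaling every part
# by the same factor leaves the balance unchanged.
z_rescaled = balance([10.0 * v for v in sample], [0], [1, 2])
print(sum(clr_coords), z, z_rescaled)
```

The CLR coordinates sum to zero (up to rounding), which is the constraint behind the singular covariance matrix mentioned above, while the balance is unaffected by a common rescaling of all parts.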

Robust principal component analysis (RPCA)
Principal Component Analysis (PCA) has been widely used to assess groundwater quality and to identify underlying processes such as groundwater contamination 43–46. Moreover, it serves as an objective method for calculating water quality indices (WQI) through weighted linear combinations by selecting and weighting important groundwater quality parameters 8. However, classical PCA is sensitive to outliers in groundwater quality data, which typically exhibit a skewed covariance structure (i.e., a non-multivariate normal distribution) 48. Robust principal component analysis (RPCA) is robust to outlying samples because it identifies and mitigates the influence of outliers, thus providing a more reliable interpretation of the underlying processes in compositional data than classical PCA 21,47.
RPCA uses the minimum covariance determinant (MCD) estimator to calculate the sample arithmetic mean vector and the sample covariance matrix used for performing PCA. The MCD is designed to search for a subset of at least h observations (more than half of the total sample size n) whose sample covariance matrix has the smallest determinant, which makes the estimate resistant to outliers (for details, see Rousseeuw and Driessen, 1999) 49. Therefore, RPCA based on MCD determines the location (mean) and scatter (covariance matrix) of data with a multivariate normal distribution, and the associated eigenvectors and eigenvalues provide PC loadings and scores that are robust to outliers.
RPCA has been effectively applied in compositional data analysis 50–52. For RPCA, the ILR transformation is preferred over the CLR because the CLR covariance matrix is singular, which results from the constant-sum constraint of its components 52. In this study, RPCA was conducted to establish the optimal SBP for a comprehensive explanation of the groundwater quality monitoring data. This involved identifying a specific subcomposition aimed at evaluating the influence of leachate from livestock carcasses on groundwater, for which the relevant ILR-coordinates were proposed. The ILR-transformed data based on an arbitrary orthonormal basis were used in RPCA. For enhanced interpretability, the results (loadings and scores) of RPCA were back-transformed to the CLR and visually depicted via a biplot.

ILR-based Groundwater pollution index (GPI)
This study introduces a novel approach for constructing a GPI through the ILR transformation of multivariate hydrochemical parameters in groundwater quality monitoring data. This method involves a series of statistical procedures, incorporating PCA and CoDa. The framework for developing the ILR-based GPI consists of three major steps: (i) selecting key subcompositional parts using PCA (robust PCA in this study), (ii) performing the ILR transformation via Sequential Binary Partitioning (SBP), and (iii) carrying out normalization according to existing environmental standards or government guidelines. Detailed explanations of these steps are as follows.
PCA aids in selecting a subcomposition of key parameters (parts) that delineate the relative compositional change indicative of groundwater pollution (i.e., the leachate impact on background groundwater quality). These selected subcompositional parts are then transformed into ILR-coordinates (balances) via the SBP. Subsequently, the first ILR-coordinate with the highest variance is chosen to be used as a univariate GPI, effectively contrasting leachate-influenced parameters against those prevalent in the background groundwater. For instance, when dealing with a subcomposition of two parts, the ILR-coordinate Z is defined as follows, according to Eq. (1) with r = s = 1:

$$Z = \sqrt{\frac{1}{2}}\,\ln\frac{C_P}{C_B} \tag{3}$$

where C_P and C_B represent the concentrations in mg/L of parameters indicative of pollution and background, respectively.
We normalized the ILR-coordinate Z proposed as a univariate GPI using the environmental criteria mentioned in Sect. 2.1. This procedure also validates the ILR-based GPI for practical purposes by comparing it with the government guidelines. The validation compares outcomes categorized into binary groups (polluted and non-polluted), determined by different cutoff values along the ILR-coordinate, with those classified according to the environmental criteria. We calculate diagnostic measures such as sensitivity (true positive rate) and specificity (true negative rate) to assess classification performance at varying cutoff points. From these results, an optimal cutoff of the ILR-based GPI, which yields the classification most similar to that of the environmental criteria, can be derived. The optimal cutoff is identified using a receiver operating characteristic (ROC) curve, which plots sensitivity against 1-specificity at various cutoff points. It is chosen at the point on the ROC curve where sensitivity and specificity are jointly highest.
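The cutoff search can be sketched as follows. The text does not name the exact selection rule, so this example assumes Youden's J statistic (sensitivity + specificity − 1), a common way to pick the point where both measures are jointly highest. The ILR values and reference labels are hypothetical.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Classify score > cutoff as polluted and compare with reference labels."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s > cutoff and y)
    fn = sum(1 for s, y in pairs if s <= cutoff and y)
    tn = sum(1 for s, y in pairs if s <= cutoff and not y)
    fp = sum(1 for s, y in pairs if s > cutoff and not y)
    return tp / (tp + fn), tn / (tn + fp)


def optimal_cutoff(scores, labels):
    """Return (cutoff, J) maximizing Youden's J over the observed scores.

    Assumes both classes are present in labels. On ties, the lowest
    cutoff that reaches the maximum is kept.
    """
    best_cutoff, best_j = None, -1.0
    for c in sorted(set(scores)):
        sens, spec = sensitivity_specificity(scores, labels, c)
        j = sens + spec - 1.0
        if j > best_j:
            best_cutoff, best_j = c, j
    return best_cutoff, best_j


# Hypothetical ILR values; True marks samples exceeding the reference criteria.
ilr_values = [-2.1, -1.5, -1.2, -0.9, -0.5, 0.3, 0.8, 1.4]
reference = [False, False, False, True, False, True, True, True]

cutoff, youden_j = optimal_cutoff(ilr_values, reference)
print(cutoff, youden_j)
```

Each candidate cutoff corresponds to one point on the ROC curve; plotting sensitivity against 1 − specificity for all candidates reproduces the curve itself.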
Finally, the ILR-coordinate Z can be scaled to a centered value Z' by subtracting the cutoff and then normalized to a range of 0 to 1 using the maximum and minimum values (Eq. 4). This results in a normalized GPI, ranging from 0 to 1, which probabilistically assesses the impact of leachate on groundwater. Here, a normalized GPI value exceeding 0.5 indicates leachate pollution in accordance with the environmental criteria.
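The exact normalization formula is not reproduced in this text, so the sketch below is only one possible reading, stated as an assumption: the centered value Z' is rescaled piecewise, so that the minimum maps to 0, the cutoff to 0.5, and the maximum to 1. The function name and input values are hypothetical.

```python
def normalize_gpi(z_values, cutoff):
    """Map ILR values to [0, 1] with the cutoff placed at 0.5.

    Assumption of this sketch: values above and below the cutoff are
    scaled separately by the maximum and minimum centered values.
    """
    centered = [z - cutoff for z in z_values]
    z_max, z_min = max(centered), min(centered)
    normalized = []
    for zc in centered:
        if zc >= 0:
            # Upper half: 0.5 at the cutoff, 1.0 at the maximum.
            normalized.append(0.5 + 0.5 * zc / z_max if z_max > 0 else 0.5)
        else:
            # Lower half: 0.0 at the minimum, approaching 0.5 at the cutoff.
            normalized.append(0.5 - 0.5 * zc / z_min)
    return normalized


# Hypothetical ILR values around a cutoff of -0.87.
gpi = normalize_gpi([-2.87, -0.87, 0.13, 1.13], cutoff=-0.87)
print(gpi)
```

Under this reading, any sample with a normalized value above 0.5 has an ILR value above the cutoff, which matches the interpretation of the 0.5 threshold given above.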
Our approach is in alignment with the principle of subcompositional coherence in CoDa 28. This principle ensures that an analysis conducted on a subcomposition is consistent with the analysis of the entire composition. This method provides a statistically reliable GPI that carries the relative information in groundwater quality monitoring data, transforming it from the simplex to the Euclidean space via the log-ratio transformation.
All statistical procedures (i.e., CoDa and RPCA) of this study were carried out using the robCompositions 53 package in R software 54.

Characteristics of groundwater quality monitoring data: absolute versus relative concentrations
Table 1 shows that most groundwater quality parameters, except for redox potential (ORP), have significantly higher (p < 0.05) median values in the leachate (LW) than in the nearby groundwater (MW). Previous studies have shown that such livestock mortality leachate contains high concentrations of inorganic and organic compounds (e.g., ammonium, alkalinity, chloride, sulfate, BOD, and COD) as a result of carcass decomposition 37,55,56. The lower ORP in the leachate (median = -66 mV) compared to the surrounding groundwater (median = 115 mV) is due to anaerobic conditions prevailing in the burial pits 44,57,58. Leakage of carcass leachate from burial pits thus increases ionic concentrations in groundwater, which correlate positively with EC and TDS but negatively with ORP.
Given the elevated ionic concentrations within the leachate wells (LW) relative to the adjacent groundwater monitoring wells (MW), the log-scaled concentrations of Cl- and NH4+-N have strong positive Spearman correlations with EC values (ρ = 0.65 and 0.61, respectively) in the total dataset (combined LW and MW) (Fig. 1). This indicates that the influence of leachate infiltration on proximate groundwater can be quantitatively diagnosed by measuring the correlation coefficients among hydrochemical parameters in the groundwater quality monitoring data. Nevertheless, as mentioned above (Sect. 2.2), the hydrochemical parameters are inherently compositional parts that carry relative information. Therefore, the correlations computed between any pair of log-transformed variables can be spurious, and log-ratio transformations such as the centered log-ratio (CLR) and isometric log-ratio (ILR) are necessary for hydrochemical parameters 28,32.
Figure 2 shows the bivariate relationships, with correlation coefficients, between the CLR-transformed values (i.e., relative to the geometric mean of all components) of Cl- and NH4+-N and log-transformed EC. A positive correlation is observed for NH4+-N (ρ = 0.56), consistent with the log-transformed data. Conversely, Cl- reveals a negative correlation (ρ = −0.11) despite a positive association in its log-transformed data. This is attributed to the higher relative (CLR) abundance of Cl- in the groundwater from monitoring wells (MW) compared to that in the leachate (LW) (right in Fig. 2). The elevated relative Cl- levels in MW result from the influence of agricultural practices, such as the use of livestock manures and fertilizers, which affect the background levels of the groundwater near the burial pits, unlike NH4+-N, which originates primarily from carcass leachate. These results demonstrate that the correlation structure in the total dataset can change when considering the relative compositions of individual hydrochemical parameters.
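The sign change described above can be reproduced with a toy example. The sketch below uses hypothetical concentrations (not the monitoring data) in which Cl- rises with EC in absolute terms but NH4+-N rises much faster, so the relative (CLR) share of Cl- falls. The Spearman helper is a plain-Python implementation written for this illustration.

```python
import math


def ranks(values):
    """Average 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


# Hypothetical samples ordered by increasing leachate influence.
ec = [200.0, 400.0, 800.0, 1600.0, 3200.0]
cl = [10.0, 15.0, 20.0, 30.0, 40.0]
nh4 = [0.1, 1.0, 10.0, 100.0, 1000.0]
no3 = [5.0, 5.0, 5.0, 5.0, 5.0]

log_cl = [math.log(v) for v in cl]
clr_cl = []
for c, n, o in zip(cl, nh4, no3):
    mean_log = (math.log(c) + math.log(n) + math.log(o)) / 3.0
    clr_cl.append(math.log(c) - mean_log)

rho_log = spearman(log_cl, ec)   # +1 for these values
rho_clr = spearman(clr_cl, ec)   # -1 for these values
print(rho_log, rho_clr)
```

In this constructed case the log-transformed Cl- is perfectly positively rank-correlated with EC, while its CLR value is perfectly negatively correlated, because the geometric mean of the composition grows faster than Cl- itself.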

Assessing the influence of leachate on groundwater quality using multivariate CoDa and RPCA
Given the multivariate compositional nature of hydrochemical data, it is necessary to employ a correlation matrix derived from log-ratio transformations to examine the interrelationships among various compositional parameters. Figure 3 shows the significant differences between the correlation matrices of log-transformed and CLR-transformed variables (excluding EC, ORP, total bacteria, and total coliform) in the total dataset. This comparison demonstrates that the type of data transformation significantly influences the outcome of correlation analysis, as previously shown (Figs. 1 and 2). The result for log-transformed data (lower section of Fig. 3) reveals that the nine hydrochemical parameters (Cl-, Ca2+, Na+, BOD, COD, Total N, NH4+-N, Total P, PO43-), predominantly concentrated in the leachate (LW), exhibit positive correlations with each other. In contrast, these parameters display negative correlations with pH and redox-sensitive parameters (DO and NO3--N), which typically decrease under anaerobic conditions.
On the other hand, the correlation matrix for CLR-transformed data (upper section of Fig. 3) explains the relative compositional relationships based on their source attributions. For instance, parameters primarily originating from leachate (BOD, COD, Total N, NH4+-N) show inverse relationships with those dominant in background groundwater (Cl-, Ca2+, Na+) as well as with redox-sensitive parameters (DO and NO3--N). It is noteworthy that the relative compositions of hydrochemical data are inherently influenced by the proportional contributions from various solute sources, such as carcass leachate and agricultural practices (e.g., livestock manures and fertilizers). Therefore, the application of CoDa (i.e., log-ratio transformations) can be more useful and relevant than using absolute concentrations (such as raw or log-transformed data) for a statistical and practical assessment of the impact of leachate leakage on groundwater quality.
In the context of multivariate CoDa, RPCA provides a more comprehensive explanation of the relative compositional changes in hydrochemical data. In this study, RPCA was applied to the ILR-transformed data, and then the loadings and scores of RPCA were back-transformed to the CLR-coordinates. From the result, the first two principal components (PC1 and PC2), accounting for 34.0 and 29.9% of the total variance, are extracted from the ILR-transformed data (Table 2 and Fig. 4). The loadings exhibit a correlation (or covariance) structure among the twelve hydrochemical parameters. Notably, PC1 has positive correlations with NH4+-N, BOD, and COD, which are predominantly enriched in carcass leachate, while it shows negative correlations with redox-sensitive parameters such as DO and NO3--N. Ions such as Cl-, Na+, and Ca2+, despite their high absolute concentrations in leachate, show only weak correlations with PC1. This is attributed to their relative abundance in the background groundwater. On the other hand, PC2 has correlations with total P and PO43-. However, both variables are redundant in the interpretation since they are immobile in groundwater due to their adsorption on soils and sediments 59. Therefore, PC1 delineates the impact of leachate on groundwater quality, showing the relative increase in ionic concentrations compared to background levels and the formation of anaerobic conditions. These results are consistent with the outcomes obtained from the correlation analysis on the CLR-transformed data. Accordingly, the robust scores along PC1 distinctly differentiate between leachate (LW) and groundwater samples (MW) in the total dataset, while also being robust to outliers (Fig. 4B). The application of RPCA identified 110 outliers, constituting 22.9% of the total samples, which predominantly include leachate samples (LW). This suggests that the score, computed as a weighted linear combination of multivariate hydrochemical parameters, serves as an effective groundwater pollution index for assessing the impact of leachate on groundwater quality. Nevertheless, it is important to note that the eigenvectors, representing the loadings as weights of hydrochemical parameters, obtained from RPCA can vary depending on the specific monitoring data used. This result demonstrates that RPCA effectively reduces the dimensionality of compositional data and elucidates the impact of leachate contamination of groundwater by estimating a covariance structure that is robust to outlier samples. Additionally, the computation of scores involves complex transformations of the observed concentrations of multiple parameters into log-ratio values. Thus, we aim to identify critical subcompositional parts that reflect the variability in RPCA scores and introduce their ILR-coordinate as a single groundwater pollution index (GPI). This index serves as a versatile tool for assessing the impact of leachate on groundwater quality.

Table 2. Isometric log-ratio (ilr) of five selected subcompositions and their corresponding binary partitions, indicating the impact of leachate on groundwater. The table also includes Spearman correlation coefficients (R) between the ilr and the PC1 score from RPCA.

Development of ILR-based groundwater pollution index (GPI)
In the context of multivariate CoDa, although the RPCA provides useful scores for evaluating the influence of leachate on groundwater quality, this study has adopted the ILR transformation to develop a more straightforward method for formulating a univariate GPI. As explained above (Sect. 2.2), the ILR transformation results in D-1 Cartesian coordinates, known as balances, based on an orthonormal basis established through the Sequential Binary Partition (SBP) of D selected components. Here, we construct the SBP based on the PC loadings expressed with CLR (Fig. 4), which inform about the important subcompositional parts and their relationships showing the leachate pollution in groundwater quality. Figure 5 illustrates the SBP for the case of a D = 12 subcomposition, partitioning the full set of hydrochemical parameters in accordance with the results of RPCA. From this partitioning, eleven (D-1) independent isometric log-ratio (ILR) coordinates have been derived according to Eq. (1).
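An SBP written as rows of +1, −1 and 0 (numerator, denominator, not involved) can be turned into an orthonormal set of log-contrasts. The sketch below uses a hypothetical three-part partition for brevity, not the D = 12 partition of Fig. 5, and the function names are assumptions of this example.

```python
import math


def sbp_to_basis(sbp):
    """Convert SBP sign rows (+1, -1, 0) into orthonormal log-contrast rows.

    For a row with r parts coded +1 and s parts coded -1, the weights are
    sqrt(s / (r (r + s))) and -sqrt(r / (s (r + s))), which reproduces the
    balance sqrt(rs / (r + s)) * ln(g(c+) / g(c-)).
    """
    basis = []
    for row in sbp:
        r = sum(1 for v in row if v == 1)
        s = sum(1 for v in row if v == -1)
        pos = math.sqrt(s / (r * (r + s)))
        neg = -math.sqrt(r / (s * (r + s)))
        basis.append([pos if v == 1 else neg if v == -1 else 0.0 for v in row])
    return basis


def ilr(x, basis):
    """ILR-coordinates of composition x: one balance per basis row."""
    logs = [math.log(v) for v in x]
    return [sum(w * l for w, l in zip(row, logs)) for row in basis]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# Hypothetical SBP for three parts: step 1 separates part 0 from parts 1
# and 2; step 2 separates part 1 from part 2.
sbp = [
    [1, -1, -1],
    [0, 1, -1],
]
basis = sbp_to_basis(sbp)
coords = ilr([12.0, 45.0, 3.0], basis)
print(coords)
```

The rows of `basis` have unit length, are mutually orthogonal, and each sums to zero, which is what makes the resulting D − 1 coordinates an isometric representation of the composition.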
Based on the SBP, we identified the second balance (labeled as ILR2 in Fig. 5 and Z1 in Table 3), which represents a binary partition excluding total P and PO43-, as a critical ILR-coordinate for evaluating the impact of leachate on groundwater. The selected ILR-coordinate (Z1) uses BOD, COD, and NH4+-N, which are mainly produced by carcass decomposition, as the numerator; the denominator involves Na+, Ca2+, H+, NO3-, DO, Cl- and NO3--N, which are relatively dominant in the background groundwater affected by agricultural activities and oxic conditions. This log-ratio effectively retains the relative information of the data, as shown in the results from RPCA, exhibiting a significant correlation (ρ = 0.56) with the first principal component score (PC1) (Table 3). Additionally, it shows a positive correlation (ρ = 0.56) with electrical conductivity (EC) and a negative correlation (ρ = −0.56) with redox potential (ORP). Consequently, this ILR-coordinate is considered a reliable ratio-based GPI for assessing the effects of leachate on groundwater quality.
We further examined different ILR-coordinates derived from subcompositions with a reduced number of parts (specifically, D = 7, 5, and 3 parameters), using the same procedure to develop more simplified versions of the GPI. These ILR-coordinates not only correlate well with PC1 but also effectively account for the variations in EC and ORP (Table 3). This result suggests that the ILR-coordinates sufficiently explain the relative information relevant to the hydrochemical processes by focusing on key parameters, rather than incorporating all measured parameters. This is because the ILR transformation satisfies the principle of subcompositional coherence of compositional data 60.
The ternary diagram in Fig. 6 shows the distribution of three subcompositional parts (NH4+-N, Cl- and NO3--N) in the Euclidean space. The first of these coordinates corresponds to Z3 in Table 3, explaining 90.1% of the total variance in the distribution. This ratio reflects the increase in NH4+-N relative to Cl- and NO3--N, and shows a significant difference (p < 0.05) between leachate (LW) and groundwater (MW) (Fig. 6, right). Consequently, the ILR-coordinates of three specifically selected parts (NH4+-N, Cl- and NO3--N) provide the most simplified and practical form of GPI while optimally maintaining the essential information of the groundwater quality monitoring data. We propose a univariate GPI to quantify the impact of leachate on groundwater, using the following ILR equation, which follows from Eq. (1) with r = 1 and s = 2:

$$Z_3 = \sqrt{\frac{2}{3}}\,\ln\frac{[\mathrm{NH_4^+\text{-}N}]}{\sqrt{[\mathrm{Cl^-}]\,[\mathrm{NO_3^-\text{-}N}]}} \tag{5}$$

The ILR-coordinate (Z3), proposed as a GPI, was compared with the assessment results of leachate impact on groundwater, as outlined by the government's environmental criteria (mentioned in Sect. 2.1). For this, data samples were categorized into binary groups based on varying ILR values, and these classifications were then compared with those designated as contaminated or uncontaminated according to the environmental criteria, measuring sensitivity and specificity. Such a comparison not only validates the GPI's potential as a viable alternative to the environmental criteria but also suggests an appropriate GPI cutoff that aligns with the criteria.

We determined the optimal cutoff for the GPI using a receiver operating characteristic (ROC) curve with an area under the curve (AUC) of 0.78, which graphically represents sensitivity (recall) versus 1-specificity (false positive rate) across various cutoff points. The optimal cutoff, determined at the point where sensitivity is maximized and 1-specificity is minimized, was identified as −0.87 (as shown in Fig. 7A). At this point, the sensitivity was 0.67, correctly identifying 67% of the samples classified as contaminated according to the environmental criteria, while the specificity was 0.88, accurately classifying 88% of uncontaminated samples (Table 3). These results support the effectiveness of the GPI in differentiating between contaminated and uncontaminated groundwater and its reliability as a tool for environmental pollution assessment. Finally, the ILR-based GPI was adjusted to center around the cutoff and normalized between 0 and 1, utilizing the maximum and minimum values according to Eq. (4). Within this normalized scale, a GPI value exceeding 0.5 is established as the threshold for identifying leachate contamination, in accordance with the government's environmental criteria. Notably, this normalized GPI revealed that 37% of the entire monitoring dataset exceeded this 0.5 threshold, suggesting significant contamination.
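For a single sample, the three-part index and its comparison against the reported cutoff of −0.87 can be sketched as below. The balance is written out from Eq. (1) with NH4+-N in the numerator and Cl- and NO3--N in the denominator; the function names and the two example samples are hypothetical.

```python
import math

CUTOFF_Z3 = -0.87  # optimal cutoff reported in the text


def gpi_z3(nh4_n, cl, no3_n):
    """Three-part ILR balance of NH4+-N against Cl- and NO3--N.

    All concentrations must be strictly positive and in the same unit
    (mg/L here); zeros or non-detects need treatment before taking logs.
    """
    return math.sqrt(2.0 / 3.0) * math.log(nh4_n / math.sqrt(cl * no3_n))


def is_leachate_impacted(nh4_n, cl, no3_n, cutoff=CUTOFF_Z3):
    """Flag a sample whose balance exceeds the cutoff."""
    return gpi_z3(nh4_n, cl, no3_n) > cutoff


# Hypothetical samples (mg/L), not taken from the monitoring data.
background = {"nh4_n": 0.05, "cl": 20.0, "no3_n": 5.0}
leachate_like = {"nh4_n": 50.0, "cl": 100.0, "no3_n": 0.1}

z_background = gpi_z3(**background)
z_leachate = gpi_z3(**leachate_like)
print(z_background, is_leachate_impacted(**background))
print(z_leachate, is_leachate_impacted(**leachate_like))
```

Because the index is a log-ratio, it responds to NH4+-N rising relative to Cl- and NO3--N rather than to any single concentration, so a sample with high Cl- alone is not flagged.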
This study utilized groundwater monitoring data from areas affected by the 2010–2011 foot-and-mouth disease outbreak in South Korea to highlight the effectiveness of CoDa in distinguishing between leachate-contaminated and uncontaminated groundwater. The GPI, developed using CoDa and RPCA, improves the accuracy and reliability of assessments by considering the relative nature of hydrochemical data, which is often overlooked by traditional statistical methods. The proposed GPI was validated against government environmental standards, achieving a sensitivity of 0.67 and a specificity of 0.88 in distinguishing between contaminated and uncontaminated groundwater. These results not only support the reliability of the GPI as an environmental pollution assessment tool but also suggest that it can play a crucial role in complementing existing environmental standards to enhance groundwater resource monitoring and management. This is vital for policy making and environmental management, contributing to the protection and sustainable management of groundwater resources. Furthermore, the methodology and results of this study offer essential groundwork for future research and policy development.

Summary and conclusion
This research introduces an innovative Groundwater Pollution Index (GPI) that employs compositional data analysis (CoDa) and robust principal component analysis (RPCA) to advance the assessment of groundwater quality. Utilizing data collected from the groundwater monitoring of sites affected by the 2010–2011 foot-and-mouth disease outbreak in South Korea, this study highlights the effectiveness of CoDa in distinguishing significant hydrochemical differences between leachate-influenced groundwater and unaffected background samples.
The GPI is developed through a process that involves selecting essential subcompositional parts, specifically NH4+-N, Cl- and NO3--N, using RPCA, conducting the isometric log-ratio (ILR) transformation to address the compositional nature of hydrochemical data, and normalizing these results in accordance with environmental standards. The validation of the GPI against established government criteria, supported by receiver operating characteristic (ROC) curve analysis with an area under the curve (AUC) of 0.78, underscores its potential as a robust alternative tool for groundwater pollution assessment. With a sensitivity of 0.67 and a specificity of 0.88, the GPI effectively distinguished between contaminated and uncontaminated groundwater samples.
A significant contribution of this study is the emphasis on the importance of CoDa, particularly the ILR transformation, in overcoming the methodological limitations (i.e., outliers and data closure) of traditional statistical methods that often overlook the relative nature of hydrochemical data. This approach enhances the accuracy and reliability of groundwater quality assessments. The proposed GPI aligns with existing environmental standards while serving as a more precise and reliable assessment tool, providing a robust framework for effective monitoring and management of groundwater resources. This is crucial for policy decision-making and environmental management, contributing to the protection and sustainable management of groundwater resources. Furthermore, the methodology and results of this study provide essential groundwork for future research and policy development. Researchers can build upon this work to conduct new studies and further refine the GPI, advancing the field of groundwater quality assessment.

Figure 1. Bivariate relationships between log-transformed concentrations of Cl- and NH4+-N ions and log-transformed electrical conductivity (EC) (left), and the comparison of these two log-scaled concentrations between monitoring wells (MW) and leachate wells (LW) (right).

Figure 2. Bivariate relationships between clr-transformed concentrations of Cl- and NH4+-N ions and log-transformed electrical conductivity (EC) (left), and the comparison of these two clr values between monitoring wells (MW) and leachate wells (LW) (right).

Figure 3. Correlation matrix of twelve hydrochemical parameters in the total dataset, showing Spearman correlation coefficients for log-transformed data (lower triangle) and for clr-transformed data (upper triangle).

Figure 4. Biplots of loadings and scores from robust principal component analysis (RPCA) of hydrochemical parameters using isometric log-ratio (ilr). The illustrated loadings and scores have been back-transformed into centered log-ratio (clr) values for interpretation.
shows the distribution of three subcompositional parts () in the Euclidean space.The first of these coordinates corresponds to the Z3 in Table

Figure 5. Diagram of sequential binary partition (SBP) for D = 12 compositional parts, used in partitioning the full set of hydrochemical parameters for transformation into 11 balances (ilr-coordinates). The figure shows values of 1 and −1, representing the compositional parts assigned as the numerator and denominator, respectively, for each balance.

Figure 6. Ternary diagram illustrating the relative compositional changes among the subcompositions of NH4+-N, NO3--N, and Cl-, with principal component 1 (PC1) highlighting the impact of leachate on groundwater (left), and comparative analysis of the isometric log-ratio values for these subcompositions between monitoring wells (MW) and leachate wells (LW) (right).

Figure 7. (A) Receiver operating characteristic (ROC) curve illustrating the classification performance of the ilr-based groundwater pollution index (GPI) in terms of sensitivity and 1-specificity, compared against environmental criteria (ME, 2011), and (B) histogram depicting the distribution of normalized ilr-based GPI values for the entire dataset of groundwater samples (n = 420), highlighting that 37% of the samples were identified as leachate-impacted using the optimal cutoff value of 0.5 (Z3 = −0.87).

Table 1. Statistical summary of groundwater quality data (n = 420) collected from leachate wells (LW) and groundwater monitoring wells (MW) in the livestock carcass burial pits (n = 30).

Table 3. Performance metrics (accuracy, sensitivity, and specificity) of the selected isometric log-ratio (ilr) based groundwater pollution index (GPI) compared with environmental criteria, suggesting the optimal cutoff value for effective groundwater pollution evaluation (positive rate).