Development of a groundwater quality index: GWQI, for the aquifers of the state of Bahia, Brazil using multivariable analyses

This work developed a groundwater quality index (GWQI) for the aquifers of the state of Bahia, Brazil, using multivariable analyses. Data from 600 wells located in the four hydrogeological domains (sedimentary, crystalline, karstic, and metasedimentary) were subjected to exploratory statistical analysis, and 22 out of 26 parameters were subjected to multivariable analysis using Statistica (Version 7.0). From the PCA, 5 factors were retained for the index, because they explain a sufficient share of the cumulative variance. The matrix of factorial loads (for 1–5 factors) indicated 9 parameters related to water quality and 4 hydrological parameters, with factor loads above ± 0.50, to be part of the hierarchical cluster analysis. The dendrogram supported the choice of the 5 parameters related to groundwater quality that compose the GWQI (hardness, total residue, sulphate, fluoride and iron). From the multivariable analyses, three parameters from a previous index (NGWQI) were not selected for the GWQI: chloride (belongs to the same hierarchical group as hardness); pH (insignificant factor load); and nitrate (significant factor load only for 6 factors, and not a regionalized variable). From the set of communality values (5 factors), the degree of relevance of each parameter was extracted. Based on these values, the relative weights (wi) of the parameters were determined. The GWQI values were calculated using a formulation similar to the WQI-NSF: a product of quality grades, each raised to a power equal to the weight of importance of the variable. Spatialization of 1369 GWQI values, with the respective colors, on the map of the state of Bahia, revealed good correlation between the groundwater quality and the index quality classification. According to the literature on water quality indexing, the GWQI developed here, using emerging technologies, is a mathematical tool classified as a specific index, as it was derived using limits for drinking water.
This new index was tailored to represent the quality of the groundwater of the four hydrogeological domains of the state of Bahia. Although it has a regionalized application, its development, using factor analysis, principal component analysis, and hierarchical cluster analysis, follows the new trend for WQI development, which uses rational rather than subjective assessment. The GWQI is a successful index due to its ability to represent the groundwater quality of the state of Bahia using a single mathematical formulation, the same five parameters, and a unique weight for each parameter.

the absence of subjective assessments, as they derived from water quality datasets and specific mathematical correlation between variables. Thus, the emerging-technologies can provide methods with global application to develop WQIs for surface waters.
Regarding the studies on groundwater quality indexing, the author of 6 found that the WQIs are mainly in the water-use-specific class, as the primary regions of focus are those facing scarcity of surface water and thus depending on the local aquifers to meet their water demands. These water-use-specific WQIs are focused on assessing the hydrogeology of the study area, mainly for drinking and irrigation purposes [55][56][57][58][59]. However, the literature also reports initiatives to communicate groundwater quality in the human-intervention, performance-assessment category. For instance, the work of 60 developed the index SEQ-Eaux Souterraines with the support of the French Ministry of Waters, based on two notions: the ability of a water to satisfy a chosen use; and the alteration of water quality due to pressures from human activities. The SEQ-index uses a large number of parameters organized in seventeen groups of alteration, associated with the uses: drinking water, industrial, energy, irrigation and animal feed. It generates sub-indices for each group of alteration, and the final value of the SEQ-index corresponds to the lowest value among the sub-indices.
On the other hand, a variety of groundwater quality indices were derived to help policy makers and stakeholders with the planning and management of groundwater resources. Many indices were derived from WQIs originally developed for surface waters. For instance, the WQI-CCME, due to its statistical formulation and flexibility of parameter selection, was adapted by many authors for groundwater quality evaluation 55,56,[61][62][63][64][65][66][67][68]. Others adapted the WQI-NSF, after identifying the most significant parameters for the groundwater quality evaluation and their degree of importance 68,69. The work of 70,71 used the mathematical formulation of the WQI-NSF to derive a groundwater quality index (NGWQI) for the state of Bahia, Brazil. The NGWQI development involved the following steps: (i) subjective assessment to choose the variables representative of the groundwater quality of the state of Bahia (hardness, chloride, fluoride, nitrate, total residue, and pH); (ii) subjective assessment to define the degree of importance of each chosen variable; (iii) development of normalized curves of concentration (c i ) versus grades (q i ), using the limits from Brazilian drinking water legislation to set the quality range from 0 (worst) to 100 (best); (iv) transformation of the physicochemical values into dimensionless subindices using the normalized curves; and (v) calculation of the single value of the NGWQI as the product of the grades (q i ), each raised to its weight (w i ), or degree of importance. The spatialization of the NGWQI in the four hydrogeological domains of the state of Bahia (sedimentary, crystalline, karstic and metasedimentary) was considered to correlate well with the groundwater quality by hydrogeologists from CERB, the governmental well drilling company.
For groundwater applications, the trend of WQI development in the emerging-technologies category has also grown. This category includes: multivariate statistics and regression models 72; multivariate statistics, probability curves and GIS 69; regression models 73; fuzzy methodology and GIS 74; artificial neural network (ANN) and multiple linear regression (MLR) 75; and entropy information theory [76][77][78][79][80][81]. These authors reached the following conclusions about the development of these WQIs: (i) the joint application of correlation analysis and multivariable linear regression helped to identify the sources and factors affecting the groundwater pollution of an urban aquifer, and the regression model derived for the groundwater quality prediction was reliable and stable; (ii) probability curves defined the critical variables, and PCA determined the principal water quality parameters and their weights, to compose the WQI; (iii) regression models allowed predictions about past, present, or future groundwater quality events at a lower cost in time and money; (iv) a hybrid fuzzy-GIS-based WQI model, using seven critical parameters, was more reliable and pragmatic for groundwater quality assessment and analysis at a larger scale; (v) the joint application of ANN and MLR models predicted precise values for a WQI, with sensitive performance for two seasons; (vi) information entropy methods avoid personal judgments about the weight of the parameters that participate in the WQI; (vii) the entropy weighted water quality index (EWQI) has been recognized as the most unbiased model for assessing drinking water quality. Based on these comments, the WQIs developed for groundwater quality applications were all tailored for local or regional situations.
However, the most important feature of using the emerging-technologies is the absence of subjective assessments, as they derive from water quality datasets and specific mathematical correlation between variables. Similarly to surface waters, the emerging-technologies can provide methods with global application to develop WQIs for groundwater resources.
The present work develops a groundwater quality index (GWQI) for the state of Bahia, Brazil, in the emerging-technologies category, using the multivariable techniques of factorial analysis (FA), principal component analysis (PCA), and hierarchical cluster analysis (HCA) for parameter selection and for determination of the degree of importance of each parameter. The method was entirely rational, independent of subjective assessment, in line with the new trend for WQI development.

Study area: state of Bahia, Brazil
Geology and hydrogeology of the state of Bahia. The study area is the whole state of Bahia, a federative unit in the Northeast Region of Brazil. The state of Bahia is located approximately between 38ºW and 46ºW of longitude and 9ºS and 17ºS of latitude, with an area of 567,295 km², being the largest northeastern state in terms of land area, and fifth in the national ranking 82. Figure 1 presents the map of the hydrogeologic domains of the State of Bahia 82, modified from 83, indicating the presence of eleven domains with the respective lithologies.
Fig. 1 shows that the state of Bahia has great geological and hydrogeological diversity. In the coastal region, east of the state (18 to 65 km wide), the sedimentary basins (Tucano, Reconcavo, and Southernmost) occur from north to south. Next, the crystalline domain (rainfall < 800 mm/year; and > 800 mm/year) emerges from north to south, plus detrital covers (shallow) in the south. In the central area occurs the karstic domain  83 delimited areas of similar hydrogeological behavior and groundwater production (Fig. 1). In the state of Bahia, the climate and precipitation have the following distribution: coastal region (humid; 1400-2600 mm/year); next strip parallel to the coast, and also the western region (humid to sub-humid; 1000-1400 mm/year); following strip parallel to the coast, and also the center, in addition to the karstic terrains in high topography (sub-humid to dry; 800-1200 mm/year); terrains of crystalline and karstic domains at the center/north (semiarid; ≤ 600 to ≤ 800 mm/year); far north (arid; 300-500 mm/year); far western region, in a strip 20-80 km wide (humid to sub-humid; 1300-1600 mm/year).
Table 1 presents a description of the geological characteristics of the hydrogeological domains of the State of Bahia, and some aquifer characteristics (groundwater productivity and quality) from 85. This section also discusses the results published by 85,86,88 to address the importance of the parameters (hardness, total residue, sulphate, and iron) selected to be part of the GWQI. The work of 85 presents the average values for (chloride, total hardness, total residue, and nitrate) in the groundwater of the state of Bahia, whose limits for drinking water 87 are, respectively, (250 mg/L; 500 mg/L; 1000 mg/L; and 10 mg/L). The work of 86 studied the parameter iron in the groundwater of the state of Bahia, based on 5583 wells drilled in the period (2003-2013). He found 978 wells (17.5%) with high iron content (> 0.3 mg/L), the limit for drinking water 87. The work of 88 studied the parameter sulphate in the groundwater of the state of Bahia using the same database as 86. Out of 2792 wells, she found 289 (10.4%) with high sulphate concentration (> 250 mg/L), the limit for drinking water 87. The predominant species with high sulphate concentration were (CaSO4 and MgSO4), with smaller quantities of (Na2SO4) and very low quantities of (K2SO4). She found that aquifer geology, and not rainfall, had the greatest influence on sulphate concentration and species. From 85, the deep sedimentary coverage presents average values for (chloride, total hardness, total residue, and nitrate) below the limits (groundwater of good quality); nevertheless, due to its shallowness or not so deep layers, it presents high vulnerability to contaminants.
In general, the sedimentary basin has a predominance of sandstone with water of good quality; however, areas with small recharge and variable flow rates have a tendency towards salinization at depth. The Tucano sedimentary basin presents some lithological aspects (layers of shale and carbonates) that favor the occurrence of groundwater with variable quality. From 85, the sedimentary basins present, in general, average values for (chloride, total hardness, total residue, and nitrate) below the limits (groundwater of good quality); only the (Sergi/Aliança formations) present chloride slightly above the limit.
The fractured crystalline aquifer presents unfavorable water circulation and thus generally has water of inferior quality. From 85, the crystalline domain presents average values above the limits for three parameters: chloride, total hardness, and total residue (groundwater of regular or poor quality); while for nitrate, the average values are below the limit, indicating no significant human impact. From 86, the mixed sedimentary/crystalline aquifer had (46.9%) of wells with high iron content, due to the presence of iron producing rocks in the crystalline portion, and larger water circulation in the sedimentary portion. The karstic domain, due to the presence of carbonates, only presents water of better quality in places where the rainfall rates are favorable. From 85, the karstic domain (> 800 mm/year) presents average values for (chloride, total hardness, total residue, and nitrate) below the limits (groundwater of good quality), clearly related to the larger rainfall; while the karstic domain (< 800 mm/year) presents average values above the limits (groundwater of regular or poor quality). From 86, the karstic domain had the smallest percentage of wells with high iron content (9.88%).
The metasedimentary domain, with free aquifers of fissural and fractured nature associated with a variety of lithological and geological characteristics, presents groundwater of regular to inferior quality. From 85, for the metasedimentary domain, only the parameters total hardness and nitrate present average values below the limits, indicating varying groundwater quality.
For nitrate, the work of 85 found an average value (10.7 mg/L) slightly above the limit established for drinking water (10 mg/L) only in the karstic domain (< 800 mm/year). Nitrate is an anthropic groundwater chemical parameter derived from fertilizers (the karstic domain has extensive agricultural activities), and from domestic wastewater (the urban area uses septic tanks and has an inadequate sewer system). The presence of nitrate in the aquifer of karstic terrains is also favored by the presence of caves and dolines.
Geostatistics applied to parameters of groundwater quality of the state of Bahia. For the parameters chloride, total hardness, total residue, and nitrate in the hydrogeological domains of the state of Bahia, the work of 85 developed semivariograms, a geostatistical tool to investigate how strongly a variable is regionalized, which characterizes a natural phenomenon 89,90. In the semivariogram function, the parameter (a) represents the maximum distance over which the values of a variable remain spatially correlated.
A regionalized variable is indicated by a spatial correlation structure, or a function [Z(x)] for each point (x) in the n-dimensional space (R n ), presenting two characteristics: randomness, or erratic variations; and structure, or the global aspect of the regionalized phenomenon. To study the spatial and temporal variability of a given property, geostatistics may assist in identifying the most probable spatial patterns of a parameter distribution 91,92. The literature presents a variety of geostatistical tools that allow estimating the probability of occurrence of a given event, in places not investigated, from information obtained elsewhere 93,94. When samples are collected in the field, it is necessary to establish the spatial dependence with the appropriate tool before interpolating between measured locations to build isoline maps. The semivariogram indicates the most appropriate spatial dependency function of the variable under study 89. Once the semivariogram is known and the spatial dependence is confirmed, values can be interpolated at any position in the field of study; this interpolation method is called Kriging 93,94.
From the work of 85, the variables (chloride, total hardness, and total residue) present the parameter (a) with values (204.3; 236.9; and 170.7 km), respectively, indicating that these are regionalized variables. For nitrate, the parameter (a = 4.95 km) is a relatively small distance, beyond which the nitrate values no longer correlate; thus, nitrate is not a regionalized variable. The spatialization of nitrate values in the groundwater of the state of Bahia by 85 indicated high nitrate concentrations in the most vulnerable areas of the karstic and crystalline aquifers, due to three main factors: shallow aquifers; karstic and fractured structures; and vectors of pollution (irrigated agriculture and domestic wastewater effluents).
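As an illustration of how a semivariogram is estimated from well data, the following is a minimal sketch using the classical estimator, in which the semivariance at lag h is the sum of squared differences between pairs of values separated by roughly h, divided by twice the number of pairs. The function name, the well coordinates and the chloride values are hypothetical and are not taken from reference 85; the range (a) would be read as the lag beyond which the semivariance stops increasing.

```python
import math
from collections import defaultdict

def empirical_semivariogram(points, values, lag_width, n_lags):
    """Classical estimator: gamma(h) = sum((z_i - z_j)^2) / (2 * N(h)).

    Returns {lag bin center: semivariance} for the bins that contain pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            h = math.dist(points[i], points[j])   # separation distance
            k = int(h // lag_width)               # index of the lag bin
            if k < n_lags:
                sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    return {(k + 0.5) * lag_width: sums[k] / (2 * counts[k])
            for k in sorted(counts)}

# Hypothetical well coordinates (km) and chloride concentrations (mg/L).
wells_km = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (30.0, 40.0)]
chloride = [100.0, 110.0, 120.0, 400.0]

gamma = empirical_semivariogram(wells_km, chloride, lag_width=10.0, n_lags=6)
for lag, semivariance in gamma.items():
    print(f"lag ~{lag:5.1f} km: gamma = {semivariance:.1f}")
```

In this toy data set the nearby wells differ little and the distant well differs strongly, so the semivariance grows with the lag, which is the behavior expected of a regionalized variable.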

Material and methods
Selection of wells and groundwater samples for statistical analysis. The database from the state of Bahia well drilling company, CERB (Water Resources and Environmental Engineering Company) 95, provided a comprehensive amount of data for the hydrogeological domains of the State of Bahia. The physicochemical analyses were carried out at LABDEA, the laboratory of the Environment Engineering Department, Polytechnic School of the Federal University of Bahia (UFBA).
A total of 600 out of 1969 wells were used to apply the statistical analysis and develop the groundwater quality index (GWQI). The remaining 1369 wells were used to apply the GWQI and to test the adequacy of the model to describe the groundwater quality of the state of Bahia. Table 2 presents, for both sets (600 and 1369 wells), the statistics for the number of wells and municipalities involved, per hydrogeological domain, considered here as criteria to guarantee the sample randomness. From Table 2, the percentages verified in the set (600 wells) are not exactly the same as those in the set (1369 wells); however, they are close enough to guarantee the similarity. For instance, 78.2% of the municipalities in the set with 1369 wells (335) are also present in the set of 600 wells (262), indicating good areal distribution of the wells. Classifying per hydrogeological domain, from 66.7 to 87.1% of the municipalities in the set (1369 wells) are present in the set of (600 wells), indicating good hydrogeological representativeness. Thus, the sample of 600 wells can adequately represent the total.
The data bank with the 600 wells was submitted as Supplementary Material. The data bank with 1369 wells was also submitted, including the information necessary to calculate the GWQI and the previous index NGWQI, for comparison. These two spreadsheets present summary tables and statistical results discussed in the paper.
Multivariable statistical methods. Multivariable analyses are widely applied to environmental data, seeking to identify the significant parameters from a large data set of multiple variables 22,[96][97][98][99][100][101][102]. To identify the factors responsible for the groundwater pollution in a shallow urban aquifer of Yan'an City, in China, the work of 72 used the methods of principal component analysis (PCA), hierarchical cluster analysis (HCA), and multivariable linear regression (MLR) to search for relationships between the groundwater quality parameters and to generate a regression model. Also in China, 103 used multivariate analysis to understand the hydrogeochemical processes occurring in the water of the Guohua phosphorite mine. In India, 104 used these techniques to elucidate aspects of the groundwater geochemistry and drinking water suitability in the Kudal region. In Brazil, in the state of Bahia, 105 applied multivariable analysis for groundwater quality evaluation in the central-southern portion of the state, while 106 used it to classify the groundwater quality in the Salitre river watershed. In the state of Ceará, 107 used it to explain the processes responsible for the groundwater quality in the city of Fortaleza, while 108 searched for similarity among hydrochemical variables in the Salgado river watershed.
The multivariable methods applied in this work were factorial analysis (FA), principal component analysis (PCA) and hierarchical cluster analysis (HCA). The factorial analysis was used to define the structure of the correlations between variables 109. The FA calculates the correlation matrix between variables, extracts the initial factors and rotates the matrix 109. The correlation matrix indicates the similarities and differences used in the cluster analysis 110.
The method of principal component analysis (PCA) helps to extract from the correlation matrix the factors necessary to explain the covariance structure through linear combinations of the original variables 111,112. The PCA reduces the total number of variables to a smaller set of statistical variables, while preserving the variability with a minimal loss of information. Each factorial load represents the degree of contribution of the variable to the formation of the factor. The variables with the highest factorial loads are considered of greater importance and should weigh more on the factor label 109. Through the communalities, the PCA also helps to detect how much of the variance of each parameter is explained by the factors 113. The normalized Varimax rotation, an orthogonal rotation of the factors, helps to minimize the number of variables with high loads on different factors.
The hierarchical cluster analysis (HCA) aims to produce a hierarchical classification of the variables, necessary to detect the most pertinent properties to be included in the index. The HCA builds a tree diagram in which the most similar properties in the study are placed on branches that are close together 109. The clustering was performed using the method of 114, which creates a small number of clusters with relatively more properties. The cluster analysis defines the similarities and dissimilarities between variables through a dendrogram. The key to interpreting a dendrogram is to look at the point at which any given pair of properties joins together in the tree diagram. The pairs that join together sooner are more similar to each other than those that join together later. In the present work, the HCA helped to detect the most pertinent properties to be included in the groundwater quality index (GWQI).
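To make the reading of a dendrogram concrete, the following is a minimal sketch of agglomerative clustering that records the order in which variables join. It uses average linkage purely for brevity, which is an assumption and not the clustering method of reference 114 used in this work; the function name and the dissimilarity values between variables are hypothetical.

```python
def agglomerate(labels, dist):
    """Average-linkage agglomerative clustering of variables.

    dist maps frozenset({a, b}) to the dissimilarity between labels a and b.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is the information a dendrogram displays.
    """
    clusters = [(name,) for name in labels]

    def linkage(c1, c2):
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    history = []
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge it.
        d, i, j = min((linkage(c1, c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters) if i < j)
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

# Hypothetical dissimilarities (for example, 1 - |correlation|).
names = ["hardness", "chloride", "fluoride", "iron"]
d = {
    frozenset(("hardness", "chloride")): 0.1,
    frozenset(("hardness", "fluoride")): 0.8,
    frozenset(("hardness", "iron")): 0.9,
    frozenset(("chloride", "fluoride")): 0.7,
    frozenset(("chloride", "iron")): 0.9,
    frozenset(("fluoride", "iron")): 0.6,
}

history = agglomerate(names, d)
for a, b, height in history:
    print(f"{a} + {b} joined at {height:.3f}")
```

With these illustrative values, hardness and chloride join first, so only one of the two would need to be carried into an index, while fluoride and iron join late and remain largely independent.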

Results and discussion
Application of the multivariable analysis to develop the GWQI. The sample of 600 wells was large enough to apply the multivariable analysis, according to the simplified approach from 109. For these authors, the number of cases for the factorial analysis must be at least 5 times the number of measured variables. The number of measured variables in the CERB database was 26, so the minimum is 5 × 26 = 130 cases, which the 600 wells comfortably exceed. In addition, the sample of 600 wells was representative of the 1369 wells used to test the index and was a random sample, as demonstrated in topic 3.1.
The results from the exploratory analysis are in Table 3. It involves the descriptive statistics (minimum, maximum, and average; the lower and upper quartiles and the median; standard deviation, standard error and confidence interval), calculated with the software Statistica, version 7.0 115. Of the 26 variables in the CERB data bank, four were excluded as not representative or as presenting nonconformities: sodium (9 valid samples), potassium (7 valid samples), ammoniacal nitrogen (4 valid samples), and acidity (not representative). Table 3 presents only the 22 variables that are the input for the multivariable analysis.
In Table 3, eleven variables present values equal to zero (0.0) as their minimum. These values resulted from the substitution of the laboratory expression (below detection limit) by (zeros). These "not measured data" receive, in the literature, the designation of "censored data". The authors of 116 discuss four different procedures for handling censored data: substitution, parametric methods, robust methods and non-parametric methods, all of them presenting advantages and limitations. They state that the simplest method to replace the undetected values is to use a constant value at or below the detection limit. However, any value between zero and the detection limit can lead to deviations in the descriptive statistics: zeros tend to produce underestimated averages, and the detection limit tends to produce overestimated averages. In Table 3, zeros were substituted for the censored data. The impact of this choice was evaluated by calculating the averages for both extremes (zero and detection limit). The spreadsheet for 600 wells (Supplementary Material) presents the averages and standard deviations for the parameters iron, fluoride and sulphate (those with the largest number of zeros), showing a small impact. Consequently, in this work, the substitution by zeros has no negative consequences.
The multivariable analysis developed in this work applied the methods of factorial analysis (FA), principal component analysis (PCA), and hierarchical cluster analysis (HCA), using Statistica, version 7.0 115. To identify the optimal number of factors to participate in the GWQI, Fig. 2 shows the criterion of the latent root. As recommended by 109, only factors with latent roots or eigenvalues greater than one are considered significant. Figure 2 shows that the limiting value is 7 factors.
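The sensitivity check on the censored data described above (averages computed with zeros and with the detection limit as substitutes) can be sketched as follows. The function name, the iron concentrations and the detection limit are hypothetical and are not taken from the Supplementary Material.

```python
def mean_with_substitution(values, substitute):
    """Mean after replacing censored entries (None) with `substitute`."""
    filled = [substitute if v is None else v for v in values]
    return sum(filled) / len(filled)

# Hypothetical iron results in mg/L; None marks "below detection limit".
iron = [0.5, None, 0.25, None, 0.75]
detection_limit = 0.05  # hypothetical value, mg/L

mean_zero = mean_with_substitution(iron, 0.0)             # lower extreme
mean_dl = mean_with_substitution(iron, detection_limit)   # upper extreme

print(f"mean with zeros:           {mean_zero:.3f} mg/L")
print(f"mean with detection limit: {mean_dl:.3f} mg/L")
print(f"spread between extremes:   {mean_dl - mean_zero:.3f} mg/L")
```

The true mean of the censored values lies between the two extremes, so a small spread suggests that the choice of substitute has little effect on the descriptive statistics.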
Another procedure to decide how many factors participate in the GWQI is the criterion adopted by 118, which is to maintain a minimum explanation of 60% of the cumulative variance. Table 4 presents the eigenvalues from the principal component analysis (PCA), the percentage of variance explained by each component, and the cumulative variance. The cumulative variance for five (5) factors, which is equal to 63.91%, satisfies the recommendation, and was adopted in the present work.
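The two retention criteria discussed here, the latent root (eigenvalues greater than one) and the minimum cumulative variance of 60%, can be compared with a minimal sketch. The function names and the eigenvalues below are hypothetical and are not the values of Table 4.

```python
def kaiser_count(eigenvalues):
    """Number of factors with eigenvalue (latent root) greater than one."""
    return sum(1 for ev in eigenvalues if ev > 1.0)

def cumulative_variance_count(eigenvalues, threshold=0.60):
    """Smallest number of factors whose cumulative variance share
    reaches the threshold."""
    total = sum(eigenvalues)
    running = 0.0
    for n, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += ev
        if running / total >= threshold:
            return n
    return len(eigenvalues)

# Hypothetical eigenvalues of a 6-variable correlation matrix (sum = 6).
eigs = [2.4, 1.5, 1.1, 0.5, 0.3, 0.2]

n_kaiser = kaiser_count(eigs)
n_cumvar = cumulative_variance_count(eigs, threshold=0.60)
print(n_kaiser, n_cumvar)
```

As in this work, where the latent root criterion allows up to 7 factors and the cumulative variance criterion is met with 5, the two rules need not agree, and the cumulative variance rule can be the more parsimonious one.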
Table 5 shows the matrix of factorial loads, after the normalized Varimax rotation performed on the factorial axes. The factorial load is the correlation of the variable with the respective factor. If the load is positive, the variable has a positive correlation with the factor; if it is negative, the correlation is negative, that is, the variable varies in the direction opposite to that of the construct. Table 5 shows both positive and negative loads.
The recommendation from 109 is that factor loads with values above ± 0.50 are of practical significance; however, this work adopted a factor load higher than the minimum recommended value. For instance, in (Factor 4), the parameter iron, with factor load (0.613), had the lowest value considered significant in this work.
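The screening of parameters by factor load can be sketched as a simple filter on the matrix of loads. In the example, only the load of iron on Factor 4 (0.613) comes from the text; the function name and all other loads, including those for pH, are hypothetical placeholders and not the values of Table 5.

```python
def significant_parameters(loadings, threshold=0.50):
    """Keep parameters whose largest absolute load exceeds the threshold.

    loadings maps parameter -> list of loads on factor 1, factor 2, ...
    Returns {parameter: (factor number, load)} for the dominant factor.
    """
    selected = {}
    for name, loads in loadings.items():
        idx = max(range(len(loads)), key=lambda k: abs(loads[k]))
        if abs(loads[idx]) > threshold:
            selected[name] = (idx + 1, loads[idx])
    return selected

# Loads on factors 1 to 5; all values hypothetical except iron on Factor 4.
loadings = {
    "iron": [0.05, 0.10, -0.02, 0.613, 0.08],
    "pH": [0.20, -0.31, 0.12, 0.05, 0.27],
}

selected = significant_parameters(loadings)
print(selected)
```

The absolute value is used because a strongly negative load is as significant as a strongly positive one; only the direction of the relationship with the factor differs.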
The application of the principal component analysis (PCA) helped to evaluate the level of explanation of each variable relevant to the analysis. Figures 3, 4, 5 show the graphical representations of the factorial planes: Fig. 3 (Factor 1 × Factor 2), Fig. 4 (Factor 3 × Factor 4), and Fig. 5 (Factor 4 × Factor 5).
In Fig. 5 (Factor 4 × Factor 5), (Factor 4) explains 7.67% of the total variability of the data, and (Factor 5) explains 6.01%, as shown in Table 4. In (Factor 5), the parameter flow rate, with factor load (− 0.792), is the most significant. Consequently, the number of parameters to be involved in the hierarchical cluster analysis is thirteen (13), of which nine (9) are related to water quality and four (4) are hydraulic parameters. However, the hydraulic parameters, not being related to water quality, are not considered to compose the GWQI. Figure 6 presents the dendrogram from the hierarchical cluster analysis (HCA).
The dendrogram shows the formation of 3 groups of parameters with high internal similarity: "hardness x chloride", "total residue x conductivity", and "calcium x magnesium and sulphate". This work chose only five (5) relevant variables, consistent with the five factors that account for 63.91% of the total variance, satisfying the recommendation from 118. The choices were: hardness (instead of chloride, as they belong to the same group); total residue (instead of conductivity, as total residue is a chemical parameter); and sulphate (instead of calcium or magnesium, as both variables are present in hardness). In addition, fluoride and iron, which are independent of each other, were considered. Thus, the variables included in the GWQI to express the groundwater quality of the state of Bahia are: hardness, total residue, sulphate, fluoride and iron.
The next step in the GWQI formulation is to define the degree of relevance of each parameter, in order to establish the relative weight (w i ) required by the GWQI model. The starting point was to examine the communality values calculated after the normalized Varimax rotation, which represent the amount of variance of each variable explained by the factorial solution. Table 6 presents the communality values (from 1 to 6 factors).
The largest communality value in the column (5 factors) is that of hardness (0.972), which therefore receives the greatest relative weight (w i ). The others are: total residue (0.962), sulphate (0.579), fluoride (0.511), and iron (0.444). Table 7 then demonstrates the procedure to obtain the weights (w i ), based on the communality values for the five parameters (hardness, total residue, sulphate, fluoride and iron).
Using the communality values and the procedure defined in this work, the relative weight (w i ) for each parameter is: hardness (0.28), total residue (0.27), sulphate (0.17), fluoride (0.15), and iron (0.13). The five weights sum to one (1.00). Thus, the multivariable analysis helped to define the five parameters that represent the groundwater quality of the state of Bahia, and the weight of importance of each parameter (w i ), independent of subjective assessments. The next step is to transform the chemical concentration (c i ) of each variable into a dimensionless grade (q i ), to calculate the GWQI value for each water sample.
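One plausible reading of the weighting procedure is a normalization of each communality by the sum of the five communalities. The sketch below makes that assumption explicit; it is not confirmed as the exact procedure of Table 7, and the variable names are illustrative. The communality values are those reported for the 5-factor solution.

```python
# Communality values for the 5-factor solution, as reported in the text.
communality = {
    "hardness": 0.972,
    "total residue": 0.962,
    "sulphate": 0.579,
    "fluoride": 0.511,
    "iron": 0.444,
}

# Assumption: the relative weight is the communality divided by the sum.
total = sum(communality.values())
weights = {name: value / total for name, value in communality.items()}
rounded = {name: round(w, 2) for name, w in weights.items()}

for name in communality:
    print(f"{name:14s} w = {weights[name]:.4f} (rounded {rounded[name]:.2f})")
print(f"sum of unrounded weights = {sum(weights.values()):.2f}")
```

Under this assumption the unrounded weights sum to one and match the reported values for hardness, sulphate, fluoride and iron. Total residue comes out at about 0.277, which rounds to 0.28 rather than the reported 0.27, so the published procedure presumably includes an adjustment that makes the two-decimal weights add up to exactly 1.00.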
Nonlinear fit to transform dimensional groundwater quality parameters into dimensionless subindices. It was necessary to develop empirical curves, with chemical concentrations on the abscissa and grades (from 0.0 to 100.0) on the ordinate. The grades were defined using the limits for drinking water from the Resolution 2914/2011 87. Fig. 7a-e shows the curves (concentration versus grade) for the parameters (hardness, total residue, sulphate, fluoride, and iron), and the mathematical models derived using the nonlinear curve fitting from the statistical package Statgraphics Centurion XVI 119.
Table 8 presents the nonlinear fit for the five parameters (hardness, total residue, sulphate, fluoride, and iron), the respective fitting constants, the validity intervals, and the respective coefficients of determination R².
Mathematical formulation for the groundwater quality index. The mathematical formulation of the GWQI is similar to that of the WQI-NSF: a product of grades (q i ), each raised to a power (w i ) equal to the degree of importance of the parameter in the water quality (Eq. 1).

$$\mathrm{GWQI} = \prod_{i=1}^{n} q_i^{w_i} \qquad (1)$$

To investigate whether the GWQI values for the sample of 1369 wells (Supplementary Material) are affected by the characteristics of the sample of 600 wells (Supplementary Material) used to develop the GWQI, the number of wells, per hydrogeological domain, in which the concentrations of the parameters (hardness, total residue, sulphate, fluoride and iron) are below and above the limits for drinking water 87 was calculated for the latter sample. The calculations presented in the spreadsheet (600 wells) indicated an average percentage of 70.5% for the set (concentrations below the limits), and an average percentage of 29.5% for the set (concentrations above the limits). Based on these results, around 70.5% of grades (GOOD and GREAT) and around 29.5% of grades (BAD and POOR) are expected for the sample of 600 wells. These results are quite different from those of the sample (1369 wells), with 69.5% (BAD and POOR) and 30.1% (GOOD and GREAT). The difference between the samples indicates that the calculation of the GWQI for the 1369 wells was not biased, and that the multivariate process was not flawed.
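The product formulation described in this section can be sketched in a few lines. The weights are those reported in this work; the function name and the grades (q i ) in the example are hypothetical, since the fitted concentration-to-grade curves of Table 8 are not reproduced here.

```python
import math

# Relative weights (w_i) reported in this work.
WEIGHTS = {
    "hardness": 0.28,
    "total residue": 0.27,
    "sulphate": 0.17,
    "fluoride": 0.15,
    "iron": 0.13,
}

def gwqi(grades, weights=WEIGHTS):
    """Weighted geometric product: prod(q_i ** w_i), with q_i in [0, 100]."""
    return math.prod(grades[name] ** w for name, w in weights.items())

# Hypothetical grades for two water samples.
uniform = {name: 80.0 for name in WEIGHTS}
one_failure = dict(uniform, iron=0.0)

print(f"all grades 80: GWQI = {gwqi(uniform):.2f}")
print(f"iron grade 0:  GWQI = {gwqi(one_failure):.2f}")
```

Because the weights sum to one, a sample with the same grade on every parameter receives that grade as its index. The product form also means that a single grade of zero drives the whole index to zero, so one severely failing parameter cannot be masked by good grades on the others.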
To visualize how the GWQI values and the respective grades correlate with the characteristics of the groundwater sample, Table 9 shows, for ten wells located in the crystalline and karstic hydrogeological domains, the GWQI values and grades together with the concentrations of the five parameters (hardness, total residue, sulphate, fluoride and iron). The data were taken from the set of 1369 wells (Supplementary Material).
The data in Table 9 show GWQI values from 4.27 to 87.52 that correlate very well with the parameter concentrations: (i) if parameters have concentrations above the limits, the grades are BAD or POOR; (ii) if concentrations are close to the limits, the grade is REGULAR; and (iii) if concentrations are below the limits, the grades are GOOD or GREAT.
Finally, to compare the groundwater quality evaluation given by the new index (GWQI) with that of the previous index NGWQI (Oliveira et al. 2007) 71 , the number of similar and dissimilar results obtained with the two indices was examined. These results are presented in the spreadsheet for 1369 wells (Supplementary Material).
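The link between concentrations and the drinking-water limits can be checked directly. The sketch below uses the limits quoted in the caption of Table 9 (Brazil 2011); the two sample wells and the function names are hypothetical.

```python
# Drinking-water limits in mg/L, as quoted in the caption of Table 9 (Brazil 2011).
LIMITS_MG_L = {"hardness": 500.0, "total_residue": 1000.0, "sulphate": 250.0,
               "fluoride": 1.5, "iron": 0.3}


def exceedances(sample):
    """Return the names of the parameters whose concentration is above its limit."""
    return sorted(name for name, limit in LIMITS_MG_L.items() if sample[name] > limit)


def fraction_within_limits(samples):
    """Share of all (well, parameter) concentrations at or below the limits."""
    total = len(samples) * len(LIMITS_MG_L)
    above = sum(len(exceedances(s)) for s in samples)
    return (total - above) / total


# Hypothetical wells, for illustration only (mg/L).
well_a = {"hardness": 120.0, "total_residue": 450.0, "sulphate": 30.0,
          "fluoride": 0.4, "iron": 0.1}
well_b = {"hardness": 820.0, "total_residue": 2300.0, "sulphate": 310.0,
          "fluoride": 0.9, "iron": 1.2}

flags_a = exceedances(well_a)
flags_b = exceedances(well_b)
share_ok = fraction_within_limits([well_a, well_b])
```

A well such as `well_a`, with every concentration below its limit, would be expected to fall in the upper grades, while `well_b` would be expected to fall in the lower ones.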
Table 7. Description of how to obtain the relative weight (w i ) of each parameter. Column headings: Parameters; Commonality values (from Table 9).

Examining the similarity between the grades, it was found that the grades GOOD and GREAT by NGWQI coincide with the grades GOOD and GREAT by GWQI in 44.5% and 53.7% of the cases, respectively, which means that around half of the wells had a similar groundwater quality evaluation by the two indexes. Significant correspondence was verified only for the lower grades: 100% correspondence occurs between POOR by NGWQI and BAD + POOR by GWQI, and 94.2% correspondence occurs between REGULAR by NGWQI and BAD + POOR by GWQI. The explanation for these results is that the GWQI and the NGWQI have only three parameters in common (hardness, total residue, and fluoride). Using the multivariable techniques, sulphate and iron were included in the GWQI, while chloride, nitrate, and pH were discarded. Chloride, though it has a significant factor load, belongs to the same hierarchical group as hardness; pH has no significant factor load; and nitrate, significant only in Factor 6, is not a regionalized variable.
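Correspondence figures of this kind can be obtained by cross-tabulating the two classifications well by well. The sketch below is one way to do it; the grade pairs are invented for illustration and the function name is hypothetical.

```python
def correspondence(pairs, ngwqi_grade, gwqi_grades):
    """Percentage of wells with a given NGWQI grade whose GWQI grade is in gwqi_grades."""
    selected = [g for n, g in pairs if n == ngwqi_grade]
    if not selected:
        raise ValueError(f"no wells with NGWQI grade {ngwqi_grade!r}")
    matches = sum(1 for g in selected if g in gwqi_grades)
    return 100.0 * matches / len(selected)


# Hypothetical (NGWQI grade, GWQI grade) pairs, one per well.
pairs = [
    ("GOOD", "GOOD"), ("GOOD", "REGULAR"),
    ("GREAT", "GREAT"), ("GREAT", "GOOD"),
    ("POOR", "BAD"), ("POOR", "POOR"),
    ("REGULAR", "POOR"), ("REGULAR", "BAD"),
]

good_match = correspondence(pairs, "GOOD", {"GOOD"})
poor_match = correspondence(pairs, "POOR", {"BAD", "POOR"})
```

Passing a set of GWQI grades makes it easy to pool classes, as in the BAD + POOR comparison.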
The superiority of the GWQI lies in the analytical methodology used for its development, rather than in a subjective assessment based on experts' opinion. The multivariable analysis allowed the unequivocal inclusion in the index of the most significant parameters for qualifying the groundwater of the state of Bahia, and also indicated the degree of importance, or weight, of each parameter.
Fig. 8 shows the spatial distribution of colored dots on the map of the state of Bahia, corresponding to the GWQI grades for the set of 1369 wells.
Table 10 summarizes the relation between the GWQI colors (quality indicators), the characteristics of the hydrogeologic domains, and the groundwater quality, associated with the map in Fig. 8.
The summary in Table 10 reveals good agreement between the groundwater quality and the water quality classification using the GWQI.

Conclusions
This work aimed to develop a groundwater quality index (GWQI) using multivariable analysis techniques. The goal was to improve on the performance of a previous index (NGWQI), which the research group had developed using a subjective assessment based on the opinion of experts, namely hydrogeologists from CERB, the well drilling company of the state of Bahia.
The major steps of the GWQI development, i.e. the selection of the parameters and the determination of their respective weights, were fully achieved with the techniques of factorial analysis, principal component analysis (PCA), and hierarchical cluster analysis (HCA). The PCA helped to define the five (5) factors (or variables) to participate in the GWQI, which explained 63.91% of the cumulative variance. The matrix of factorial loads, after the normalized Varimax rotation, indicated the nine (9) water quality parameters to enter the HCA, and the dendrogram helped to select the five parameters to participate in the GWQI (hardness, total residue, sulphate, fluoride and iron). From the set of communality values, the degree of relevance of each parameter was identified and the relative weight (w i ) of each parameter was determined. Finally, using nonlinear regression, the normalized curves of concentration versus grade made it possible to generate the grade (q i ) for each parameter concentration. The multiplicative formula, in which each dimensionless subindex (q i ) is raised to a power (w i ), the weight of importance of the variable, then made it possible to calculate the GWQI values.
A comparison between the groundwater quality evaluations given by the new index (GWQI) and the previous index (NGWQI) indicated that around half of the wells graded GOOD or GREAT by the NGWQI received the same grades from the GWQI, which means that the two indexes do not produce exactly the same classifications. The reason is that the two indexes have only three parameters in common (hardness, total residue, and fluoride). The multivariable techniques added sulphate and iron to the GWQI and removed chloride, nitrate, and pH, which were part of the previous NGWQI.
The use of multivariable techniques to develop the GWQI is advantageous, as the multivariable analysis allowed the unequivocal selection of the most significant parameters to represent the groundwater quality, and indicated the degree of importance of each parameter. The new index, GWQI, can represent the groundwater quality of the state of Bahia using a single mathematical formulation, with the same five parameters, each raised to its own weight.

Table 9. Ten values of the GWQI calculated with Eq. (1), and grades from the GWQI and the previously derived NGWQI. Drinking water standards (Brazil 2011): hardness = 500 mg/L; total residue = 1000 mg/L; sulphate = 250 mg/L; fluoride = 1.5 mg/L; iron = 0.3 mg/L.

Data availability
The spreadsheet with the 600 wells used to develop the multivariable analyses, to define the choice of parameters to participate in the GWQI, and to determine the degree of relevance of each parameter was submitted to the journal as Supplementary Material. The spreadsheet with the 1369 wells used to test the formulation of the GWQI in the state of Bahia was also submitted.