Introduction

With the development of economy and society, the demand for mineral resources is steadily escalating. Mineral resources serve as the indispensable material foundation for human production activities1,2,3. Over the years, the development and utilization of mineral resources have necessitated a shift in mining focus, transitioning coal mines towards the extraction of intricate refractory mining bodies, such as deep orebody, broken soft orebody, alpine area orebody and low-grade orebody, and “three lower and one upper” ore bodies4. As mining intensity and depth increase, the extraction of mineral resources within complex geological structures becomes more challenging, giving rise to a surge in engineering predicaments. Among these challenges, mine water disasters emerge as a prominent threat to mining operations. Hence, the timely and precise identification of water source categories, constitutes essential prerequisites for averting water-related disasters and establishing a scientific foundation for swift rescue and management endeavors5,6.

Water chemistry data plays a crucial role in understanding the fundamental characteristics of aquifers and is vital for discriminating water sources7. Qualitative and quantitative methods are commonly employed to analyze water chemistry information for this purpose. Qualitative analysis, combined with water level dynamics, provides a rough determination of the syncline level. Piper's trilinear water chemistry analysis, on the other hand, is a convenient and visual tool for water quality classification and ion distribution8. The modified D-Piper trilinear diagram provides a solution for the challenge of visualizing ion distribution in large data sets9, leading to improved visualization and interpretation with an increase in data points. In addition, it is crucial to consider physicochemical information such as isotopes and radioactive elements in water bodies to reflect the essential characteristics and historical evolution of hydrogeology. The hydrogeochemical distribution, recharge sources, indicator tests, influencing factors, and evolutionary laws are analyzed based on conventional water chemistry, trace elements, and isotopes of the aquifer10. Gibbs’ semi-qualitative model11 is employed to analyze the hydration types of surface water and shallow groundwater, providing insights into the controlling factors, formation mechanisms, and recharge sources of isotopes in various aquifers. This analysis reveals the distinct weathering and hydration characteristics of different water bodies. However, qualitative methods alone face limitations in similar aquifers due to the ambiguous relationship between indicators, overlapping water quality characteristics, and unclear distribution boundaries12. To overcome these limitations, quantitative analysis13 is utilized to uncover the inherent laws of water chemistry data, establish mathematical models for determining water source types, elucidate the close connection between water quality indicators and determination criteria, and minimize the errors associated with qualitative analysis methods.Fisher function discrimination of water source locations based on fuzzy clustering and factor analysis14,15 and Bayes classification of water sources16,17 are employed to determine the water sources of sudden water in the mine area, with improved accuracy of discrimination. Groundwater is subject to multiple factors coupling due to the variability of mine geological structure, the complexity of hydrogeological characteristics, and the diversity of mining conditions, resulting in fuzzy connections and complex nonlinear relationships between water quality indicators and discriminatory criteria. However, model studies for index simplification through data dimensionality reduction are limited, and the redundancy of information between water chemical components reduces discriminative accuracy, requiring further optimization of the discrimination model.

This study addresses the water quality assessment system by introducing a novel approach that combines qualitative and quantitative analysis. A key contribution of this research is the utilization of Piper's trilinear diagram graphical method to analyze the variation pattern of ionic composition in aquifers and water chemistry characteristics through point mapping. By comparing the differences in ionic composition among aquifers and evaluating the proximity to the target water body, an initial classification of water quality is established.This fills the gap in existing research on risk factor internal information mining and machine learning, and provides a foundation for subsequent quantitative water source discrimination. To achieve this, a coupled discrimination model, integrating the R-factor and Support Vector Machine, is developed to uncover inherent characteristics within water chemistry data and automatically establish the mapping relationship between water quality indices and evaluation criteria. This innovative approach enables precise identification of water source types and provides valuable guidance for effective water damage control in practical engineering applications.

Theoretical basis

Principle of R-factor dimensionality reduction

There are m test variables \(Z_{i} (i = 1,2,3, \cdots ,m)\), which may be correlated, and each \(Z_{i}\) contains independently existing common factor \(f_{j} \left( {j = 1,2, \cdots ,p} \right)\), \(P \le m\) where \(Z_{i}\) contains m mutually uncorrelated unique factors \(u1,u2,u3, \cdots ,um\), and u and f are mutually uncorrelated. Each Z can be linearly characterized by f and u as18:

$$\left\{ {\begin{array}{*{20}l} {Z_{1} = a_{11} f_{1} + a_{12} f_{2} + \cdots + a_{,p} f_{p} + c_{1} u_{1} } \hfill \\ {Z_{2} = a_{21} f_{1} + a_{22} f_{2} + \cdots + a_{2p} f_{p} + c_{2} u_{2} } \hfill \\ \vdots \hfill \\ {Z_{m} = a_{m1} f_{1} + a_{m2} f_{2} + \cdots a_{np} f_{p} + c_{m} u_{m} } \hfill \\ \end{array} } \right..$$
(1)

Expressed as matrix:

$$\left( {\begin{array}{*{20}c} {Z_{1} } \\ {Z_{2} } \\ \vdots \\ {Z_{m} } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} {a_{11} } & {a_{12} } & \cdots & {a_{1n} } \\ {a_{21} } & {a_{22} } & \cdots & {a_{2n} } \\ \cdots & \cdots & \ddots & \cdots \\ {a_{m1} } & {a_{m2} } & \cdots & {a_{nn} } \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {f_{1} } \\ {f_{2} } \\ {f_{3} } \\ {f_{4} } \\ \end{array} } \right) + \left( {\begin{array}{*{20}c} {c_{1} u_{1} } \\ {c_{2} u_{2} } \\ \vdots \\ {c_{m} u_{m} } \\ \end{array} } \right).$$
(2)

Abbreviated as:

$$Z = A \cdot F + C \cdot U.$$
(3)

The factor analysis method lies in replacing Z by F through Eqs. (2) and (3), conditioned on \(p < m\), which can streamline the number of dimensions to reduce redundancy. The specific steps are19:

  1. (1)

    Construct sample matrix and perform correlation test,

Collect the p-dimensional random variable \(X = (x_{1} ,x_{2} , \cdots x_{p} )^{T}\) and construct the sample matrix:

$$X = \left[ {\begin{array}{*{20}c} {x_{1}^{T} } \\ {x_{2}^{T} } \\ \vdots \\ {x_{n}^{T} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {x_{11} } & {x_{12} } & \cdots & {x_{1p} } \\ {x_{21} } & {x_{22} } & \cdots & {x_{2p} } \\ \vdots & \vdots & \vdots & \vdots \\ {x_{n1} } & {x_{n2} } & \cdots & {x_{np} } \\ \end{array} } \right].$$
(4)

The KMO or Bartlett test was used to test the correlation of variables, and if the correlation coefficient is less than 0.3, there is no sense of dimensionality reduction. If the correlation is strong means that the commonality of variables can be extracted and is suitable for factor analysis.

  1. (2)

    Processing to obtain the standardized matrix,

The standardization is done through the following:

$$Z_{ij} = \frac{{y_{ij} - \hat{y}_{j} }}{{s_{ij} }}(i = 1,2, \cdots ,p).$$
(5)

The standardized matrix is obtained:

$$Z = \left[ {\begin{array}{*{20}c} {z_{1}^{T} } \\ {z_{2}^{T} } \\ \vdots \\ {z_{n}^{T} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {z_{11} } & {z_{12} } & \cdots & {z_{1p} } \\ {z_{21} } & {z_{22} } & \cdots & {z_{2p} } \\ \vdots & \vdots & {} & \vdots \\ {z_{n1} } & {z_{n2} } & \cdots & {z_{np} } \\ \end{array} } \right].$$
(6)
  1. (3)

    Calculate the correlation matrix,

The correlation coefficient matrix is obtained as follows:

$$Z = \left[ {r_{ij} } \right]_{p \times p} = \frac{{Z^{T} Z}}{n - 1}.$$
(7)

In addition,

$$r_{j}^{2} = \frac{{\sum\limits_{i} = 1^{n} (z_{ij} - z_{j} )^{2} }}{n - 1}(i,j = 1,2, \cdots ,p).$$
(8)

The correlation calculation is performed on the standardized matrix Z. The eigenvector values of \(|R - \lambda I_{P} | = 0\) are obtained based on the features of the correlation matrix, and then the common factors are extracted using the above approach, making the information utilization rate cover more than 85%.

  1. (4)

    Calculate the factor load matrix, rotate the load matrix, and obtain the matrix U,

    $$U = \left[ {\begin{array}{*{20}c} {u_{1}^{T} } \\ {u_{2}^{T} } \\ {u_{3}^{T} } \\ {u_{4}^{T} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {u_{11} } & {u_{12} } & \cdots & {u_{1p} } \\ {u_{21} } & {u_{22} } & \cdots & {u_{2p} } \\ \vdots & \vdots & {} & \vdots \\ {u_{n1} } & {u_{n2} } & \cdots & {u_{np} } \\ \end{array} } \right].$$
    (9)

\(u_{i}\) Principal component vector of the i sample. \(u_{ij}\) Projection of the vector on the unit eigenvector.

Support vector machine principle

Support Vector Machine simplifies complex problems by establishing nonlinear mapping relationships is good at dealing with nonlinear complex systems, and automatically establishes the mapping relationship between water quality indicators and evaluation criteria by performing inner product operations in the transformation space to achieve the purpose of effectively classifying the categories to which the predicted samples belong. The principle is shown in Fig. 1.

Figure 1
figure 1

Support vector machine schematic.

The support vector machine consists of three parts: input layer, intermediate inner product kernel function layer, and output layer. The water source discriminant \(X_{1} ,X_{2} ,X_{3} , \cdots ,X_{n}\), which represents the sample feature information, is input into the Support Vector Machine model, and the input variables will be processed by the intermediate inner product kernel function layer to map them into the high-dimensional space to seek the optimal solution. This does not consider the specific mapping relationship in the transformation stretching process, and the discriminant type of the water source is finally output in the output layer after a nonlinear transformation20.

The procedure of SVM classification operation is as follows21,22:

  • Determine the input sample variable as \(\{ x_{i} \} \subset X = R^{n}\), the output variable as \(y_{i} \in Y = \{ 1, - 1\}\).

  • Select the optimal combination of parameters, where the kernel function is \(K\left( {x_{i} ,x} \right) = \varphi \left( {x_{i} } \right) \cdot \varphi \left( x \right)\).

  • Solve \(\min = \frac{1}{2}\sum\limits_{i = 1}^{L} {\sum\limits_{i = 1}^{L} {a_{i} } } a_{j} y_{i} y_{j} K\left( {x_{i} ,x_{j} } \right) - \sum\limits_{i = 1}^{L} {a_{i} }\) according to the constraints.

  • The optimal solution \(a^{*} = (a_{1} ,a_{2} ,a_{3} ,.....a_{n} )\) is obtained from the above calculation.

After dimensioning, assuming a nonlinear mapping \(\varphi :R^{d} \to H\), the optimization problem can be transformed into:

$$\begin{gathered} \mathop {\min }\limits_{w,b} \frac{{\left\| w \right\|^{2} }}{2} \hfill \\ s.t.y_{i} (w \cdot \varphi (x_{i} ) + b) \ge 1,i = 1,2, \cdots l. \hfill \\ \end{gathered}$$
(10)

Introducing Lagrange multipliers yields:

$$L(w,b,a) = \frac{1}{2}\left\| w \right\|^{2} - \sum\limits_{i = 1}^{j} {\alpha_{i} } \left[ {y_{i} \left( {w \cdot \varphi (x_{i} ) + b} \right) - 1} \right].$$
(11)

The pairwise objective function is:

$$\left\{ {\begin{array}{*{20}l} {\max_{n} \sum\limits_{i = 1}^{i} {\alpha_{i} } - \frac{1}{2}\sum\limits_{i = 1}^{l} {\sum\limits_{j = 1}^{l} {a_{i} } } a_{i} y_{i} y_{i} K(x_{i} ,x_{j} )} \hfill \\ {s.k\sum\limits_{i = 1}^{l} {a_{i} } y_{i} = 0} \hfill \\ {a_{i} \ge 0,i = 1,2, \cdots l} \hfill \\ \end{array} } \right..$$
(12)

\(K(x_{i} ,x) = \varphi (x_{i} ) \cdot \varphi (x)\) is a kernel function that implicitly maps the data and then learns it. To obtain the classification decision function:

$$f\left( x \right) = {\text{sgn}} \left( {\sum\limits_{i = 1}^{l} {y_{i} } a_{i} k\left( {x,x_{i} } \right) + b} \right).$$
(13)

The soft interval with the introduction of the penalty factor C and the relaxation variable \(\xi_{i} (\xi_{i} > 0)\) is optimized as:

$$\begin{gathered} \min \frac{1}{2}\sum\limits_{i = 1}^{l} {\sum\limits_{j = 1}^{l} {\alpha_{i} } } \alpha_{j} y_{i} y_{j} K(x_{i} ,x_{j} ) - \sum\limits_{i = 1}^{l} {\alpha_{i} } \hfill \\ 0 \le a_{i} \le C,i = 1,2, \cdots l. \hfill \\ \end{gathered}$$
(14)

The optimal decision function can be obtained as:

$$f(x) = {\text{sgn}} \left( {\sum\limits_{i = 1}^{l} {a_{i} } y_{i} K(x_{i} ,x) + b} \right).$$
(15)

Optimal parameter solving

In this paper, the grid search method is chosen to divide the grid for the optimal search. Using the fixed-step grid search search23, a violent search method with a combination of coarse and fine, and a large step size in the optimization search space, all the real target points to be searched are cyclically arranged and combined, and the value range of c and g are set to [2–10]. The process and principle of the optimization search are shown in Fig. 2.

Figure 2
figure 2

Grid search optimization process.

The support vector machine steps for the optimization of the grid search method are as follows24,25:

  1. (1)

    Create a coordinate grid Set \(X = \left[ {\begin{array}{*{20}c} {X_{1} ,X_{2} } \\ \end{array} } \right]\)\(Y = \left[ {\begin{array}{*{20}c} {Y_{1} ,Y_{2} } \\ \end{array} } \right]\). Set up the training learner, pick the step size L, put in the parameter search range, and the grid parameter node \({\text{c}} = 2X\),\(g = 2Y\).

  2. (2)

    Using K-fold to find the classification accuracy The samples are divided into N subsets, including the test set and the training set, and the number of subsets is 1 and N-1, respectively, where the training set is used for model building. The accuracy evaluation method is set to obtain the classification accuracy corresponding to the set of parameters, which is used for the training set.

  3. (3)

    Traversing the coordinate grid The combination with the smallest mean square error among all the traversed parameters is selected to obtain the optimal trainer, that is, the combination of (c, g) with the highest classification accuracy, and the optimal trainer accuracy is output.

Analysis of water information

Hydrogeologic conditions in the study area

The coal seams in the Zhaogezhuang Coal mine are predominantly distributed within the Upper Taiyuan Formation (Zhaoge Formation) of the Shanxi Formation (Da Miaozhuang Formation). The presence of faults on the eastern, western, southern, and northern boundaries has resulted in the uplift and exposure of the Ordovician limestone due to tectonic activity. This faulting has led to the development of intense structural karst. Consequently, the gently inclined limestone has formed troughs, and a robust karst development zone has emerged along the eastern boundary fault of the Kaiping block. The overlying Quaternary loose layers exhibit coarse particle size, exceptional permeability, and high water content, serving as a prominent conduit for groundwater movement and constituting the primary strong runoff zone in the regional groundwater system. The hydrodynamic forces are notably strong, displaying characteristics of concentrated conduit flow. Furthermore, a portion of the groundwater in the eastern part of the Shahe River basin in the Zhaogezhuang mine infiltrates the field's interior through the Leizhuang fault, with groundwater flowing from the northeast to the southwest.

The Zhaogezhuang Coal Mine has developed five major aquifer systems from the Cambrian to the Quaternary: the Cambrian aquifer, the Ordovician limestone aquifer, the coal-bearing formation sandstone aquifer, the Tangshan limestone aquifer, and the Quaternary alluvial aquifer.The Quaternary alluvial aquifer in the study area exhibits a relatively thin structure, exerting minimal impact on coal mining operations. In contrast, the Cambrian aquifer predominantly interacts with the Ordovician aquifer. Consequently, the Ordovician aquifer assumes a pivotal role in water influx incidents within the study area, particularly in cases of deep water influx. The principal contributors to these occurrences are the aquifers comprising Ordovician limestone and coal-bearing sandstone within the coal-bearing rock series. To maximize differentiation of water source types, the study selected the six most widely distributed ions in groundwater as discriminative indexes26,27. These include Na+, Ca2+ , Mg2+ , Cl, SO42– and HCO3. K+ was combined with Na+ due to their low variation range.

Data index extraction and collection

For data selection, the Zhaogezhuang mine’s deep mining process was primarily threatened by Ordovician carbonate from the Ordovician aquifer, followed by goaf water damage and sandstone water damage. As a result, four water sample types were chosen: goaf water (from the I aquifer), ordovician carbonate (from the II aquifer), sandstone fracture water from the 13 coal system (from the III aquifer), and sandstone fracture water from the 12 coal seam (from the IV aquifer section A). To screen the typical water sample data, 67 groups were selected from 19 boreholes based on the anion and cation balance test and hydrogeological data of Zhaogezhuang. Among these groups, 18 were from goaf water, 13 from ordovician carbonate, 17 from 13 coal seam sandstone fracture water, and 19 from 12 coal seam sandstone fracture water. The four water sample sources are indicated by I, II, III, and IV respectively. The water samples were submitted to the Testing and Analysis Center of Hebei Coalfield Geology Bureau for chemical analysis. The water quality testing report provided analysis of the main ions and the total hardness (TH) using ion chromatography. Additionally, the bicarbonate ion (HCO3) and total alkalinity (TA) were determined through titration using dilute sulfuric acid-methyl orange. The pH value was measured using a pH tester. Subsequently, the data on the nine discriminant indices of the mine water were organized and presented in Table 1(attached).

Table 1 67 groups of water chemistry data.

Using 67 sets of typical water sample data collected from the Zhaogezhuang mining area, 56 of these were utilized as training samples for the learning machine as shown in Table 2(attached) while the remaining 11 sets were reserved as test samples, labeled G1 to G11 as presented in Table 3. The distribution of anion and cation content was illustrated using a three-dimensional diagram, with the cation content distribution depicted in Fig. 3, and the anion content distribution shown in Fig. 4.

Table 2 Training sample data.
Table 3 Forecast sample data.
Figure 3
figure 3

Diagram of anion distribution.

Figure 4
figure 4

Diagram of cation distribution.

Water chemistry characterization

Analysis of statistical characteristic values

The water chemistry statistical characteristic values were calculated and analyzed based on the water chemistry content information of 67 groups of water samples from Zhaogezhuang mine. In the water sample data of study area, the goafwater is obviously different from the other three types of water samples in ionic composition. Among the anions of the goaf water, the anion with the highest content is SO42–, which is 78.022 mmol·L–1, while the other water samples are HCO3. The goaf water is easier to identify than the other three types of water sources, and can be identified by the content of anions, if the highest content of SO42– can be initially classified as goaf water; in the cations, the highest content in all four types of water samples is Ca2+. In addition, in terms of the overall content of anions and cations in all water samples data, the content of Ca2+  and HCO3 is higher compared to other ions, which indicates that Ca2+ and HCO3 have strong recognition ability.

The goaf water

The hydrochemical index of goaf water are as shown in Table 4. The water chemical composition of the four water samples from Zhaogezhuang differed significantly, and their mass concentrations of substances were related to the water source cycle. In the goaf water, the mass concentration of SO42– was the highest in the distribution of anion content, and its substance concentration ranged from 60.47 mmol·L–1 to 85.55 mmol·L–1, accounting for 78% of the anions, followed by HCO3. Cl had the smallest mass concentration. The cations were mainly Ca2+ and Mg2+, and the lowest mass concentration of Na+. The coefficient of variation is the ratio of the standard deviation to the mean, indicating the degree of dispersion of the data, and the Cl coefficient of variation was the largest at 0.9, followed by Na+  at 0.41, and the rest were smaller, indicating the poor uniformity of ion concentration in the water.

Table 4 Hydrochemical index of goaf water.
Ordovician carbonate

The hydrochemical index of Ordovician Carbonate are as shown in Table 5.The ph of ordovician carbonate is 7.30–7.94, which is weakly alkaline. 86.6% of the anions in ordovician carbonate are mainly HCO3 and SO42–, and the mass concentration of cations are: Ca2+ > Mg2+ > Na+, mainly Ca2+ and Mg2+ accounting for 92.88%, and the water chemistry type is Ca-Mg-HCO3. The variation coefficient of ordovician carbonate is in the following order: SO42– > Cl > Na+ > Mg2+ > HCO3 > Ca2+, and the coefficients of variation of all six indexes are less than 0.5. and the coefficients of variation of the anions Cl, SO42–, HCO3 is greater than that of cations Na+, Mg2+, Ca2+.

Table 5 Hydrochemical index of Ordovician carbonate.
Sandstone fracture water from the 13 coal system

The hydrochemical index of sandstone fracture water from 13 coal system are as shown in Table 6.The highest mass concentration of HCO3 among the anions in the fracture water of the 13-coal sandstone is up to 79.58 mmol·L–1, the content of SO42– and Cl is less, and the highest mass concentration of cations is Ca2+, followed by Mg2+. The 13 coal system sandstone fracture water coefficient of variation is not much different except for Na+, which is less than 0.1, and the ion concentration is dispersed more uniformly.

Table 6 Hydrochemical index of sandstone fracture water from the 13 coal system.
Sandstone fracture water from the 12 coal system

The anions in the fracture water of the 12 coal seam sandstone are mainly HCO3with a mean mass concentration of 71.79 mmol·L–1. The cations are dominated by Ca2+ up to 64.36 mmol·L–1, followed by Mg2+ with a mean concentration of 32.57 and finally Na+. The variation coefficients of sandstone fracture water in the 12 coal seam are in the following order: Mg2+  > Ca2+ > Na+  > Cl > HCO3, and the variation coefficient of Mg2+ is as high as 0.69.

The hydrochemical index of sandstone fracture water from 12 coal system are as shown in Table 7. In order to study the hydraulic connection between individual aquifers, the degree of connection K between them can be calculated quantitatively28,29, and since the Cl concentration is minimally disturbed by other factors and is mainly influenced by the formation itself, the degree of hydraulic connection between two aquifers can be obtained by calculating the difference between their average Cl concentrations .If the K value of the hydraulic connection between the two aquifers is less than 0.2, it means that they have a strong hydraulic connection, if K is greater than 0.4, it means that the hydraulic connection between the two aquifers here is weak, if the final calculated K value is between 0.2 and 0.4, it means that the hydraulic connection is moderately strong30,31.

$$K = 0.5 \times \frac{{Cl_{1} - Cl_{2} }}{{(Cl_{1} + Cl_{2} )}}.$$
(16)
Table 7 Hydrochemical index of sandstone fracture water from the 12 coal system.

Cl1 The average Cl concentration in aquifer 1. Cl2 The average Cl concentration in aquifer 2.

Through Eq. (16), the K values of goaf water and Ordovician carbonate, sandstone fracture water of 13 coal system and sandstone fracture water of 12 coal system are all 0.25, and the degree of hydraulic connection is moderate. The K value of the hydraulic connection between the goaf water and the sandstone fracture water of 13 coal system is 0.025, and the K value of the fracture water with the 12 coal seam sandstone is 0.03, which is a weak hydraulic connection; the K value of the fracture water with the 13 coal system sandstone and the 12 coal seam sandstone fracture water is 0.001, which is a very weak hydraulic connection. It can be summarized that there is a certain hydraulic connection between the goaf water and other aquifers, indicating the existence of connection and increasing the difficulty of discrimination.

Piper trilinear diagram analysis

The hydrogeological conditions in Zhaogezhuang Coal Mine are characterized by complexity and variability. As demonstrated by the previous analysis of the goaf water composition and other water sources, they exhibit distinguishable differences. To further investigate the distribution patterns of aquifer water samples, the Piper trilinear diagram method was employed for analysis. The ion contents were represented as points on the diagram, allowing for inference of the water chemistry type and quality pattern of the aquifer based on the scatter position of the water samples.

The water samples of the study area were drawn for hydrochemistry analysis using piper trilinear diagram shown in Fig. 5. The goaf water was located in the upper right corner, near Ca2+, Mg2+ and SO42-, Cl, mainly Ca·Mg-Cl·SO4 type, and individually Ca·Mg-SO4 type. The water sample of Ordovician carbonate water is located in the left position of the diamond-shaped area, and the water quality type is Ca·Mg-HCO3 type. By observing the left triangle area, we can find that the cations in the Ordovician carbonate sample are mainly Mg2+ and Ca2+, and the anions are mainly HCO3 and SO42– in the right triangle area. Sandstone fracture water from the 13 coal system is located in the middle and left position, and the cations are mainly located in Ca2+ and The anions are scattered in the end elements with high proportion of HCO3 and SO42–, and the water quality type is Ca·Mg-HCO3 type. sandstone fracture water samples from the 13 coal system are highly similar to the 13 in the trilinear diagram, and the water chemistry type is Ca·Mg-HCO3 type, the cations are mainly Ca2+ and Mg2+, and the anions are mainly HCO3 and CO32–. In summary, the water quality types of Ordovician carbonate, sandstone fissure water from 13 or 12 coal seam are the same, with overlapping characteristics and inconspicuous distribution boundaries, which need further quantitative discrimination.

Figure 5
figure 5

Hydrochemistry analysis trilinear diagram.

Model building and application

Dimensionality reduction based on R-factor

The normalization process is performed before the operation to make it lie in the interval of [0, 1] to solve the comparability between indicators and ensure the stability of calculation.The normalization of water sample data are as shown in Table 8 (attached).

Table 8 Normalization of water sample data.

There is a non-linear association between the indicators, and to reduce the correlation between the data, the optimal number of common factors for the six indicators of sodium ion, calcium ion, magnesium ion, chloride ion, sulfate ion, and bicarbonate ion was determined to be 3, denoted as Y1, Y2, and Y3. SPSS software was used to analyze 67 groups of samples and 6 evaluation indicators of Zhaogezhuang based on the correlation calculation steps of R-type factors. The eigenvalues and contribution rates of the main factors were as Table 9.

Table 9 Characteristic values and contribution rates of main factors.

The cumulative contribution rate of the first three principal factors reaches 96.660%, which indicates that the factors extracted by dimensionality reduction contain 96.660% of the information of the original index data. When the cumulative contribution rate reaches 80%, it shows that the extracted principal factors are reasonable and effective, which indicates that these three principal factors cover most of the water chemistry information and can effectively replace the original indexes.

The factor correlation matrix is as follows:

$$A = \left[ {\begin{array}{*{20}c} {1.000} & { - 0.416} & { - 0.167} & {0.231} & { - 0.104} & {0.080} \\ { - 0.416} & { - 1.000} & { - 0.799} & {0.393} & { - 0.362} & {0.286} \\ { - 0.167} & { - 0.799} & {1.000} & {0.589} & {0.000} & { - 0.379} \\ {0.231} & {0.393} & { - 0.589} & {1.000} & { - 0.866} & {0.79} \\ { - 0.104} & { - 0.362} & {0.480} & { - 0.899} & {1.000} & { - 0.987} \\ {0.080} & {0.286} & { - 0.379} & {0.791} & {0.987} & {1.000} \\ \end{array} } \right].$$
(17)

The correlation coefficient above 0.8 indicates a strong correlation, while between 0.3 and 0.8 indicates a moderate correlation, and below 0.3 indicates no correlation. The correlation coefficient between Na+  and Ca2+  is − 0.416, indicating a weak correlation, while with Mg2+  is − 0.167, with Cl is 0.231, with SO42– is − 0.104, and with HCO3 is 0.080, all of which have no correlation. The correlation coefficient between Ca2+ and Mg2+ is − 0.799, indicating weak correlation between Ca2+ and other ions. Similarly, Mg2+ is not correlated with Na+  and weakly correlated with other ions, while Cl and SO42– are strongly correlated and SO42– and HCO3 are strongly correlated.

Using the maximum variance orthogonal rotation method, SPSS rotates to obtain the rotated component matrices. The factor loading matrix and the rotated component matrix were:

$$Z^{\prime}_{(3 \times 6)} = \left| {\begin{array}{*{20}c} {0.101} & {0.788} & {0.600} \\ {0.622} & { - 0.754} & {0.180} \\ { - 0.755} & {0.333} & { - 0.364} \\ {0.922} & {0.215} & { - 0.010} \\ { - 0.929} & { - 0.221} & {0.182} \\ {0.872} & {0.25} & { - 0.384} \\ \end{array} } \right|\quad Z^{\prime}_{(3 \times 6)} \equiv \left| {\begin{array}{*{20}c} {0.096} & { - 0.053} & {0.990} \\ { - 0.286} & { - 0.892} & { - 0.398} \\ {0.279} & { - 0.933} & { - 0.393} \\ { - 0.838} & { - 0.267} & {0.251} \\ { - 0.873} & { - 0.211} & { - 0.022} \\ {0.978} & {0.263} & { - 0.06} \\ \end{array} } \right|.$$

The component conversion matrix is:

$$Z^{\prime\prime}_{(3 \times 3)} = \left[ {\begin{array}{*{20}c} {0.835} & {0.548} & {0.055} \\ {0.338} & { - 0.589} & {0.734} \\ { - 0.434} & {0.594} & {0.677} \\ \end{array} } \right].$$

Three new main components Y1, Y2, and Y3 were extracted, and the factor score coefficient matrix based on SPSS operations was as follows:

$$U = \left[ {\begin{array}{*{20}c} { - 0.072} & {0.079} & {0.838} \\ { - 0.108} & {0.521} & { - 0.241} \\ {0.152} & { - 0.608} & { - 0.264} \\ {0.277} & {0.052} & {0.116} \\ { - 0.410} & { - 0.204} & { - 0.128} \\ \end{array} } \right],$$

According to the factor score coefficient matrix, the expressions of the main factors Y1, Y2, and Y3 are:

$$\left\{ {\begin{array}{*{20}l} {Y_{1} = - 0.072X_{1} - 0.108X_{2} + 0.152X_{3} + 0.277X_{4} - 0.410X_{5} } \hfill \\ {Y_{2} = 0.079X_{1} + 0.521X_{2} - 0.608X_{3} + 0.052X_{4} - 0.204X_{5} } \hfill \\ {Y_{3} = 0.838X_{1} - 0.241X_{2} - 0.264X_{3} + 0.116X_{4} - 0.128X_{5} } \hfill \\ \end{array} } \right..$$

The original data of water samples (I), water samples (II), water samples (III), and water samples (IV) from Zhaogezhuang mine were substituted into the model expressions of the three main factors Y1, Y2, and Y3, and the factor score matrices were as follows:

$$\left( \mu \right)_{18 \times 3} { = }\begin{array}{*{20}l} {\left[ {\begin{array}{*{20}l} { - 2.427} \hfill & {2.841} \hfill & {0.952} \hfill \\ { - 1.011} \hfill & { - 0.465} \hfill & { - 1.304} \hfill \\ { - 0.933} \hfill & { - 0.635} \hfill & {1.059} \hfill \\ { - 1.374} \hfill & { - 1.276} \hfill & { - 0.770} \hfill \\ { - 0.621} \hfill & { - 0.066} \hfill & {0.284} \hfill \\ { - 1.476} \hfill & { - 0.164} \hfill & { - 0.689} \hfill \\ { - 1.511} \hfill & { - 0.660} \hfill & { - 0.801} \hfill \\ { - 1.636} \hfill & { - 1.159} \hfill & {2.954} \hfill \\ { - 1.389} \hfill & { - 1.881} \hfill & { - 0.448} \hfill \\ { - 1.349} \hfill & { - 1.170} \hfill & { - 1.730} \hfill \\ { - 1.441} \hfill & { - 1.052} \hfill & { - 0.711} \hfill \\ { - 1.506} \hfill & { - 1.012} \hfill & { - 0.163} \hfill \\ { - 1.518} \hfill & { - 0.835} \hfill & { - 0.234} \hfill \\ { - 1.377} \hfill & { - 0.355} \hfill & {1.104} \hfill \\ { - 1.516} \hfill & {0.228} \hfill & {1.266} \hfill \\ { - 1.609} \hfill & { - 0.066} \hfill & {1.085} \hfill \\ { - 1.699} \hfill & { - 0.414} \hfill & {0.667} \hfill \\ { - 1.623} \hfill & { - 0.492} \hfill & {0.284} \hfill \\ \end{array} } \right]} \hfill \\ \end{array} \left( \mu \right)_{18 \times 3} { = }\left[ {\begin{array}{*{20}c} { - 0.607} & { - 0.156} & { - 0.731} \\ { - 0.497} & { - 0.107} & { - 0.901} \\ { - 0.556} & {0.111} & { - 1.160} \\ {0.280} & {1.199} & { - 0.906} \\ { - 0.464} & { - 0.140} & { - 0.932} \\ { - 0.693} & {0.626} & { - 1.268} \\ { - 0.672} & {0.281} & { - 0.833} \\ {0.373} & {1.885} & {0.727} \\ { - 0.049} & {2.636} & {0.921} \\ { - 0.021} & {2.541} & {1.268} \\ {0.257} & {1.968} & {1.211} \\ {0.085} & {1.734} & {1.449} \\ {0.837} & { - 0.732} & {0.547} \\ \end{array} } \right].$$
$$(\mu )_{17 \times 3} = \left[ {\begin{array}{*{20}r} \hfill {1.074} & \hfill { - 0.886} & \hfill { - 0.960} \\ \hfill {0.947} & \hfill { - 0.2529} & \hfill { - 0.283} \\ \hfill {0.944} & \hfill { - 0.906} & \hfill {0.179} \\ \hfill {1.065} & \hfill { - 0.943} & \hfill { - 0.228} \\ \hfill {0.581} & \hfill {1.069} & \hfill { - 2.249} \\ \hfill {0.812} & \hfill { - 0.581} & \hfill {0.250} \\ \hfill {0.916} & \hfill { - 0.863} & \hfill {0.364} \\ \hfill {0.963} & \hfill { - 0.729} & \hfill {0.105} \\ \hfill {0.910} & \hfill { - 0.410} & \hfill { - 0.428} \\ \hfill {0.963} & \hfill { - 1.004} & \hfill {0.100} \\ \hfill {0.927} & \hfill { - 0.694} & \hfill {1.162} \\ \hfill {0.852} & \hfill { - 0.918} & \hfill {1.804} \\ \hfill {1.005} & \hfill { - 1.054} & \hfill {1.200} \\ \hfill {1.068} & \hfill { - 0.700} & \hfill { - 0.182} \\ \hfill {0.993} & \hfill { - 1.254} & \hfill {1.103} \\ \hfill {0.982} & \hfill { - 1.108} & \hfill {1.684} \\ \hfill {1.030} & \hfill { - 0.890} & \hfill {1.251} \\ \end{array} } \right](\mu )_{19 \times 3} = \left[ {\begin{array}{*{20}r} \hfill {0.538} & \hfill {0.679} & \hfill { - 0.904} \\ \hfill {0.633} & \hfill {0.212} & \hfill { - 0.196} \\ \hfill {0.439} & \hfill {0.826} & \hfill {0.211} \\ \hfill {0.560} & \hfill {0.512} & \hfill { - 0.055} \\ \hfill {0.521} & \hfill {0.499} & \hfill {0.232} \\ \hfill {0.526} & \hfill {0.707} & \hfill { - 0.272} \\ \hfill {0.549} & \hfill {0.595} & \hfill { - 0.146} \\ \hfill {0.669} & \hfill {0.177} & \hfill { - 0.255} \\ \hfill {0.980} & \hfill { - 0.512} & \hfill {1.176} \\ \hfill {0.796} & \hfill { - 0.055} & \hfill { - 0.383} \\ \hfill {0.714} & \hfill {0.188} & \hfill { - 0.011} \\ \hfill {0.933} & \hfill { - 0.242} & \hfill { - 1.267} \\ \hfill {0.458} & \hfill {0.282} & \hfill { - 1.702} \\ \hfill {0.556} & \hfill {0.662} & \hfill {0.174} \\ \hfill {0.805} & \hfill {0.128} & \hfill { - 1.836} \\ \hfill {0.540} & \hfill {0.519} & \hfill { - 0.495} \\ \hfill {0.564} & \hfill {0.480} & \hfill {0.925} \\ \hfill {0.436} & \hfill {0.952} & \hfill { - 0.626} \\ \hfill {0.466} & \hfill {1.077} & \hfill { - 1.002} \\ \end{array} } \right].$$

R-SVM model establishment

The R- SVM model is shown in Fig. 6. First, the R-factor is used to initially reduce the dimensionality of the data, and the three common factors Y1, Y2, and Y3 are used as the input variables of the model, and the four types of water sources H are used as the output of the model to establish the mapping \(F({\text{Y1,Y2,Y3}}) \to H\), which automatically searches for complex connections between the input variables and the types of water sources. The grid search method is used to find the optimal combination of parameters for the Support Vector Machine model. The training set data is then used to train the model, and the trained model is used to predict the water sample types for the testing set data. The predicted types are then compared with the actual types to correct for any deviations. This process is repeated until the model achieves a satisfactory level of accuracy in predicting the types of water samples.

Figure 6
figure 6

Water source discrimination of R-SVM.

Parameter search and model application

Six indicators of sodium ion, calcium ion, magnesium ion, chloride ion, sulfate ion and bicarbonate ion are used as input variables of the SVM, and four water source types of goaf water, Ordovician carbonate, sandstone fracture water from the 13 coal system and sandstone fracture water from the 12 coal system are used as outputs of the model to establish the mapping relationship between the two and seek the nonlinear law of the two by SVM. Firstly, 55 sets of training samples and 11 sets of prediction samples are substituted into the grid search method to run the search for parameters, and the range of values of the parameters c and g of the grid search method are set \({\text{g}} \in \left[ {2^{ - 10} ,2^{10} } \right]\) \({\text{c}} \in \left[ {2^{ - 10} ,2^{10} } \right]\), and the step size L = 0.2 according to the operation process of SVM.

The three public factors of Zhaogezhuang after dimensionality reduction were used as the input variables of the model, and four types of goaf water, Ordovician carbonate, sandstone fracture water from the 13 coal system, and sandstone fracture water from the 12 coal system of Zhaogezhuang mine were used as the outputs of the model to establish the mapping relationship about the public factors and water source types. The factor scores of the 67 sets of sample data after dimensionality reduction were substituted into the SVM model of grid search method for finding the best model for training, and the best parameter combination c = 1 and g = 2.8284 was finally obtained.The result of the optimization search is shown in Fig. 7

Figure 7
figure 7

Grid Optimization after dimensionality reduction.

Substituting c = 1 and g = 2.8284 into the SVM model, the type attributes were predicted for 11 sets of data to be discriminated, and the final results are shown in Fig. 8 and Table 10. The model misjudged Type II ordovician carbonate as Type III sandstone fracture water from the 13 coal system, indicating that the model is suitable for water source discrimination in Zhaogezhuang Coal Mine and can effectively make the distinction.

Figure 8
figure 8

Classification prediction diagram after dimensionality reduction.

Table 10 Comparison of model operation results.

Table 11 presents a comparative analysis of model performance across different optimization types. The accuracy and precision metrics were employed to evaluate the models' efficacy. The Fisher optimization type exhibits the lowest performance in terms of accuracy and precision. The Grid optimization type shows a significant improvement in both accuracy and precision compared to the Fisher type. Notably, the R-type grid optimization type demonstrates the highest level of performance, surpassing both the Fisher and Grid types in terms of accuracy and precision.

Table 11 Comparison of model performance.

Based on the information provided, it seems that the coupled discriminant model of R-SVM was able to provide more targeted and effective characterization of water sources compared to other multi-model prediction results presented in Table 11. The R-factor simplification was used as a new discriminant to improve the model’s independence component. The coupled discriminant model of R-SVM can also complement the qualitative analysis of water chemistry and provide rapid identification of water sources.

Conclusion

As coal mine of submarine mining, the identification and prediction of mine water inrush source is of great significance to the safety and efficiency of mine production in Zhaogezhuang Coal Mine. In order to prevent and control the water inrush, it is of great practical significance to identify the mine water source effectively and accurately. Through the analysis of the water source data of different parts in the mine, the effective water source discrimination model was established to verify its effectiveness and practicability.The conclusions of the study are as follows:

  1. (1)

    The chemical composition data of 67 water samples of Zhaogezhuang Coal Mine were collected. According to the chemical composition analysis of selected mine water sources, the main ions identified in water sources were Na+, Ca2+, Mg2+, Cl, SO42– and HCO3. The water inrush sources in the mining area were divided into four categories: goaf water was type I, ordovician carbonate was type II, sandstone fracture water from 13 coal seam was type III, and from 12 coal seam was type IV. The analysis and comparison of water source information provide support for the establishment of water source discrimination model.

  2. (2)

    R factor analysis was used to reduce the dimensionality of the original data, resulting in three common factors (Y1, Y2, and Y3) and factor score data for water source data. This approximation of indicator attributes filtered out redundant features and improved efficiency.

  3. (3)

    The coupled model of R-SVM achieved a classification accuracy of 90.90% in water source discrimination for the Zhaogezhuang mine. Compared to traditional qualitative approaches, this model explores the internal laws of the data and provides accurate discrimination, improving upon the Fisher discrimination function and SVM model alone.