Introduction

One of the most common and dangerous natural risks associated with coal mining is methane, which can mix with air and cause disasters. Extraction of gases from coal mines is a fundamental measure taken to prevent and control disasters and accidents1,2. Drainage boreholes are used to extract gas from coal seams3. However, the concentration of gas extracted from coal seams by boreholes in China is generally low because of leaks. Air flows through a channel into a borehole and reduces the negative pressure to enable gas extraction. As a result, low concentrations of gas are extracted by boreholes4,5. The effective identification of the presence and types of gas leaks is vital to improve the efficiency of gas extraction.

Studies of the mechanism of borehole leakage have led to the development of physical models. Zhang T6 explained that air leakage was caused by a local change in the strain7,8 around a borehole. Zhang C9 studied the mechanism of air leakage in the cracks around a borehole and concluded that the leakage mechanisms of fractures around boreholes differed depending on the extraction stage. This insight provided a theoretical basis for the classification and identification of leaks in boreholes. To further analyse the flow state and characteristic changes of air leaks in boreholes, some scholars constructed a physical model to determine the mechanism of leakage. Zhang J10,11 combined numerical simulations of the leakage mechanism around a borehole in coal with the rheological and viscoelastic–plastic characteristics of coal to build a dynamic leakage model of the borehole. Based on an analysis of flow coupling between methane and air in borehole fractures, Fan J12 constructed a flow model of air leakage coupling components in boreholes by using the finite difference method (FDM). Zhang Y13 constructed a physical model of air leakage in boreholes and classified three types of leaks according to their source, i.e., roadway fissure zones, borehole fissure zones, and materials used in sealing sections of boreholes. Wang Z14 analysed the mechanism of air leakage from boreholes by numerical simulation and established a dynamic leakage model of drainage boreholes. Wang H15 and Zhang Y16 discussed the influence of air leakage on gas concentration by studying the influence of factors around roadways and boreholes and constructed an air-gas mixed-flow coupling model. Their physical model explained the mechanisms of gas extraction and air leakage in boreholes and provided a theoretical basis for the classification of air leaks from boreholes. However, the construction of a physical model of air leakage in a drilling hole is complicated and cannot be applied quickly to guide field practice. Therefore, there is still a need for an efficient mathematical model for identification of leaks.

Advances in computer science, applied mathematics and artificial intelligence have promoted in-depth research on identification models for use in coal mining17,18,19,20,21,22. However, the algorithms used to construct these models are subject to limitations. The hierarchical cluster analysis method cannot redistribute existing data and has a small number of iterations. The chaotic immune particle swarm optimization-probabilistic neural network (CIPSO-PNN) optimizes the PNN, but the process of finding the best solution is long, and the model is complex. In Fisher's discriminant analysis, the number, representativeness and correctness of the learned samples directly impact the recognition accuracy of the model. In addition, the algorithms of existing discrimination classification models cannot adapt to differences in the relationships of various data characteristics with multiparameter nonlinearity, which is important for discrimination of leaks in extraction boreholes. naïve Bayes classification, a classification method based on the Bayes principle and independent assumption of feature conditions, has stable classification efficiency23. Compared with the above classification algorithms, decision trees and artificial neural networks perform better on small amounts of sample data and have the minimum error rate, and they have been widely used in coal mines24,25. Therefore, they have been applied to identify in air leaks in gas boreholes. However, because of low sensitivity to linear data, improvements are needed.

In summary, research is now relatively mature for the development of models for leaks in boreholes for gas extraction based on studies of the mechanism of air leakage, and models are widely used in the field of coal mining. However, research is limited on using machine learning methods to analyse multisource characteristic information about air leakage and establish a mathematical model for the recognition of leaks from boreholes. In this study, we collected data for leaks from boreholes and applied multisource data fusion theory (MDF) and principal component analysis (PCA). We also improved the traditional naive Bayesian classification (NBC) system and established mathematical models to identify types of air leaks from boreholes. In this study, this model fills a gap by supporting an algorithm to identify types of leaks in boreholes used to extract gases from coal mines, provides a theoretical basis and accurate guidance for the evaluation of the quality of the sealing of boreholes and borehole repairs, and supports the improvement of the application of boreholes to extract gases from coal mines.

Construction of an improved naive Bayesian model for the identification of air leaks from gas drainage boreholes

Feature information selection

According to previous studies4,15,26,27,28,29,30,31, air leaks from gas drainage boreholes can be divided into the three types shown in Table 1.

Table 1 Types of borehole leaks.

In Fig. 1, there are many cracks in the coal seam. Due to the poor sealing effect, air from the roadway enters the borehole through cracks in the coal seam, which leads to the leakage of the borehole. In addition, the connections between the extraction pipes are not close, which results in low extraction concentrations. In this paper, according to the actual situation of the gas drainage borehole in the 229 working face of the Xiashijie Coal Mine, eight characteristics can reflect the gas drainage effect of the borehole, including A1: extraction flow, A2: gas concentration at 0 m, A3: gas concentration at 2 m, A4: gas concentration at 6 m, A5: gas concentration at 9 m, A6: gas concentration at 12 m, A7: negative pressure at the orifice and A8: negative pressure at the extraction, and they are used in the model for the identification of leaks in boreholes for gas extraction.

Figure 1
figure 1

Air leakage characteristics of a borehole for extraction of gas.

Identification model construction

A naive Bayes classifier (NBC) and the eight characteristics (above) were used as the main theory for model construction. Since the NBC could not accommodate the missing data for air leakage in gas extraction boreholes, and since the identification and classification accuracy of information with strong correlations is not high, some data easily have a greater impact on the overall model32. As shown in Fig. 2, by using MDF and principal component analysis (PCA) to improve the traditional NBC, a model for the identification of air leaks from a borehole for gas extraction was established as follows:

Figure 2
figure 2

Flow chart of building the model.

Data preprocessing

The existing m-dimensional sample data of gas drainage borehole leakage \({\mathbf{x}} = \left( {x_{1} ,x_{2} \ldots x_{m} } \right)\) with n independent observations, \(\left( {{\mathbf{x}}_{{\mathbf{1}}} {\mathbf{,x}}_{{\mathbf{1}}} {\mathbf{ \ldots x}}_{{\mathbf{n}}}^{{\text{T}}} } \right)\), is used as the observation sample to build the gas drainage borehole leakage data matrix:

$${\mathbf{X}} = \left[ {{\mathbf{x}}_{{\mathbf{1}}} {\mathbf{,x}}_{{\mathbf{2}}} {\mathbf{ \ldots x}}_{{\mathbf{n}}} } \right]^{{\mathbf{T}}} = \left[ {\begin{array}{*{20}c} {x_{11} } & {x_{12} } & \ldots & {x_{1m} } \\ {x_{21} } & \ldots & \ldots & \ldots \\ \ldots & \ldots & {x_{ij} } & \ldots \\ {x_{n1} } & \ldots & \ldots & {x_{nm} } \\ \end{array} } \right]$$
(1)

\({\mathbf{x}}_{{\mathbf{i}}} = \left( {x_{i1} ,x_{i2} \ldots x_{im} } \right)\) represents the observation sample of group i, i = 1,2…,n, and\(x_{ij}\) represents the jth variable of the ith group of observation samples, where j = 1,2…,m.

Following MDF theory, the training samples for gas drainage borehole leaks are processed at the data level33, and the processed data are standardized with Eqs. (2)–(4).

$$x_{ij}^{1} = \frac{{x_{ij} - \overline{x}_{j} }}{{\sqrt {s_{jj} } }}\quad i = 1,2....{\text{n}};\;\;j = 1,2....{\text{m}}$$
(2)
$$\overline{x}_{j} = \frac{1}{n}\sum\limits_{i = 1}^{n} {x_{ij} }$$
(3)
$$s_{ij} = \frac{1}{m - 1}\sum\limits_{j = 1}^{m} {\left( {x_{ij} - \overline{x}_{j} } \right)^{2} }$$
(4)

In Eqs. (2)–(4), \(x_{ij}^{1}\) represents the standardized single sample data, \(\overline{x}_{j}\) represents the sample mean for the same characteristic information, and \(s_{jj}\) represents the covariance of single sample data. Equations (2)–(4) can eliminate the influence of the data dimension. The standardized gas drainage borehole leakage data are still expressed in X.

Principal component selection

The correlation coefficient matrix of the gas drainage borehole leakage sample data after standardization is calculated as follows:

$$R = \left[ {r_{ij} } \right]_{n*n} = \frac{1}{m}XX^{{\mathbf{T}}}$$
(5)

where

$$r_{ij} = \frac{1}{m - 1}\sum\limits_{i = 1}^{m} {x_{il} x_{lj} \quad i,j = 1,2....{\text{n}}}$$
(6)

The characteristic equation of the sample correlation matrix R is obtained with k eigenvalues and the corresponding k unit eigenvectors:

$$\begin{gathered} \left| {{\mathbf{R}} - \lambda {\mathbf{I}}} \right| = 0 \hfill \\ \lambda_{1} \ge \lambda_{2} \ge \lambda_{3} \ge \ldots \lambda_{m} \hfill \\ \end{gathered}$$
(7)

In Eq. (7), \(\lambda\) is the characteristic value of the characteristic equation corresponding to the characteristic information, and the values are sorted according to the size of the characteristic value, from large to small.

The cumulative contribution rate and cumulative variance contribution rate are calculated as follows:

$$z = \frac{{\lambda_{k} }}{{\sum\limits_{i}^{m} {\lambda_{m} } }}$$
(8)
$$z_{i} = \sum\limits_{j = 1}^{k} {\left( {\frac{{\lambda_{k} }}{{\sum\limits_{i}^{m} {\lambda_{m} } }}} \right)}$$
(9)

The principal component \(z_{i} \ge 85\%\) is determined to reduce the dimensionality and eliminate information overlap.

Construction of new indexes

The unit eigenvector corresponding to the first k principal components is obtained:

$$a_{i} = \left( {a_{1i} ,a_{2i} \ldots a_{ni} } \right)^{{\mathbf{T}}} ,\;\;i = 1,2,3 \ldots {\text{k}}$$
(10)

Linear transformation with k unit eigenvectors as coefficients yields:

$$Y_{i} = a_{i}^{T} {\mathbf{x}}\quad i = 1,2,3 \ldots {\text{k}}$$
(11)

That is, after orthogonal transformation, potentially correlated variables or influencing factors in the gas drainage borehole leakage data are linearly combined to obtain a set of new linear irrelevant variables, simplify the data structure, extract the data characteristics, and construct a new improved naive Bayes identification index.

As shown in Fig. 3, for the characteristic information obtained in the lower section, the original eight-dimensional sample characteristic information (A1, A2…, A8) is converted into a new p-dimensional identification index (Y1, Y2…, Yk) k < 8. The associated characteristic information is combined and retains most of the information of the original variables34 while eliminating overlapping information.

Figure 3
figure 3

Construction of the new indexes for air leakage from a gas drainage borehole.

Modelling

The data matrix \({\mathbf{Y}} = [{\mathbf{y}}_{1} ,{\mathbf{y}}_{2} ...{\mathbf{y}}_{k} ]\) is constructed according to the new leakage index of the drainage borehole. Among them,\(y_{i} = [y_{1} ,y_{2} ...y_{n} ]^{{\mathbf{T}}}\), \(y_{i}^{\left( j \right)}\) is the jth feature of sample i, \(y_{t}^{\left( j \right)} \in \{ a_{j1} ,a_{j2} , \ldots a_{jsn} \}\), and \(a_{jl}\) is the possible value of the jth feature, i = 1, 2, 3… n; j = 1, 2, 3… k; l = 1, 2, 3… sn. The sample category is \(G = \{ g_{1} ,g_{2} \ldots g_{T} \}\), \(y_{i} \in \{ g_{1} ,g_{2} \ldots g_{T} \}\).

The prior probability and conditional probability of the air leakage category of the extraction borehole are calculated. Because the characteristic information data optimized by the principal component are normally distributed, the Gaussian function is used to determine the conditional probability, as shown in Eqs. (12)–(13).

$$P\left( {Y = {\text{g}}_{t} } \right) = \frac{{\sum\limits_{i = 1}^{n} {I\left( {{\text{y}}_{i} = {\text{g}}_{t} } \right)} }}{n}\quad {\text{t = 1,2,3}} \ldots {\text{T}}$$
(12)
$$P\left( {{\text{y}}_{i}^{(j)} = {\text{a}}_{{jl_{jl} }} \left| {Y = g_{t} } \right.} \right) = \frac{1}{{\sqrt {2\sigma_{{Y = g_{t} }}^{2} } }}e^{{\frac{{ - \left( {q_{{y^{(j)} = a_{jl} }} - u_{{y = g_{t} }} } \right)}}{{2\sigma_{{Y = g_{t} }}^{2} }}}}$$
(13)

In Eq. (13), \(u_{{y = g_{t} }}\) is the normalized expected value of the sample data of category \(g_{t}\); \(\sigma_{{Y = g_{t} }}^{{}}\) is the normalized variance of the sample data of category \(g_{t}\). The posterior probability is calculated for the given leakage sample data \(y_{i} = [y_{1} ,y_{2} ...y_{n} ]^{{\mathbf{T}}}\) of the extraction borehole.

$$P\left( {Y = g_{t} } \right)\prod\limits_{j = 1}^{k} {P(Y^{(j)} = y^{(j)} \left| {Y = g_{t} } \right.)}$$
(14)

The category of an actual case is determined, and the probability model of gas leakage identification of the extraction borehole is built as shown in Eq. (15):

$$G_{{y_{i} }} = \arg \mathop {\max }\limits_{{g_{t} }} P(Y = g_{t} )\prod\limits_{j = 1}^{k} {P(Y^{(j)} } = y^{(j)} \left| {Y = g_{t} } \right.)$$
(15)

where \(G_{{y_{i} }}\) is the maximum posterior probability value of the corresponding category of the leakage of the extraction borehole.

In the actual extraction process, there are gas drainage boreholes with good drainage effects. When the sealing effect is good, the difference in gas concentration at various positions is small. Combined with the air leakage characteristics of the drainage borehole, the gas concentration at different positions in the drainage borehole is defined as \(C_{{y_{i} }}^{\left( b \right)}\), i = 1, 2… n, for borehole gas concentration positions b = 0, 1, 2…,b.

$$\frac{{C_{{y_{i} }}^{{\left( {b - 1} \right)}} }}{{C_{{y_{i} }}^{\left( b \right)} }} \ge \frac{{C_{{y_{i} }}^{\left( 0 \right)} }}{{C_{{y_{i} }}^{\left( b \right)} }} \ge 90\%$$
(16)

The corresponding borehole is a borehole with a good gas drainage effect, and there is no need to evaluate the type of leakage. Incorporating Eq. (15), the gas drainage borehole leakage identification model can be constructed as follows:

$$\left\{ {\begin{array}{*{20}l} {\frac{{C_{{y_{i} }}^{{\left( {b - 1} \right)}} }}{{C_{{y_{i} }}^{\left( b \right)} }} > \frac{{C_{{y_{i} }}^{\left( 0 \right)} }}{{C_{{y_{i} }}^{\left( b \right)} }} \ge 90\% } \hfill \\ {G_{{y_{i} }} = \arg \mathop {\max }\limits_{{g_{t} }} P(Y = g_{t} )\prod\limits_{j = 1}^{k} {P(Y^{(j)} } = y^{(j)} \left| {Y = g_{t} } \right.)} \hfill \\ \end{array} } \right.$$
(17)

This identification model can realize the identification of leakage and leakage type.

Model application

Data acquisition and preprocessing

The model was applied to the gas drainage borehole of the 229 working face in the Xiashijie Coal Mine of Tongchuan, as shown in Fig. 4. Mainly No. 4 coal is mined in the working face, the thickness of the coal seam is 0 ~ 34.28 m, the original gas content of the coal seam is 3.48 m3/t, and the gas pressure is 0.4 MPa, which classifies the mine as a high-gas mine. The gas in the coal seam is extracted by a parallel borehole arrangement.

Figure 4
figure 4

Study area.

For the purpose of this study, the characteristic information of gas concentration, flow rate and negative pressure at different depths of the borehole were effectively measured. We designed a detection device to connect each borehole and collect data; the device is shown in Fig. 5. By changing the length of the probe, we monitored the gas concentration and extraction flow at different positions in the borehole. The collected data were used to establish the model discussed in this paper.

Figure 5
figure 5

Gas extraction and detection device.

According to the actual layout of the test boreholes, 30 groups of 240 monitoring data of gas flow, concentration and negative pressure sensors were selected, and they were divided into 18 groups of training samples and 12 groups of test samples according to the ratio of 6:4. The 18 groups of training samples were preprocessed.

Gas drainage borehole leakage data are multidimensional and multivariate35,36, with complex correlations. MDF theory was used to preprocess the data37. The Newton interpolation method was used to eliminate abnormal values and fill the missing values of the training sample data of the gas leakage borehole38 according to Eqs. (18) and (19):

$$\begin{aligned} & f\left( {x_{n} ,x_{n - 1} , \ldots ,x_{1} ,x} \right) \\ & \quad = \frac{{f\left[ {x_{n - 1} , \ldots ,x_{1} ,x} \right] - f\left[ {x_{n} ,x_{n - 1} , \ldots ,x_{1} ,x} \right]}}{{x - x_{n} }} \\ \end{aligned}$$
(18)
$$\begin{aligned} f\left( x \right) & = f\left( {x_{1} } \right) + \left( {x - x_{1} } \right)f\left[ {x_{2} ,x_{1} } \right] \\ & \quad + \left( {x - x_{1} } \right)\left( {x - x_{2} } \right)f\left[ {x_{3} ,x_{2} ,x_{1} } \right] \\ & \quad + \left( {x - x_{1} } \right)\left( {x - x_{2} } \right) \ldots \left( {x - x_{n} } \right)f\left[ {x_{n} ,x_{n - 1} , \ldots ,x_{1} } \right] \\ \end{aligned}$$
(19)

The missing value corresponding to the x-sequence value was substituted into the calculated value \(f\left( x \right)\) to eliminate some abnormal values affecting the overall analysis, fill missing values caused by sensor problems, human operation and other factors, and provide perfect and accurate data for the identification air leaks in boreholes. The complete sample data are shown in Table 2.

Table 2 Multisource data table for gas drainage boreholes.

Table 2 shows 18 groups of drilling test sample data, of which 15 groups correspond to boreholes with leaks and 3 groups correspond to boreholes without leaks. Because the gas concentrations in boreholes 16, 17, and 18 show little change at 0 m, 2 m, 6 m, 9 m, and 12 m, and the proportion is more than 90% according to model Eq. (16), the drainage effect was good, and there is no need to identify the type of leak. In addition, Table 2 shows that due to the air leakage of boreholes, the concentration decreased greatly from the bottom to the orifice in the boreholes in groups 1–15.

The leakage data of the first 15 groups of gas extraction boreholes were standardized by Eqs. (2)–(4), as shown in Table 3. The original data were compared with the box diagram of the standardized data. (Box plots can also be used to detect outliers.) Table 2 and Fig. 6a show that the extraction flow rates in the 15 groups of training samples were very similar, approximately 2.0 m3/min, which was relatively low. From the negative pressure of extraction to the negative pressure of the orifice, the pressure loss was obvious. Due to the air leakage in the gas drainage borehole, the differences in gas concentrations between samples in each group at 0 m, 2 m, 6 m, 9 m and 12 m were large, and the distribution of gas concentration in the borehole was not uniform. The specific positions of the different types of air leaks in the gas drainage borehole differed. Figure 6b shows that the range and distribution trend of the standardized data were consistent with the original data. After data standardization, the range of the original data was reduced to [0, 1], and the influence of each data dimension was eliminated, thereby optimizing the data for subsequent PCA to determine the new index for the identification of leaks.

Table 3 Standardized data for gas extraction boreholes.
Figure 6
figure 6

Data comparison.

New indexes of the model for the identification of leaks

In this study, PCA was used to linearly combine several representative new indexes for identification of leaks. The correlation between the original feature information should be considered to determine whether the PCA is applicable39,40. The Kaiser–Meyer–Olkin (KMO) and Butterley sphericity tests were applied in SPSS, as shown in Table 4.

Table 4 KMO and Bartlett tests.

As shown in Table 4, the value of Bartlett’s test statistic was 65.343, and the significance level was approximately 0, which was less than the statistical significance level (a = 0.05) specified by SPSS. Thus, the original hypothesis was rejected. That is, the variables in the original data had a statistically significant influence, and the KMO test value was greater than 0.5, which indicated that the air leakage data of the gas drainage borehole were suitable for PCA.

According to the standardized data for air leakage of the gas drainage borehole in Table 3, the correlation coefficient matrix of air leakage characteristics was calculated, as shown in Table 5. The closer the correlation coefficient is to 1, the greater the degree of correlation of the corresponding two groups of characteristics; e.g., the correlation coefficient of gas concentrations at 6 m and 9 m is 0.7447, which indicates a strong correlation. The closer the correlation coefficient is to 0, the smaller the degree of correlation of the corresponding two groups of characteristics; e.g., the correlation coefficient of the gas concentrations at 2 m and 12 m is 0.0593, which indicates a weak correlation. A negative correlation coefficient indicates that the two groups are inversely correlated. For example, the correlation coefficient between the gas concentrations at 9 m and 12 m is − 0.1529.

Table 5 Correlation coefficient matrix for air leakage characteristics of gas drainage boreholes.

As shown in Table 5, some of the 8 selected gas drainage borehole leakage characteristics are strongly correlated. Using these 8 kinds of characteristic data to identify gas drainage borehole leakage will lead to an incorrect decision, thus affecting the accuracy of the identification model. Therefore, it is necessary to analyse the training sample data via PCA to obtain the eigenvalue and contribution of each feature and select the appropriate principal component to eliminate the strong correlations from the feature data.

The eigenvalues and cumulative contribution rates of the eight types of feature information were obtained through calculations and analysis, as shown in Table 6. The characteristic value of A1: extraction flow was the largest, and the contribution of its variance contribution was also the largest. The characteristic value and contributions of A2–A8 decreased in turn, and the contributions were small; the 6th–8th principal component, A6–A8, were ignored. The cumulative contribution rate of extraction flow and gas concentrations at 0 m, 2 m, 6 m and 9 m reached 95.54%. According to Eq. (9), the contributions of these variables were more than 85%, and they were preliminarily considered as the main identification indexes of the improved naive Bayesian extraction leakage identification model. In a practical sense, the flow rate and concentration are the main variables of gas extraction in the borehole. The concentrations at 0 m, 2 m, 6 m and 9 m can reflect concentration changes in the borehole. The negative pressure has a linear relationship with the concentration and flow rate, and a change in negative pressure affects the concentration and flow rate; thus, it is advisable to select 5 principal components.

Table 6 Eigenvalues of correlation coefficients.

To further confirm the rationality of this selection, a scree plot was used. A scree plot is a trend map that reflects changes in data characteristics. The steepness of the decrease in eigenvalues shows whether the selected features are correct and reasonable. The scree plot in Fig. 7 shows that the slope k1 of principal components A1–A5 is − 0.63645, and the trend is steep. The slope k2 of principal components A5–A8 is − 0.11931, and the trend is relatively flat. Principal component A5 is an inflection point, and thus, it is reasonable to select these five variables as the principal components.

Figure 7
figure 7

Principal component scree plot.

The original A1–A8 feature information with strong correlations was reconstructed into the selected principal component features Y1–Y5, and the component analysis matrix table of Y1–Y5 was established according to the PCA (Table 7) to establish the new feature information index of gas drainage borehole leakage.

Table 7 Component analysis matrix.

Among the new indexes Y1–Y5, the higher the load coefficient corresponding to the original feature is, the closer the relationship between the feature information and the new indicator, which is the main influence quantity in the new index. According to the component analysis matrix in Table 7, the new index coefficient expression of gas drainage borehole leakage is:

$$\begin{aligned} Y_{1} & = 0.15881A_{1} + 0.44685A_{2} + 0.34194A_{3} + 0.30134A_{4} + 0.37329A_{5} + 0.35162A_{6} + 0.50706A_{7} + 0.21746A_{8} \\ Y_{2} & = 0.03039A_{1} - 0.39790A_{2} + 0.33001A_{3} + 0.46679A_{4} + 0.42654A_{5} - 0.46037A_{6} - 0.25446A_{7} + 0.23521A_{8} \\ Y_{3} & = 0.57660A_{1} - 0.10068A_{2} - 0.34096A_{3} - 0.30251A_{4} + 0.04862A_{5} - 0.06492A_{6} - 0.04217A_{7} + 0.66429A_{8} \\ Y_{4} & = 0.76498A_{1} - 0.06039A_{2} - 0.32301A_{3} - 0.08075A_{4} - 0.19592A_{5} - 0.15732A_{6} - 0.23946A_{7} - 0.42281A_{8} \\ Y_{5} & = - 0.20849A_{1} + 0.00492A_{2} + 0.62830A_{3} - 0.21162A_{4} - 0.43937A_{5} + 0.19106A_{6} - 0.24950A_{7} + 0.47451A_{8} \\ \end{aligned}$$

According to the expression for the characteristic information and the PCA matrix (Table 7), the gas concentrations A2–A6 in the first principal component Y1 at different depths and the load coefficient of negative pressure at orifice A7 were greater than those of the other indexes, and this was the main characteristic influence of the first principal component index Y1. Therefore, the principal component index of Y1 was interpreted as the influencing factor of negative pressure-concentration hole leakage. The load coefficient of A3–A5 in the second principal component index Y2 was higher, so the second principal component index Y2 was interpreted as the influencing factor of hole depth-concentration borehole leakage. By analogy, Y3 was the influencing factor of negative pressure-flow borehole leakage, Y4 is the influencing factor of single flow borehole leakage, and Y5 was the negative pressure-hole gas concentration borehole leakage factor.

Analysis of test sample results

Y1, Y2, Y3, Y4 and Y5 were used as the new identification indexes of the improved naive Bayesian extraction borehole leakage model, and the prior probability under the new index identification was calculated by Eqs. (12)–(13). According to the numerical relationship between the score of the new index and the leakage type of the above 15 groups of samples, the training was performed by MATLAB platform programming, as shown in Table 8. The final score of each index was taken as the training sample of the improved NBC.

Table 8 Improved naive Bayesian training samples.

To verify the accuracy and reliability of the model, 12 sets of gas drainage borehole data corresponding to three different types of leaks in the Xiashijie Coal Mine were collected as test samples, as shown in Table 9. A total of 240 data points from 30 groups of training samples and test samples were divided into a verification set and a test set according to a ratio of 0.6, which prevented a poor model identification rate and overfitting caused by a verification set that was too large as well as inaccurate model verification caused by a test set sample that was too small. In this study, we adopted the hold-out verification method, namely, the twofold cross-validation method. The schematic diagram of k-fold cross-validation is shown in Fig. 841,42. The data set was divided into a training set and a test set for verification, and the average and accuracy of the final verification results were calculated.

Table 9 Test sample data.
Figure 8
figure 8

k-fold cross-validation schematic diagram.

We classified the 12 test sample data using single NBC identification and an improved naive Bayes extraction borehole identification model. The results of the analysis are shown in Table 10. Single NBC identification identified 3 boreholes with type I leaks, 5 boreholes with type II leaks, and 2 boreholes with type III leaks, but it could not identify boreholes without leaks. Improved naive Bayesian identification successfully identified 2 boreholes without leaks, 3 boreholes with type I leaks, 4 boreholes with type II leaks, and 3 boreholes with type III leaks.

Table 10 Improved naive Bayesian identification results.

Table 10 indicates that the single NBC identified the leakage of the No. 8 gas drainage borehole factors, which resulted in errors; this approach could not identify whether the gas drainage borehole was leaking. The recall rate was 75%, and the training time of the identification was 0.0045 s. The identification and analysis recall rate of the improved model was type II, and its real type was type III. This was because the mutual influence between the original eight characteristic Bayesian air leakage identification models of gas extraction boreholes was 98.9%, the identification accuracy improved by 31.8%, and the training time decreased to 0.0020 s, which was an improvement of 55%. To further analyse the error rate, a confusion matrix comparison diagram was drawn43. It showed that the improved naive Bayesian gas extraction borehole leakage identification model could fully identify the type of borehole leakage in the Xiashijie coal mine, the identification accuracy was high, and the identification rate was fast.

As shown in Fig. 9, the single naive Bayesian identification failed to identify boreholes without leaks due to the inability to calculate eigenvalues. This resulted in the effective identification of only 10 out of 12 groups, and type II boreholes were mistakenly identified as type III boreholes. The improved naive Bayes model accurately identified 12 groups of boreholes, and the true value of the improved naive Bayesian model was consistent with the predicted value. Thus, the identification analysis showed that the improved naive Bayesian gas drainage borehole leakage identification model was superior to the single NBC identification analysis. Depending on its superiority, it could more accurately identify the type of air leak and provide further guidance for borehole sealing and repair to improve the efficiency of gas extraction and prevent gas disasters.

Figure 9
figure 9

Comparison diagram of the confusion matrix.

Conclusions

  1. 1)

    Through multisource data fusion theory (MDF) and principal component analysis (PCA), the traditional naive Bayes method was improved, and an enhanced naive Bayes air leakage identification model of gas drainage boreholes was constructed. The new model overcame the shortcomings of the naive Bayes method that could not accommodate missing and nonstandard data and eliminated the misevaluation caused by the superposition of a large amount of feature information in the process of identification of leaks in gas drainage boreholes.

  2. 2)

    The model was applied to the 229 working face of the Xiashijie Coal Mine. Combined with 8 types of characteristic information of gas drainage boreholes. Thirty groups of 240 gas drainage borehole data were divided into training samples and test samples for analysis, and 12 groups of gas drainage borehole test sample data were successfully identified, including 2 boreholes without leaks, 3 boreholes with type I leaks, 4 boreholes with type II leaks, and 3 boreholes with type III leaks, which were consistent with the conditions of the actual gas drainage boreholes. Thus, this study provides a basis for improving gas drainage efficiency and ensuring safe mining in the Xiashijie Coal Mine.

  3. 3)

    The feasibility of the model was verified by the hold-out method. The recall rate of model identification analysis was 98.9%, and the running time was 0.0020 s. Compared with the single naive Bayes method, the operation rate increased by 55%, and the identification accuracy increased by 31.8%. The improved model filled the gap related to the determination and identification of leaks in boreholes and provides a theoretical basis for the evaluation of the quality of sealing and borehole repairs.