Introduction

The pH of a solution is a key parameter for optimizing conditions and for quality control in industrial, biological, chemical, and environmental science, in both outdoor and indoor applications1,2. The hydrogen ion concentration [H+] is expressed on the pH scale from 0 to 14, and most detection methods are complicated, expensive, and time-consuming, such as microelectrodes3, acid-base indicators4, potentiometric titration1, and colorimetric and fluorescence probes5,6,7,8,9,10. Currently, potentiometric measurement is the most widely used technique for pH detection: the pH of the solution is calculated from the voltage difference measured between the electrodes of the potentiometric device11. Despite the high accuracy of conventional potentiometric devices, their operation and calibration are complicated and costly, which makes them impractical for routine indoor or outdoor use. Easy and accessible pH strips are used as an alternative for visual pH detection, but the strips give less precise results.

On the other hand, machine learning (ML), a branch of artificial intelligence (AI), gives algorithms the ability to predict new values from training data derived from experiments. Numerous regression and classification algorithms are available, each relying on its own hyperparameters and mechanisms to achieve high performance12. ML is being used in chemistry in areas such as chemical discovery13, molecular representations14, synthetic chemistry15, materials chemistry16, aquatic chemistry research17,18, and water pollution19.

Here, ML was used to improve the precision of common pH strip paper. ML models were trained on 2689 experimental data points covering the whole pH range. Further, we developed a mobile/web application based on ML algorithms to predict pH values. The application runs on mobile phones, so it can serve as a portable tool for anyone (whether a chemist or not), with no additional cost and a fast response, and it is applicable to different applications.

Materials and apparatus

Acetic acid, phosphoric acid, boric acid, HCl, and NaOH (Sigma Aldrich) were used without further purification. Universal indicator pH paper (1–14), Q/3211821AB001-2002 (China), was used. The pH measurements were carried out on a 3520 pH Meter (JENWAY, England). A Mastech MS6612 digital luxmeter (peak range 200,000 Lux) was used to measure the light intensity in the experimental workplace.

pH buffer solution

Universal Britton–Robinson (B–R) buffer was prepared as reported20. Briefly, the stock aqueous B–R buffer solution (pH = 2.86) was prepared by mixing acetic acid, phosphoric acid, and boric acid, each at 0.02 mol/L, in an equal molar ratio (1:1:1). The pH was then adjusted in intervals of 0.10 by dropwise addition of 0.20 mol/L NaOH or 0.20 mol/L HCl to cover the whole pH range.

Machine learning algorithms

Regression is a technique for predicting continuous pH values and for quantifying the relationship between the actual and predicted pH values. Eleven supervised machine learning regression models were applied to the collected data in order to choose the model that best fits the problem: Linear Regression (LR), Decision Tree Regressor (DTR), Random Forest Regressor (RTR), K Neighbors Regressor (KNNR), Support Vector Regression (SVR), Lasso Regression (L1), Ridge Regression (L2), Elastic Net Regressor (ENR), AdaBoost Regression (ABR), Gradient Boosting Regressor (GBR), and Artificial Neural Network Regressor (ANNR). All models are available in scikit-learn21. In addition, the exploratory data analysis plots and heatmap figures were created in Python using the seaborn package22.
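As a minimal sketch of how the eleven models listed above could be instantiated in scikit-learn, the snippet below maps each abbreviation from the text to a library class. The helper name `build_models` and every hyperparameter shown (`n_neighbors`, `max_iter`, `random_state`) are illustrative assumptions, not the settings used in the study.

```python
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor


def build_models(random_state=0):
    """Return one untrained regressor per abbreviation used in the text.

    Hyperparameters are library defaults unless stated; they are
    placeholders for illustration, not the study's tuned values.
    """
    return {
        "LR": LinearRegression(),
        "DTR": DecisionTreeRegressor(random_state=random_state),
        "RTR": RandomForestRegressor(random_state=random_state),
        "KNNR": KNeighborsRegressor(n_neighbors=5),
        "SVR": SVR(),
        "L1": Lasso(),
        "L2": Ridge(),
        "ENR": ElasticNet(),
        "ABR": AdaBoostRegressor(random_state=random_state),
        "GBR": GradientBoostingRegressor(random_state=random_state),
        "ANNR": MLPRegressor(max_iter=1000, random_state=random_state),
    }


models = build_models()
print(sorted(models))
```

Keeping the models in one dictionary makes it straightforward to fit and score all of them in a single loop over the same training and testing split.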

Metrics for regression

Several metrics were used to evaluate the regression models: the coefficient of determination (R2), Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Square Error (RMSE). They can be calculated with the scikit-learn metrics module according to Eqs. (1)–(4)23,24.

$${R}^{2}=1-\frac{{\sum }_{i} {\left({y}_{i}-{\hat{y}}_{i}\right)}^{2}}{{\sum }_{i} {\left({y}_{i}-{\overline{y}}\right)}^{2}}$$
(1)
$$MSE=\frac{1}{N}{\sum }_{i=1}^{N} {\left({y}_{i}-{\hat{y}}_{i}\right)}^{2}$$
(2)
$$MAE=\frac{1}{N}{\sum }_{i=1}^{N} \left|{y}_{i}-{\hat{y}}_{i}\right|$$
(3)
$$RMSE=\sqrt{\frac{{\sum }_{i=1}^{N} {\left({y}_{i}-{\hat{y}}_{i}\right)}^{2}}{N}}$$
(4)

where N is the number of recorded samples, \({y}_{i}\) is the actual pH value, \(\hat{y}_{i}\) is the predicted pH value, and \(\overline{y}\) is the mean of the actual pH values.
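The four metrics can be checked by hand with a few lines of plain Python. The sketch below is a direct transcription of Eqs. (1)–(4), where `y_true` holds measured pH values and `y_pred` holds model outputs; the function name and the four sample values are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R2, MSE, MAE, and RMSE for paired measured/predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares (numerator and denominator of R2)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {
        "R2": 1 - ss_res / ss_tot,
        "MSE": mse,
        "MAE": mae,
        "RMSE": math.sqrt(mse),
    }


# Illustrative values only (not data from the study)
measured = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
metrics = regression_metrics(measured, predicted)
print(metrics)
```

For these values every residual is ±0.5, so MAE and RMSE coincide at 0.5 and MSE is 0.25; the two error measures diverge once residuals differ in size, because RMSE weights large errors more heavily.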

Automated color-information-extraction from the captured images

To extract the RGB color code from each image, we used Python 3.7 code based on the OpenCV package25. We noted a small deviation in the RGB values at different positions within one image. Thus, the RGB values were read at seven distinct (X, Y) positions (10,10; 15,15; 20,20; 25,25; 30,30; 35,35; 40,40) to cover the whole image, as illustrated in Fig. 1.
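The sampling step can be sketched as follows. This is not the authors' script: it assumes the image has already been loaded as a NumPy array in OpenCV's BGR channel order (as `cv2.imread` returns), and it uses a synthetic uniformly colored patch in place of a real capture so that the example runs without an image file. The names `SAMPLE_POINTS` and `sample_rgb` are hypothetical.

```python
import numpy as np

# The seven (x, y) pixel positions given in the text
SAMPLE_POINTS = [(10, 10), (15, 15), (20, 20), (25, 25),
                 (30, 30), (35, 35), (40, 40)]


def sample_rgb(image_bgr, points=SAMPLE_POINTS):
    """Return one (R, G, B) tuple per (x, y) point from a BGR image array."""
    samples = []
    for x, y in points:
        # NumPy images are indexed [row, column], i.e. [y, x]
        b, g, r = image_bgr[y, x]
        samples.append((int(r), int(g), int(b)))
    return samples


# Synthetic stand-in for a loaded capture: a 50x50 patch of one color.
# With OpenCV installed, a real image would come from cv2.imread(path).
demo = np.zeros((50, 50, 3), dtype=np.uint8)
demo[:, :] = (30, 120, 200)  # stored as B, G, R

rgb_rows = sample_rgb(demo)
print(rgb_rows[0])
```

Each capture then contributes seven RGB rows to the data set, which is how small color inhomogeneities across the strip end up represented in the training data.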

Figure 1

Pixel positions of the pH paper image (at pH value = 4, as an example).

pH discrimination with a machine learning model

We used a KNN regression model to study the 2689 collected samples using Python 3.7 and the scikit-learn package26,27. We randomly split the data into training data (70%, i.e., 1880 samples) and testing data (30%, i.e., 808 samples). The testing data were completely excluded from the model training phase. In machine learning, hyperparameters are parameters that the user sets explicitly to steer the learning process and improve the model. We therefore trained our models with a series of integer values of K [1,2,3,….], and the optimal hyperparameter (highest coefficient of determination and lowest errors) was found at K = 5.
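The split-and-tune procedure can be sketched with scikit-learn as below. Everything about the data here is an assumption: the RGB values are synthetic (simple linear trends plus noise), and the candidate range K = 1 to 10 is arbitrary. The text does not say how the candidate K values were compared, so this sketch selects K by 5-fold cross-validation on the training split only and touches the test split once at the end.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the RGB data set (not the study's data)
ph = rng.uniform(0, 14, size=500)
red = 255 - 12 * ph + rng.normal(0, 3, size=500)
green = 60 + 8 * ph + rng.normal(0, 3, size=500)
blue = 40 + 5 * ph + rng.normal(0, 3, size=500)
X = np.column_stack([red, green, blue])

# 70/30 random split; the test part is held out until the final check
X_train, X_test, y_train, y_test = train_test_split(
    X, ph, test_size=0.30, random_state=0
)

# Score each candidate K by mean cross-validated R2 on the training data
cv_scores = {}
for k in range(1, 11):
    model = KNeighborsRegressor(n_neighbors=k)
    cv_scores[k] = cross_val_score(
        model, X_train, y_train, cv=5, scoring="r2"
    ).mean()
best_k = max(cv_scores, key=cv_scores.get)

# Refit with the selected K and evaluate once on the held-out test data
final_model = KNeighborsRegressor(n_neighbors=best_k).fit(X_train, y_train)
test_r2 = r2_score(y_test, final_model.predict(X_test))
print(f"best K = {best_k}, test R2 = {test_r2:.3f}")
```

Choosing K on the training data alone keeps the test score an honest estimate of performance on unseen captures.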

Results and discussion

Figure 2 presents a collection of 130 captures of the experimental color change of the pH paper (at 350 Lux) over the range 0–14 at intervals of ~0.1 pH units. It is worth mentioning that traditional estimation based on the color change of pH paper is accompanied by a large variation in pH value (~2), which leads to substantial misestimation when the strip is read by eye. This finding encouraged us to develop a new, simple, and more precise method for pH detection. The experiments were therefore extended to three illumination levels that a user might work in: 350, 200, and 20 Lux. Moreover, the homogeneity of the pH paper color was checked by collecting the RGB color code at seven distinct positions per capture. In total, the data set includes 2689 experimental RGB values from the different illumination conditions.

Figure 2

Samples of the captures of the experimental color change of the pH paper at 350 Lux.

To better understand the results observed in the different workplaces, Exploratory Data Analysis (EDA) of the RGB color code against pH value at light intensities of 20, 200, and 350 Lux is illustrated in Fig. 3.

Figure 3

Exploratory Data Analysis of the RGB code change in differently illuminated workplaces (20, 200, and 350 Lux).

The color code points fall into three regions across the wide pH range. The significant changes in the Red and Green color codes, and even in Blue, occurred in the pH range 2.5–9 at all three investigated light intensities (20, 200, and 350 Lux). It is worth mentioning that the Blue color code at the low light intensity of 20 Lux (a rather dark workplace) deviates from the values obtained at medium or high light intensity, which suggests avoiding testing in low-light conditions. In contrast, the results revealed no significant difference in the behavior of the Red or Green colors across light intensities. At higher basicity (pH > 9) or higher acidity (pH < 2.5), the color responds less distinctly, which may produce less accurate predictions in those parts of the pH range. This finding may encourage the scientific community to prepare more sensitive materials that work in strongly acidic and/or strongly basic media.

Furthermore, it is critical to recognize and evaluate how strongly each parameter depends on the others. This knowledge helps define what to expect from these interdependencies and can lead to more effective pH devices and color-sensitive materials. For this reason, as part of the machine learning workflow, Pearson's correlation coefficients (\({r}_{xy}\)) between the pH parameters were investigated based on Eqs. (5) and (6):

$${\mathrm{cov}}_{x,y}=\frac{\sum \left({x}_{i}-{\overline{x}}\right)\left({y}_{i}-{\overline{y}}\right)}{N-1}$$
(5)
$${r}_{xy}=\frac{{\sum }_{i=1}^{N} \left({x}_{i}-{\overline{x}}\right)\left({y}_{i}-{\overline{y}}\right)}{\sqrt{{\sum }_{i=1}^{N} {\left({x}_{i}-{\overline{x}}\right)}^{2}}\sqrt{{\sum }_{i=1}^{N} {\left({y}_{i}-{\overline{y}}\right)}^{2}}}$$
(6)

where N is the number of recorded samples, \({x}_{i}\) and \({y}_{i}\) are the individual elements of the RGB and pH values, respectively, and \(\overline{x}\) and \(\overline{y}\) are their mean values.
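Eqs. (5) and (6) translate directly into a few lines of plain Python. The sketch below is for illustration only: the function names are hypothetical and the five red-channel and pH values are made up, chosen so that red falls as pH rises.

```python
import math


def sample_covariance(x, y):
    """Eq. (5): sample covariance with the N - 1 denominator."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def pearson_r(x, y):
    """Eq. (6): Pearson correlation coefficient between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = (
        math.sqrt(sum((a - mean_x) ** 2 for a in x))
        * math.sqrt(sum((b - mean_y) ** 2 for b in y))
    )
    return numerator / denominator


# Illustrative values only (not data from the study)
red = [250, 220, 180, 120, 60]
ph = [1.0, 3.0, 6.0, 9.0, 12.0]
r_red_ph = pearson_r(red, ph)
print(f"r(red, pH) = {r_red_ph:.3f}")
```

A coefficient near −1 or +1 indicates a nearly linear relationship, while a value near 0 indicates that the channel carries little linear information about pH.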

The correlations between the pH parameters are presented as a heatmap in Fig. 4. The results show a strong negative correlation between pH and the Red color (−0.77) and a weaker negative correlation with the Green color (−0.38). The Blue color showed a very low correlation with pH (0.044) compared with Red and Green, which indicates that Blue will have only a small effect on the machine learning prediction. Likewise, workplace illumination showed no significant correlation with pH value (−0.03), suggesting that the colored pH paper can be captured at any of the tested light intensities.

Figure 4

Pearson’s correlation coefficients between the pH parameters.

ML model prediction

Using the experimental data, a preliminary analysis of machine learning regression techniques was performed with optimal hyperparameters for the K-Nearest Neighbors (KNN), Linear, Lasso, Elastic Net, AdaBoost, Neural Network, Random Forest, Support Vector Machine (SVM), and Gradient Boosting Regressor algorithms28,29,30. The coefficient of determination (R2) and the corresponding error metrics, namely root mean squared error (RMSE), mean absolute error (MAE), and mean squared error (MSE), are shown in Fig. 5 and recorded in Table 1.

Figure 5

Output results of the regression algorithms: Linear, Ridge, Lasso, Elastic Net, Polynomial, Support Vector Machine (SVM) Regressor, Gradient Boosting, AdaBoost, and Random Forest Regressor.

Table 1 Regression evaluation metrics of performed algorithms on the experimental data (cross-validation with K-fold = 5).

Clearly, the KNN model with the optimal hyperparameter of five neighbors achieves a high R2 (0.993) combined with the lowest MSE, RMSE, and MAE errors (0.012, 0.320, and 0.182, respectively) compared with the other models. In addition, the coefficient of variation of the root mean square error (CVRMSE) of the KNN model, 4.077, indicates more stable performance than the other models. Further, cross-validation with K-fold values of 3, 5, 10, and 20 was tested to confirm the stability of the models. No significant difference was found between the results, which supports the choice of the KNN model.
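The text does not spell out how CVRMSE is computed. A common definition, assumed here, normalizes the RMSE by the mean of the measured values and expresses it as a percentage. The sketch below uses made-up values and a hypothetical function name.

```python
import math


def cvrmse(y_true, y_pred):
    """RMSE divided by the mean measured value, in percent (assumed definition)."""
    n = len(y_true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return 100.0 * rmse / (sum(y_true) / n)


# Illustrative values only (not data from the study)
measured = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
cv_percent = cvrmse(measured, predicted)
print(f"CVRMSE = {cv_percent:.1f}%")
```

Because the error is scaled by the mean measured value, CVRMSE allows comparison across data sets with different ranges; lower values indicate a tighter fit relative to the typical magnitude of the target.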

To deepen understanding, the model predictions (based on the test data) are plotted against the experimentally obtained pH values in the scatter plots in Fig. 6. The linear regression, elastic net, and neural network algorithms could not reproduce all experimental points, especially in the strongly acidic and strongly basic pH ranges. In contrast, the predictions of the KNN, gradient boosting, random forest, and AdaBoost algorithms lie close to the diagonal line, as expected for precise estimates, so these algorithms could be selected for deploying the code. Although all four perform well with exceedingly small deviations, KNN was chosen for the machine learning mobile application because it has the lowest error (RMSE; 0.32) and higher stability (CVRMSE; 4.08).

Figure 6

The model's prediction results (based on the test data) vs. experimental results.

It is now clear that the KNN model successfully captures the underlying patterns relating the RGB color code to the pH value in the experimental data. Thus, the machine learning approach based on this model was expanded into a versatile platform able to predict the pH value from common pH paper with high accuracy. The online mobile application of the prediction model was developed using Python and Streamlit Cloud (freely available) and allows accurate determination of the pH value as a function of the RGB color code of common pH paper.

As illustrated in Fig. 7, the mobile application involves three steps. The first is the input, in which the user supplies a capture of the pH paper (taken immediately after immersion in the target solution). For convenience, we coded three input options (upload a picture, use the mobile camera, or enter an RGB color code). This step is followed by a built-in machine learning process (without control from the user). Finally, the output pH value appears on the screen.

Figure 7

Schematic process of pH detection using ML.

Our approach has a significant advantage over the methods already in use; Fig. 8 compares pH instruments, pH paper, and the current study.

Figure 8

Comparison of pH instrument, pH paper, and the current study.

Furthermore, Fig. 9 shows the pH value estimated by the proposed mobile application (output results) in comparison with the real pH value. Interestingly, the agreement between real and estimated values over the whole pH range (acid or base) reflects the high accuracy of the ML model used.

Figure 9

Estimated pH value from mobile application in comparison with real one.

Solmaz et al.31 also studied colorimetric detection with pH strips using ML; the comparison with the current work is presented in Table 2.

Table 2 The advantages of the current work over related studies.

In addition, four different types of smartphones were used to check the accuracy of the pH predictions for three buffer solutions (pH = 3, 7, and 10). The default settings were used to avoid any smartphone effects. As shown in Fig. 10 and Table 3, the pH estimates from the various smartphones do not differ significantly, with an accuracy of more than 90% for each type.

Figure 10

Estimated pH value from different smartphones.

Table 3 Accuracy and estimated pH value from different smartphones.

Furthermore, Table 4 shows recommended conditions and limitations for using the application to achieve more accurate predictions.

Table 4 Recommended measurement conditions for application users.

Overall, the present findings address the problem of pH accuracy with common pH paper without the need for additional costly and time-consuming experimental work. Moreover, our approach avoids the excessive cost and maintenance required for traditional pH meters.

Conclusion

The findings demonstrate a strong negative correlation between pH values and the red color (−0.77) and a weaker negative correlation with the green color (−0.38). The blue color showed a low correlation with pH (0.044) and will therefore have an insignificant impact on the machine learning prediction. The KNN model exhibits a high R2 (0.993) along with the lowest MSE, RMSE, and MAE errors (0.012, 0.320, and 0.182, respectively). This paper also demonstrated the potential of the ML approach to estimate the pH value of solutions using common pH paper. We developed a freely available application, supported on mobile devices, that predicts the pH value from common pH paper using ML with precise results. Future research should consider the preparation of new optical materials with highly sensitive color changes in strongly acidic and strongly basic media.