Introduction

Linear regression is widely applied across many fields. The aim of regression analysis is to investigate the relationship between two or more independent variables and a dependent variable. Consider a multiple regression model in matrix form:

$${\varvec{y}} = {\varvec{X}}{\varvec{\beta}} + {\varvec{\varepsilon}},$$
(1)

where \({\varvec{y}}\) is an \(n\times 1\) vector of response variables, \({\varvec{X}}\) is the design matrix of order \(n \times p\), \({\varvec{\beta}}\) is a \(p\times 1\) vector of unknown parameters and \({\varvec{\varepsilon}}\) is an \(n\times 1\) vector of independently and identically distributed errors. Ordinary Least Squares (OLS) is generally applied to estimate the unknown parameters in the regression model. According to1,2, the OLS estimator of \(\beta\) is obtained as follows:

By minimizing

$$\begin{aligned} & {\varvec{\varepsilon}}^{\prime}{\varvec{\varepsilon}} = \left( {\varvec{y}} - {\varvec{X}}{\varvec{\beta}} \right)^{\prime} \left( {\varvec{y}} - {\varvec{X}}{\varvec{\beta}} \right) = {\varvec{y}}^{\prime}{\varvec{y}} - 2{\varvec{\beta}}^{\prime}{\varvec{X}}^{\prime}{\varvec{y}} + {\varvec{\beta}}^{\prime}{\varvec{X}}^{\prime}{\varvec{X}}{\varvec{\beta}} \\ & \frac{\partial \left( {\varvec{\varepsilon}}^{\prime}{\varvec{\varepsilon}} \right)}{\partial {\varvec{\beta}}} = -2{\varvec{X}}^{\prime}{\varvec{y}} + 2{\varvec{X}}^{\prime}{\varvec{X}}{\varvec{\beta}} = 0 \\ & {\varvec{X}}^{\prime}{\varvec{X}}{\varvec{\beta}} = {\varvec{X}}^{\prime}{\varvec{y}} \\ & \hat{{\varvec{\beta}}} = \left( {\varvec{X}}^{\prime}{\varvec{X}} \right)^{-1} {\varvec{X}}^{\prime}{\varvec{y}}. \\ \end{aligned}$$
(2)
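As a concrete illustration, the closed-form OLS estimate in Eq. (2) can be computed directly. The sketch below (in Python with NumPy, an assumption about tooling) uses simulated data and illustrative variable names; it is not the estimation code used in the study.

```python
# Minimal sketch of the OLS estimator in Eq. (2): beta_hat = (X'X)^(-1) X'y.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix with intercept
beta_true = np.array([2.0, 1.5, -0.5, 0.8])
y = X @ beta_true + rng.normal(scale=0.5, size=n)                # response with i.i.d. errors

# Solve the normal equations X'X beta = X'y (more stable than forming an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```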

One problem that can occur is that the number of explanatory variables exceeds the number of observations. Another problem is the presence of outliers. Observations that deviate from the general shape or pattern of the distribution are called outliers3,4. Outliers are common in agricultural data because of natural variation, and they inflate the standard errors and error variance5,6. Outliers can arise from non-homogeneous observations and measurement errors7.

Robust regression is an effective method for analysing data contaminated by outliers8,9,10,11. It can identify outliers and still provide accurate results when they are present12. If the data contain outliers, heterogeneity will be present13,14.

Heterogeneity is the degree of variability within the data15. Traditional approaches to linear regression, such as OLS, fail when heterogeneity problems are present because the OLS estimator cannot be computed when the number of observations is less than the number of parameters (that is, when n < p), or the estimate becomes inefficient and unstable. Hence, it is crucial to account for the effects of heterogeneity16. Heterogeneity causes the standard errors to be biased and inconsistent. Feczko and Fair17 stated that the homogeneity assumption is a major challenge and leads to the heterogeneity problem. Feczko et al.18 emphasized supervised and unsupervised statistical approaches for studying heterogeneity problems. Assunção et al.19 noted that although heterogeneity is usually considered in regression analysis, missing data affect the ability to control for the significant variables when estimating the regression coefficients, which affects the stability of the inference. Gormley et al.20 argued that a vital problem in empirical financial research is whether and how to control for unobserved heterogeneity.

Sparse regression can be used to regularize the regression coefficients. This reduces sampling error and variance and mitigates multicollinearity. It is used in many scientific disciplines21 and is applied to increase predictive accuracy in high-dimensional settings22,23,24,25,26,27,28.

In this paper, a new regression method called the modified heterogeneity sparse regression model is proposed to tackle the problems of heterogeneity. The contributions of the paper are: (i) The modified heterogeneity sparse regression model is introduced. The heterogeneity parameters are identified using the idea of a variance inflation factor, the significant variables are selected, and the main effects of the parameters are included back in the model. (ii) Significant parameters are selected from the top 15, 25, 35, and 45 highest-ranking variables. (iii) The impact of heterogeneity is evaluated through validation metrics for the before-, after-, and modified-heterogeneity models. (iv) A hybrid model to reduce outliers is developed.

Figure 1 shows the flowchart of the methodologies used to achieve the objectives of the study. It shows the inclusion of all possible models up to second order and the testing of different assumptions. The 15, 25, 35 and 45 parameters are selected because feature selection only provides the rank of important variables and does not tell us the number of significant factors. The insignificant parameters are dropped, and the parameters that exhibit heterogeneity are later included in the modified model. Next, the validation metrics are computed using the mean absolute percentage error (MAPE), mean squared error (MSE) and coefficient of determination (R2). Then, hybrid models are developed for before, after, and modified heterogeneity using robust methods and machine learning models. The robust methods used are M Bi-Square, M Hampel, M Huber, MM and S. Finally, the validation metrics are computed using the 2-sigma limits to identify the number of outliers.

Figure 1

Flowchart of the methodology.

All possible models and the selected model using 15, 25, 35 and 45 high-ranking variables

In this study, data from 1914 observations collected in Sabah, Malaysia are used to investigate the impact of 29 different variables on one dependent variable. With 29 variables there are \(2^{29}=536{,}870{,}912\) possible equations, and studying all 536,870,912 of them would be impractical. Therefore, the models are restricted to main effects and second-order interactions. With this restriction, the study comprises the effects of 435 distinct main-effect and first-order interaction independent variables on one response variable (the moisture content), which leads to big data. The concept of big data has led to advancements in data science using machine learning29. The total number of models can be computed by Eq. (3).

$$M = \sum_{j = 1}^{f} j\,\binom{f}{j}$$
(3)

where \(M\) represents the total number of possible models, \(f\) represents the total number of explanatory variables and \(j=1,2,3,\dots ,f\). The top 15, 25, 35 and 45 highest-ranking variables are selected because feature selection provides the ranks of important variables but not the number of significant factors30. Also, there is no rule for choosing the number of parameters to be incorporated in a prediction model31, and the algorithms can only give the ranks, not the number of significant parameters32. The effects of the interaction variables need to be considered because of their vital role in real data. Interactions are crucial for describing the relationships among the variables in the model, and they allow more hypotheses to be tested33,34. However, it is challenging to study the asymptotic properties and statistical inference of second-order terms because of their complex covariance structure35. A short worked example of Eq. (3) is given below.
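As a quick illustration of Eq. (3), the total number of possible models can be evaluated numerically. The Python sketch below is illustrative only; setting \(f = 29\) (the number of main drying parameters) is an assumption for the example.

```python
# Worked example of Eq. (3): M = sum over j of j * C(f, j).
from math import comb

f = 29  # number of explanatory variables (main drying parameters)
M = sum(j * comb(f, j) for j in range(1, f + 1))
print(M)  # total number of possible models when all subsets are enumerated
```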

The proposed method

The traditional methods normally used for modelling assume a homogeneous model, and in machine learning, data scientists often overlook heterogeneity. In this study, a method is proposed to identify heterogeneity parameters. Heterogeneity refers to the variability of observations. This variability leads to inconsistent estimates and distorts conclusions20. Consider the multiple linear regression model:

$$Y_{i} = \beta_{0} + \beta_{1} T_{i,1} + \beta_{2} T_{i,2} + \ldots + a_{j} + \varepsilon_{i} ,$$
(4)

where \({Y}_{i}\), \(i=1, 2,\dots ,n\) is the response value (moisture content) for the ith case, the \(\beta\)'s are the regression coefficients for the explanatory variables (drying parameters) \(T\)'s, \({a}_{j}\) denotes heterogeneity for \(j=1, 2,\dots ,f\), that is, the parameters that exhibit heterogeneity, and \(\varepsilon\) is the random error. In the equation above, if the estimates of the regression equation are computed and a crucial variable is omitted, then the estimate of \(\beta\) will be biased and inconsistent. It is also possible that some variables are correlated with the error term, which violates the assumptions of regression. The coefficient of determination can be written as \({R}^{2}=1-\frac{1}{VIF}\). If the \(R^{2}\) satisfies certain conditions, then the parameter is said to exhibit heterogeneity. Cheng et al.36 stated that the variance inflation factor in multiple regression is used to quantify the severity of multicollinearity. It can be computed from \({R}_{l}^{2}\), where \({R}_{l}^{2}\) for \(l=1,2,\dots ,p\) denotes the coefficient of determination between the \(l{\text{th}}\) variable \({x}_{l}\) in the predictor matrix and the remaining predictors.

Let

$$\begin{aligned} & {\varvec{X}}^{*} = \left[ {\begin{array}{*{20}c} 1 & {X_{11} } & \ldots & {X_{1,p - 1} } \\ 1 & {X_{21} } & \cdots & {X_{2,p - 1} } \\ \vdots & \vdots & \vdots & \vdots \\ 1 & {X_{n1} } & \cdots & {X_{n,p - 1} } \\ \end{array} } \right]. \\ & {\varvec{X}}^{*\prime } {\varvec{X}}^{*} = \left[ {\begin{array}{*{20}c} n & {0^{\prime}} \\ 0 & {r_{XX} } \\ \end{array} } \right], \\ \end{aligned}$$

So that \({r}_{XX}\) is the correlation matrix of the \({\varvec{X}}\) variables.

Since

$$\begin{aligned} \sigma^{2} \left\{ {\hat{\beta }} \right\} & = \sigma^{2} \left( {{\varvec{X}}^{*\prime } {\varvec{X}}^{*} } \right)^{ - 1} \\ & = \sigma^{2} \left[ {\begin{array}{*{20}c} \frac{1}{n} & {0^{\prime}} \\ 0 & {r_{XX}^{ - 1} } \\ \end{array} } \right]. \\ \end{aligned}$$

The \({VIF}_{l}\) for \(l=1,2,3,\dots ,p-1\) is the \(l{\text{th}}\) diagonal element of \({r}_{{\varvec{X}}{\varvec{X}}}^{-1}\). If the result is proved for \(l=1\), then the rows and columns of \({r}_{{\varvec{X}}{\varvec{X}}}\) can be permuted to obtain the result for the remaining \(l\).

Let

$${\varvec{X}}_{{\left( { - 1} \right)}} = \left[ {\begin{array}{*{20}c} {X_{12} } & \cdots & {X_{1,p - 1} } \\ {X_{22} } & \cdots & {X_{2,p - 1} } \\ \vdots & \vdots & \vdots \\ {X_{n2} } & \cdots & {X_{n,p - 1} } \\ \end{array} } \right],\quad X_{1} = \left[ {\begin{array}{*{20}c} {X_{11} } \\ {X_{21} } \\ \vdots \\ {X_{n1} } \\ \end{array} } \right].$$

By applying Schur’s complement,

$$\begin{aligned} r_{XX}^{-1}\left( 1,1 \right) & = \left( r_{11} - r_{1X_{\left( -1 \right)}} r_{X_{\left( -1 \right)} X_{\left( -1 \right)}}^{-1} r_{X_{\left( -1 \right)} 1} \right)^{-1} \\ & = \left( r_{11} - \left[ r_{1X_{\left( -1 \right)}} r_{X_{\left( -1 \right)} X_{\left( -1 \right)}}^{-1} \right] r_{X_{\left( -1 \right)} X_{\left( -1 \right)}} \left[ r_{X_{\left( -1 \right)} X_{\left( -1 \right)}}^{-1} r_{X_{\left( -1 \right)} 1} \right] \right)^{-1} \\ & = \left( 1 - \beta_{1X_{\left( -1 \right)}}^{\prime} X_{\left( -1 \right)}^{\prime} X_{\left( -1 \right)} \beta_{1X_{\left( -1 \right)}} \right)^{-1}, \\ \end{aligned}$$

where \({\beta }_{{1X}_{\left(-1\right)}}\) denotes the vector of regression coefficients from the regression of \({X}_{1}\) on \({X}_{2},\dots ,{X}_{p-1}\), excluding the intercept. For clarity, \({R}_{1}^{2}\) and \({VIF}_{1}\) can be written as:

$$R_{1}^{2} = \frac{SSR}{SSTO} = \frac{\beta_{1X_{\left( -1 \right)}}^{\prime} X_{\left( -1 \right)}^{\prime} X_{\left( -1 \right)} \beta_{1X_{\left( -1 \right)}}}{1} = \beta_{1X_{\left( -1 \right)}}^{\prime} X_{\left( -1 \right)}^{\prime} X_{\left( -1 \right)} \beta_{1X_{\left( -1 \right)}}$$

and \(VIF_{1} = r_{XX}^{ - 1} \left( {1,1} \right) = \frac{1}{{1 - R_{1}^{2} }}\).
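To make the relationship \(VIF_{l} = 1/(1 - R_{l}^{2})\) concrete, the sketch below computes \(R_{l}^{2}\) and \(VIF_{l}\) for each predictor by regressing it on the remaining predictors. This is a minimal Python illustration; the random data, variable names and induced collinearity are assumptions for the example, not the seaweed data.

```python
# Sketch: computing VIF_l and R_l^2 = 1 - 1/VIF_l for each predictor.
import numpy as np

def vif_and_r2(X):
    n, p = X.shape
    results = []
    for l in range(p):
        x_l = X[:, l]                                    # l-th predictor
        X_rest = np.delete(X, l, axis=1)                 # remaining predictors
        X_rest = np.column_stack([np.ones(n), X_rest])   # add intercept
        beta, *_ = np.linalg.lstsq(X_rest, x_l, rcond=None)
        fitted = X_rest @ beta
        ss_res = np.sum((x_l - fitted) ** 2)
        ss_tot = np.sum((x_l - x_l.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot                       # R_l^2
        vif = 1.0 / (1.0 - r2)                           # VIF_l = 1 / (1 - R_l^2)
        results.append((l, r2, vif))
    return results

# Example usage with random data (illustration only):
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 0] + 0.1 * rng.normal(size=100)           # induce collinearity
for l, r2, vif in vif_and_r2(X):
    print(f"variable {l}: R^2 = {r2:.3f}, VIF = {vif:.3f}")
```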

Ridge regression (RR)

In ridge regression, the ridge parameter plays a crucial role in parameter estimation. The ridge regression estimator is an alternative to the OLS estimator when there is multicollinearity37.

From Eq. (1), let \({y}_{i}\) be the response and \({x}_{i}={\left({x}_{i1},{x}_{i2},{x}_{i3},\dots ,{x}_{ip} \right)}^{T}\) the covariate vector for the ith case. Least squares is the most common estimation method, where the coefficients \(\beta ={\left({\beta }_{0},{\beta }_{1},{\beta }_{2},\dots ,{\beta }_{p} \right)}^{T}\) are chosen to minimize the residual sum of squares (SSR)38,39,40. The SSR is given as:

$$SSR = \mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - \beta_{0} - \mathop \sum \limits_{j = 1}^{p} \beta_{j} x_{ij} } \right)^{2}$$

The coefficients of the ridge regression estimate \(\hat{\beta }^{RR}\) minimize38,39,40,41,42:

$$\begin{aligned} L^{RR} \left( \beta \right) & = \mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - \beta_{0} - \mathop \sum \limits_{j = 1}^{p} \beta_{j} x_{ij} } \right)^{2} + \lambda \mathop \sum \limits_{j = 1}^{p} \beta_{j}^{2} \\ & = SSR + \lambda \mathop \sum \limits_{j = 1}^{p} \beta_{j}^{2} \\ \end{aligned}$$
(5)

where \(\lambda \ge 0\) is the regularization parameter controlling the amount of shrinkage. Ridge regression estimates coefficients that keep the SSR small and fit the data well, while the shrinkage penalty \(\lambda \sum_{j=1}^{p}{\beta }_{j}^{2}\) pulls the coefficients towards zero. Ridge regression has an advantage over least squares because of the bias-variance trade-off: the prediction error is reduced by shrinking the large regression coefficients.
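A small sketch of ridge estimation in Eq. (5) is given below using scikit-learn's `Ridge`; the tooling, data and parameter values are assumptions for illustration, not the implementation used in the study. In scikit-learn, `alpha` plays the role of \(\lambda\), and predictors are typically standardized before the penalty is applied.

```python
# Minimal ridge regression sketch: minimizing SSR + lambda * sum(beta_j^2).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))                        # illustrative predictors (not the seaweed data)
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_std = StandardScaler().fit_transform(X)             # penalty assumes comparable scales
ridge = Ridge(alpha=1.0)                              # alpha corresponds to lambda in Eq. (5)
ridge.fit(X_std, y)
print(ridge.intercept_, ridge.coef_)                  # the intercept is not penalized
```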

LASSO regression (LR)

The drawback of ridge regression is the inclusion of all p explanatory variables in the final model: the penalty \(\lambda \sum_{j=1}^{p}{\beta }_{j}^{2}\) in Eq. (5) shrinks all the coefficients towards zero, but not exactly to zero43. The LASSO is a relatively simple alternative to ridge regression that overcomes this drawback. The LASSO coefficients \({\widehat{\beta }}_{\lambda }^{LASSO}\) minimize the quantity \(L^{LR} \left( \beta \right) =\) \(\mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - \beta_{0} - \mathop \sum \limits_{j = 1}^{p} \beta_{j} x_{ij} } \right)^{2}\) \(+ \lambda \mathop \sum \limits_{j = 1}^{p} \left| {\beta_{j} } \right| = SSR + \lambda \mathop \sum \limits_{j = 1}^{p} \left| {\beta_{j} } \right|\)41,42,44,45. The LASSO utilizes an \({L}_{1}\) penalty instead of an \({L}_{2}\) penalty, which shrinks some coefficient estimates exactly to zero28.
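A corresponding LASSO sketch is shown below with scikit-learn's `Lasso`; again the tooling and data are assumptions for illustration. Note that scikit-learn's objective scales the squared-error term by \(1/(2n)\), so its `alpha` corresponds to \(\lambda\) only up to that factor.

```python
# Minimal LASSO sketch: the L1 penalty sets some coefficients exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))                        # illustrative predictors only
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_std = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1)                              # alpha plays the role of the L1 penalty weight
lasso.fit(X_std, y)
print(lasso.coef_)                                    # several entries are exactly zero (variable selection)
```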

Elastic net regression (ENR)

The ENR is an extension of the LASSO that is robust to high correlations among the explanatory variables; it was developed to address a weakness of the LASSO, whose variable selection can depend too heavily on the data and be unstable41,46,47. The ridge and LASSO penalties are combined to obtain the best of both. The aim is to minimize the following loss function38,39,40,41,42,44,45,48,49

$$L^{ENR} \left( \beta \right) = \frac{1}{2n}\mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - \beta_{0} - \mathop \sum \limits_{j = 1}^{p} \beta_{j} x_{ij} } \right)^{2} + \lambda \left( {\frac{1 - \alpha }{2}\mathop \sum \limits_{j = 1}^{p} \beta_{j}^{2} + \alpha \mathop \sum \limits_{j = 1}^{p} \left| {\beta_{j} } \right|} \right)$$

where \(\alpha \in [0,1]\) is the mixing parameter between the LASSO and ridge penalties.
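The loss above corresponds closely to scikit-learn's `ElasticNet`, where `l1_ratio` plays the role of \(\alpha\) and `alpha` plays the role of \(\lambda\); the sketch below is illustrative and assumes that tooling and simulated data.

```python
# Minimal elastic net sketch: a mix of L1 and L2 penalties.
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
X[:, 2] = X[:, 1] + 0.05 * rng.normal(size=200)       # correlated predictors, where ENR helps
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_std = StandardScaler().fit_transform(X)
enr = ElasticNet(alpha=0.1, l1_ratio=0.5)             # l1_ratio = alpha (mixing), alpha = lambda
enr.fit(X_std, y)
print(enr.coef_)
```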

Since the presence of outliers degrades the performance of ordinary least squares, robust regression methods that are not easily influenced by outliers have been developed50. These robust methods include MM estimation, M estimation, Least Trimmed Squares (LTS), S estimation, Least Median of Squares (LMS), and the Least Absolute Value method (LAV)51,52,53,54,55,56. This study focuses on M estimation, MM estimation and S estimation.

S estimation

S estimators were proposed by Rousseeuw and Yohai57. The S estimator is a regression estimator related to M-scales and is built on the residual scale of M estimation. It is used to overcome the weakness of the median by using the standard deviation of the residuals, because M estimation only uses the median for the weighted value and therefore does not consider the data distribution and is not a function of all the data58. According to Salibian-Barrera and Yohai59, the S-estimator can be defined as \({\widehat{\beta }}_{S}={\text{min}}_{\beta }\,{\widehat{\sigma }}_{sd}\left({e}_{1},{e}_{2},{e}_{3},\dots ,{e}_{n}\right)\) with a robust scale estimator satisfying Eq. (6).

$$min\mathop \sum \limits_{i = 1}^{n} \rho \left( {\frac{{y_{i} - \mathop \sum \nolimits_{j = 0}^{k} x_{ij} \beta }}{{\hat{\sigma }_{sd} }}} \right)$$
(6)

where \({\widehat{\sigma }}_{sd}=\sqrt{\frac{1}{nK}\sum_{i=1}^{n}{w}_{i}{e}_{i}^{2}}\), \(K=0.199\), \({w}_{i}={w}_{\sigma }\left({u}_{i}\right)=\frac{\rho \left({u}_{i}\right)}{{u}_{i}^{2}}\), and the initial estimate is \(\hat{\sigma }_{sd} = \frac{{\text{median}}\left| {e_{i} - {\text{median}}\left( {e_{i} } \right)} \right|}{0.6745}\). The solution is obtained by differentiating with respect to \(\beta\):

$$\mathop \sum \limits_{i = 1}^{n} x_{ij} \psi \left( {\frac{{y_{i} - \mathop \sum \nolimits_{j = 0}^{k} x_{ij} \beta }}{{\hat{\sigma }_{sd} }}} \right) = 0,\quad i = 0,1,2, \ldots ,k$$
(7)

where \(\psi\) is the function obtained as the derivative of \(\rho\):

$$\psi \left( {u_{i} } \right) = \rho^{\prime}\left( {u_{i} } \right) = \left\{ {\begin{array}{*{20}l} {u_{i} \left[ {1 - \left( {\frac{{u_{i} }}{c}} \right)^{2} } \right]^{2} ,} \hfill & {\left| {u_{i} } \right| \le c} \hfill \\ {0, } \hfill & {\left| {u_{i} } \right| > c} \hfill \\ \end{array} } \right.$$
(8)

MM estimation

MM estimation estimates the regression parameters by using S estimation to minimize the scale of the residuals from M estimation and then proceeding with M estimation. The purpose of MM estimation is to obtain estimates that are more efficient and have a high breakdown point. The breakdown point is the percentage of outliers the data can contain before the observations affect the model58,60. The solution of the MM estimator is given as:

$$\mathop \sum \limits_{i = 1}^{n} \rho_{1}^{\prime} \left( {u_{i} } \right)X_{ij} = 0\quad {\text{or}}\quad \mathop \sum \limits_{i = 1}^{n} \rho_{1}^{\prime} \left( {\frac{{Y_{i} - \mathop \sum \nolimits_{j = 0}^{k} X_{ij} \hat{\beta }_{j} }}{{SD_{MM} }}} \right)X_{ij} = 0$$
(9)

where \(SD_{MM}\) represents the standard deviation of the residuals from S estimation and \(\rho\) is Tukey's biweight function, defined as:

$$\rho \left( {u_{i} } \right) = \left\{ {\begin{array}{*{20}l} {\frac{{u_{i}^{2} }}{2} - \frac{{u_{i}^{4} }}{{2c^{2} }} + \frac{{u_{i}^{6} }}{{6c^{4} }},} \hfill & { - c \le u_{i} \le c} \hfill \\ {\frac{{c^{2} }}{6},} \hfill & {u_{i} < - c\;\;{\text{or}}\;\;u_{i} > c{ }.} \hfill \\ \end{array} } \right.$$
(10)
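As an illustration of the M-type robust estimation used later in the hybrid models, the sketch below fits M estimators with Tukey bi-square, Hampel, and Huber weight functions using statsmodels' `RLM`. The tooling and simulated data are assumptions for illustration only; full S and MM estimation are not covered by `RLM` and would typically require other implementations (e.g., R's `lmrob`).

```python
# Sketch: M estimation with different weight functions (bi-square, Hampel, Huber).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)
y[:10] += 15.0                                        # inject a few outliers

X_const = sm.add_constant(X)
norms = {
    "M Bi-Square": sm.robust.norms.TukeyBiweight(),
    "M Hampel": sm.robust.norms.Hampel(),
    "M Huber": sm.robust.norms.HuberT(),
}
for name, norm in norms.items():
    fit = sm.RLM(y, X_const, M=norm).fit()            # iteratively reweighted least squares
    print(name, np.round(fit.params, 3))
```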

Evaluation metric

The suitability and accuracy of the models were evaluated using the mean absolute percentage error (MAPE), mean squared error (MSE) and coefficient of determination \({R}^{2}\), defined in Table 1, where n is the number of observations, \({y}_{i}\) is the actual value and \({\widehat{y}}_{i}\) is the forecast value.

Table 1 Evaluation metric.
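A compact sketch of the three metrics is given below; it uses the standard definitions (MAPE expressed as a percentage), which is an assumption since Table 1 is not reproduced here, and the small arrays are placeholder values.

```python
# Sketch of the evaluation metrics: MAPE (%), MSE, and R^2.
import numpy as np

def mape(y, y_hat):
    return 100.0 * np.mean(np.abs((y - y_hat) / y))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([10.0, 12.0, 15.0, 9.0])      # placeholder actual values
y_hat = np.array([11.0, 11.5, 14.0, 9.5])  # placeholder predictions
print(mape(y, y_hat), mse(y, y_hat), r_squared(y, y_hat))
```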

Results and discussion

The variability of the 29 main drying parameters is shown in Fig. 2, and the heterogeneity among the seaweed drying parameters is identified using the proposed method. The assumption of linearity between the dependent and independent parameters is verified, and it is found that no linear relationship exists between them. The independence of observations is verified using the Durbin-Watson test. The p-value of 0 is less than the level of significance α = 0.05, which reveals that the residuals are autocorrelated and the observations are dependent. Furthermore, the normality assumption is verified using the Kolmogorov–Smirnov test. With a p-value of 0, which is less than the level of significance α = 0.05, there is sufficient evidence that the residuals do not follow a normal distribution.
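The residual diagnostics described above can be reproduced along the following lines; this sketch uses the Durbin-Watson statistic from statsmodels and the Kolmogorov–Smirnov test from SciPy, and the residuals are simulated stand-ins rather than the study's data.

```python
# Sketch of the residual diagnostics: Durbin-Watson for independence,
# Kolmogorov-Smirnov for normality of the standardized residuals.
import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(6)
residuals = rng.normal(size=500)                      # placeholder residuals

dw = durbin_watson(residuals)                         # values near 2 suggest no autocorrelation
z = (residuals - residuals.mean()) / residuals.std()
ks_stat, ks_pvalue = stats.kstest(z, "norm")          # H0: residuals are normally distributed
print(dw, ks_stat, ks_pvalue)
```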

Figure 2

Box Plot for the Seaweed drying parameters.

The MAPE, MSE, and \({R}^{2}\) are shown in Table 2 for the 15, 25, 35 and 45 high-ranking variables selected for the before, after and modified heterogeneity sparse regression models. The MAPE differs across the proposed models. The elastic net BH has a MAPE comparable to that of the LASSO BH, and LASSO BH outperforms the other models (that is, ridge BH, elastic net BH, ridge AH, LASSO AH, elastic net AH, ridge MH, LASSO MH, and elastic net MH). Generally, the higher the number of high-ranking variables selected, the better the predictive accuracy. This is comparable to65,66, where random LASSO performed better. With these results, LASSO BH with 45 high-ranking variables is the best model for determining the moisture content of the seaweed. The MAPE value of 8.149872 denotes the average percentage error between the moisture content removal of the seaweed predicted by the model and the actual value. MAPE, MSE, and SSE are commonly used in model validation: they quantify the agreement between the observations and the predictions and can be used to select the best model among competing models67. In particular, they quantify the variability of the error between the actual data and the predicted values. The R-squared value of 0.8845778 implies that 88.45778% of the variance in the dependent variable, moisture content, can be explained by the selected drying parameters. Figure 3 displays the boxplots of the MAPEs for the 9 different models using only the 45 high-ranking variables. The data include outliers, which is reflected in the outlying MAPE estimates. The elastic net BH has the lowest average MAPE, where the averages are indicated by red circles in Fig. 3.

Table 2 Evaluation metrics.
Figure 3

Boxplot of the MAPEs obtained from the 45 high-ranking variables.

To detect the outliers, the sigma limits and standardized residual plots are examined for each model. Figures 4, 5 and 6 show the standardized residual plots of ridge, LASSO, and elastic net, respectively, for the 45 high-ranking variables. The percentage of outliers is calculated from the number of observations outside the 2-sigma limit, as sketched below.
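The following sketch standardizes the residuals and flags observations outside ±2 standard deviations; it assumes this matches the 2-sigma procedure behind Figs. 4, 5 and 6, and uses placeholder data rather than the study's fitted models.

```python
# Sketch: counting outliers outside the 2-sigma limits of the standardized residuals.
import numpy as np

def count_outliers(y, y_hat, k=2.0):
    residuals = y - y_hat
    z = (residuals - residuals.mean()) / residuals.std()
    mask = np.abs(z) > k                          # observations outside the k-sigma limits
    return mask.sum(), 100.0 * mask.mean()        # count and percentage

rng = np.random.default_rng(7)
y = rng.normal(size=1914)                         # placeholder for the 1914 observations
y_hat = y + rng.normal(scale=1.0, size=1914)      # placeholder predictions
n_out, pct_out = count_outliers(y, y_hat)
print(n_out, f"{pct_out:.2f}%")
```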

Figure 4

Outliers for ridge for the 45 high-ranking variables.

Figure 5

Outliers for LASSO for the 45 high-ranking variables.

Figure 6

Outliers for elastic net for the 45 high-ranking variables.

Table 3 reveals the number of outliers and their respective percentages using the 2-sigma limits. Data can contain outliers because of many factors that cannot be explained, and outliers can affect the predictive accuracy5,6,68,69. Data with outliers make the least squares estimator inefficient, unstable, and unreliable70. In agriculture, data with outliers are frequent58,71.

Table 3 Number and percentage of outliers outside 2 sigma limits before, after and modified heterogeneity parameters.

For the 15 high-ranking variables, ridge BH has the fewest outliers with 77 observations, which is 4.02% of the total observations. For the 25 high-ranking variables, ridge BH has the fewest outliers with 90 observations (4.70%). For the 35 high-ranking variables, ridge MH has the fewest outliers with 91 observations (4.75%). For the 45 high-ranking variables, ridge MH has the fewest outliers with 90 observations (4.70%).

Table 4 shows the number of outliers and their percentages for the hybrid models using the modified sparse regression with robust estimators. For the 45 high-ranking variables, the hybrid model of ridge with the M Hampel estimator before heterogeneity has the smallest number of outliers, 45, compared with the other hybrid models, a reduction of 77% relative to the original model. The hybrid model of ridge with the M Bi-Square estimator after heterogeneity has the smallest number of outliers, 57, a reduction of 66% relative to the original model. The hybrid model of LASSO with the M Bi-Square estimator for modified heterogeneity has the smallest number of outliers, 25, a reduction of 46% relative to the original model. All the best hybrid models, namely ridge with the M Hampel estimator before heterogeneity, ridge with the M Bi-Square estimator after heterogeneity, and LASSO with the M Bi-Square estimator for modified heterogeneity, show that the hybrid models perform significantly better.

Table 4 Comparison between the number and percentage of outliers outside 2 sigma limits for original and hybrid models for before, after and modified heterogeneity for 45 high ranking variables.

Based on these results, the hybrid models of sparse regression with robust regression for before, after, and modified heterogeneity, using the 45 high-ranking variables and a 2-sigma limit, can be used efficiently and effectively to reduce the outliers.

Conclusion and future work

This paper proposes a modified sparse regression model that solves the problem of heterogeneity using seaweed big data. The proposed modified sparse method achieves significantly better estimation accuracy than the other methods once the heterogeneity problems are identified and their impact investigated. According to72,73, a MAPE of less than 10 indicates high prediction accuracy. For high prediction accuracy, the error, which serves as the loss function of the regression model in machine learning74, should be small. The lower the SSE and MSE, the better the predictive ability of the model, and the smaller the MAPE value, the more precise the prediction72,75. In addition, for the hybrid model used to test for the presence of outliers, LASSO with the M Bi-Square estimator achieves significantly better estimation accuracy than the other methods. The importance of the main effects of the drying parameters is also justified in the modified model. In conclusion, the current study proposes LASSO with the M Bi-Square estimator for determining the moisture content of the seaweed.

For future studies, the impact of heterogeneity using a hybrid model with imbalanced data or missing values can be investigated. To develop such a hybrid model, resampling, the synthetic minority oversampling technique (SMOTE) to oversample the minority class, a balanced bagging classifier, threshold moving with ROC curves or precision-recall curves, or a grid search method can be used.