## Abstract

Linear regression is critical for data modelling, especially in the sciences. However, with the abundance of high-dimensional data, datasets often have more explanatory variables than observations. In such circumstances, traditional approaches fail. This paper proposes a modified sparse regression model that solves the problem of heterogeneity, using seaweed big data as a use case. Modified heterogeneity models for ridge, LASSO and elastic net were used to model the data, together with the robust estimators M Bi-Square, M Hampel, M Huber, MM and S. Based on the results, the hybrid models of sparse regression for before, after, and modified heterogeneity robust regression with the 45 high-ranking variables and a 2-sigma limit can be used efficiently and effectively to reduce the outliers. The results confirm that the hybrid model of the modified sparse LASSO with the M Bi-Square estimator for the 45 high-ranking parameters performed better than other existing methods.


## Introduction

Linear regression is commonly applied in every field. The aim of regression analysis is to investigate the relationship between two or more independent variables and a dependent variable. Consider a multiple regression model in matrix form:

$${\varvec{y}} = {\varvec{X}}{\varvec{\beta}} + {\varvec{\varepsilon}}$$

where \({\varvec{y}}\) is an \(n\times 1\) vector of response variables, \({\varvec{X}}\) is the design matrix of order \(n \times p\), \({\varvec{\beta}}\) is a \(p\times 1\) vector of unknown parameters and \({\varvec{\varepsilon}}\) is an \(n\times 1\) vector of independently and identically distributed errors. Ordinary Least Squares (OLS) is generally applied to estimate the unknown parameters in the regression model. According to^{1,2}, the OLS estimator of \(\beta\) is obtained as

$$\hat{{\varvec{\beta}}} = \left({\varvec{X}}^{T}{\varvec{X}}\right)^{-1}{\varvec{X}}^{T}{\varvec{y}}$$

by minimizing the sum of squared errors \({\varvec{\varepsilon}}^{T}{\varvec{\varepsilon}} = \left({\varvec{y}}-{\varvec{X}}{\varvec{\beta}}\right)^{T}\left({\varvec{y}}-{\varvec{X}}{\varvec{\beta}}\right)\).
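The OLS estimator described above can be sketched numerically. A minimal sketch on synthetic data (all values here are illustrative, not from the seaweed dataset):

```python
import numpy as np

# Minimal sketch of the OLS estimator beta_hat = (X'X)^{-1} X'y on synthetic data.
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 covariates
beta_true = np.array([2.0, 1.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations (X'X) beta = X'y instead of forming an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true
```

Solving the normal equations directly is the textbook form; in practice a least-squares solver (e.g. `np.linalg.lstsq`) is numerically safer when \(X^{T}X\) is ill-conditioned, which is exactly the regime the paper addresses.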

One of the problems that can occur is when the number of explanatory variables is greater than the number of observations. Another is the issue of outliers. Observations that deviate from the general shape or pattern of the distribution are called outliers^{3,4}. Outliers are common in agricultural data because of natural variation, and they inflate the standard errors and error variance^{5,6}. Outliers can arise from many non-homogeneous observations and from measurement errors^{7}.

An effective method for analysing data contaminated by outliers is robust regression^{8,9,10,11}. Robust regression can detect outliers and still provide accurate results when anomalies are present^{12}. If the data have outliers, there will be heterogeneity^{13,14}.

Heterogeneity is the degree of variability within the data^{15}. Traditional approaches to linear regression, like OLS, fail when there are heterogeneity problems: the OLS estimator cannot be computed when the number of observations is less than the number of parameters (that is, when *n* < *p*), or the estimate becomes inefficient and unstable. Hence, it is crucial to acknowledge the treatment effects of heterogeneity^{16}. Heterogeneity causes the standard errors to be biased and inconsistent. Feczko and Fair^{17} stated that the homogeneity assumption is a big challenge and leads to the heterogeneity problem. Feczko et al.^{18} emphasized supervised and unsupervised statistical approaches for studying heterogeneity problems. Assunção et al.^{19} noted that while heterogeneity is usually considered in regression analysis, missing data affects the ability to control for the significant variables when estimating the regression coefficients, which affects the stability of the inference. Gormley and Matsa^{20} claimed that a vital problem in empirical financial research is whether, and how, to control for unobserved heterogeneity.

Sparse regression can be used to regulate the coefficients of the regression. This will reduce the sampling error and variance and solve the problem of multicollinearity. It is used in many scientific disciplines^{21}. Sparse regression is used to increase predictive accuracy in high-dimensional situations^{22,23,24,25,26,27,28}.

In this paper, a new regression method, the modified heterogeneity sparse regression model, is proposed to tackle the problem of heterogeneity. The contributions of the paper are: (i) the modified heterogeneity sparse regression model is introduced, in which the heterogeneity parameters are identified using the idea of the variance inflation factor, the significant variables are selected, and the main effects of the parameters are included back in the model; (ii) significant parameters are selected using the 15, 25, 35, and 45 highest-ranking variables; (iii) the impacts of before, after and modified heterogeneity are evaluated through validation metrics; (iv) a hybrid model to reduce outliers is developed.

Figure 1 shows the flowchart of the methodologies used to achieve the objectives of the study. It shows the addition of all possible models up to second order and the testing of different assumptions. The 15, 25, 35 and 45 parameters are selected because feature selection can only provide the rank of important variables and does not tell us the number of significant factors. The insignificant parameters are dropped, and the parameters that exhibit heterogeneity are later included in the modified model. Next, the validation metrics are computed using the mean absolute percentage error (MAPE), mean squared error (MSE) and coefficient of determination (R^{2}). The hybrid models are then developed for before, after, and modified heterogeneity using robust methods and machine learning models. The robust methods used are M Bi-Square, M Hampel, M Huber, MM and S. Finally, the validation metrics are computed using the 2-sigma limits to identify the number of outliers.

## All possible models and the selected model using 15, 25, 35 and 45 high ranking variables

In this study, data from 1914 observations collected in Sabah, Malaysia are used to investigate the impact of 29 different variables on one dependent variable. With 29 variables there are \({2}^{29}=536870912\) candidate equations, and studying all 536,870,912 of them would be impossible. Therefore, the models are restricted to main effects and second-order (two-way) interactions. With this restriction, the study comprises the effects of 435 distinct main-effect and two-way interaction independent variables on one response variable (the moisture content), which leads to big data. The concept of big data has led to advancements in data science using machine learning^{29}. The total number of models can be computed by Eq. (3).

where \(M\) represents the total number of possible models, \(f\) the total number of explanatory variables and \(j=1,2,3,\dots ,f\). The 15, 25, 35 and 45 highest-ranking variables are selected because feature selection provides the ranks of important variables but not the number of significant factors^{30}. There is also no rule for choosing the number of parameters to be incorporated in a prediction model^{31}, and the algorithms can only report ranks, not the number of significant parameters^{32}. The interaction variables need to be considered because of their vital role in real data. Interactions are crucial for describing relationships among the variables in the model, and they allow more hypotheses to be tested^{33,34}, although studying the asymptotic and statistical inferences of second-order terms is challenging because of their complex covariance structure^{35}.
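The counts quoted above can be checked with a few lines of arithmetic: 29 main effects give \(2^{29}\) subsets, and restricting to main effects plus two-way interactions gives 29 + C(29, 2) = 435 candidate terms:

```python
import math

# Sketch of the model-count arithmetic: 29 main effects and their two-way interactions.
f = 29
assert 2 ** f == 536_870_912           # all subsets of the 29 main effects
n_terms = f + math.comb(f, 2)          # 29 main effects + 406 two-way interactions
print(n_terms)                         # 435 candidate variables
```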

### The proposed method

The traditional methods normally used for modelling assume a homogeneous model, and in machine learning data scientists often overlook heterogeneity. In this study, a method is proposed to identify heterogeneity parameters. Heterogeneity refers to the variability of observations; this variability leads to inconsistent estimates and distorts conclusions^{20}. Suppose the multiple linear regression:

where \({Y}_{i}\), \(i=1, 2,\dots ,n\) is the response value (moisture content) for the ith case, the \(\beta\)'s are the regression coefficients for the explanatory variables (drying parameters) \(T\)'s, the \({a}_{j}\), \(j=1, 2,\dots ,f\), denote the parameters that exhibit heterogeneity, and \(\varepsilon\) is the random error. If the estimates of the regression equation are computed and a crucial variable is omitted, then the estimate of \(\beta\) will be biased and inconsistent. It is also possible that some variables are correlated with the error term, which violates the assumptions of regression. The coefficient of determination can be written as \({R}^{2}=1-\frac{1}{VIF}\); if \(R^{2}\) satisfies certain conditions, the parameter is said to exhibit heterogeneity. Cheng et al.^{36} stated that the variance inflation factor in multiple regression is used to quantify the level of severity. It can be computed with \({R}_{l}^{2}\), where \({R}_{l}^{2}\), for \(l=1,2,\dots ,p\), denotes the coefficient of determination between the \(l{\text{th}}\) variable \({x}_{l}\) in the predictor matrix and the remaining variables.

Let

So that \({r}_{xx}\) will be the correlation matrix representing the \({\varvec{X}}\) variables.

Since

The \({VIF}_{l}\), for \(l=1,2,3,\dots ,p-1\), is the \(l{\text{th}}\) diagonal element of \({r}_{{\varvec{X}}{\varvec{X}}}^{-1}\). The proof is given for \(l=1\); the rows and columns of \({r}_{{\varvec{X}}{\varvec{X}}}\) can be permuted to obtain the result for the remaining \(l\).

Let

By applying Schur’s complement,

where \({\beta }_{{1X}_{\left(-1\right)}}\) means the regression coefficient of \({X}_{1}\) on \({X}_{2},\dots ,{X}_{p-1}\) except the intercept. For clarity, \({R}_{1}^{2}\) and \({VIF}_{1}\) can be written as:

and \(VIF_{1} = r_{XX}^{ - 1} \left( {1,1} \right) = \frac{1}{{1 - R_{1}^{2} }}\).
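The identity \(VIF_{1} = r_{XX}^{-1}(1,1) = 1/(1-R_{1}^{2})\) can be verified numerically on synthetic data (the variable names and data below are illustrative, not the study's):

```python
import numpy as np

# Numeric check: the first diagonal element of the inverse correlation matrix
# equals 1/(1 - R_1^2), where R_1^2 comes from regressing x_1 on the rest.
rng = np.random.default_rng(1)
n, p = 200, 4
Z = rng.normal(size=(n, p))
Z[:, 0] += 0.8 * Z[:, 1]                       # induce correlation so VIF_1 > 1

X = (Z - Z.mean(0)) / Z.std(0)                 # standardize columns
r_xx = (X.T @ X) / n                           # correlation matrix of the predictors
vif_1 = np.linalg.inv(r_xx)[0, 0]

# R_1^2 from regressing x_1 on the remaining (centered) predictors
x1, X_rest = X[:, 0], X[:, 1:]
b = np.linalg.lstsq(X_rest, x1, rcond=None)[0]
resid = x1 - X_rest @ b
R2_1 = 1 - resid @ resid / (x1 @ x1)

print(vif_1, 1 / (1 - R2_1))                   # the two values agree
```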

### Ridge regression (RR)

In ridge regression, the ridge parameter plays a crucial role in parameter estimation. The ridge regression estimator is an alternative to the OLS estimator when there is multicollinearity^{37}.

From Eq. (1), let \({y}_{i}\) be the response and \({x}_{i}={\left({x}_{i1},{x}_{i2},{x}_{i3},\dots ,{x}_{ip} \right)}^{T}\) the covariate vector for the ith case. Least squares is the most common estimation method, where the coefficients \(\beta ={\left({\beta }_{0},{\beta }_{1},{\beta }_{2},\dots ,{\beta }_{p} \right)}^{T}\) are chosen to minimize the sum of squared residuals (SSR)^{38,39,40}. The SSR is given as:

$$SSR = \sum_{i=1}^{n}\left({y}_{i}-{\beta }_{0}-\sum_{j=1}^{p}{\beta }_{j}{x}_{ij}\right)^{2}$$

The ridge regression coefficient estimates \(\hat{\beta }^{RR}\) minimize^{38,39,40,41,42}:

$$L^{RR}\left(\beta\right) = \sum_{i=1}^{n}\left({y}_{i}-{\beta }_{0}-\sum_{j=1}^{p}{\beta }_{j}{x}_{ij}\right)^{2} + \lambda \sum_{j=1}^{p}{\beta }_{j}^{2} = SSR + \lambda \sum_{j=1}^{p}{\beta }_{j}^{2}$$

where \(\lambda \ge 0\) denotes the regularization parameter controlling the amount of shrinkage. Ridge regression estimates the coefficients that make the SSR small and fit the data well. The term \(\lambda \sum_{j=1}^{p}{\beta }_{j}^{2}\) is the shrinkage penalty. Ridge regression has an advantage over least squares because of the bias-variance trade-off: the prediction error is reduced by shrinking large regression coefficients.
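The shrinkage behaviour can be sketched with scikit-learn's `Ridge` (where the regularization parameter \(\lambda\) is called `alpha`); the data here are synthetic and illustrative only:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Sketch of ridge shrinkage: as lambda (alpha in scikit-learn) grows,
# the penalized coefficient norm shrinks toward zero but never reaches it.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

norms = []
for lam in (0.01, 1.0, 100.0):
    model = Ridge(alpha=lam).fit(X, y)
    norms.append(np.sum(model.coef_ ** 2))
print(norms)  # decreasing coefficient norms as the penalty grows
```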

### LASSO regression (LR)

The drawback of ridge regression is the inclusion of all *p* explanatory variables in the final model. The penalty \(\lambda \sum_{j=1}^{p}{\beta }_{j}^{2}\) in Eq. (5) shrinks all the coefficients towards zero, but not exactly to zero^{43}. LASSO regression is a relatively simple alternative to ridge regression that overcomes this drawback. The LASSO coefficients \({\widehat{\beta }}_{\lambda }^{LASSO}\) minimize the quantity \(L^{LR} \left( \beta \right) =\) \(\mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - \beta_{0} - \mathop \sum \limits_{j = 1}^{p} \beta_{j} x_{ij} } \right)^{2}\) \(+ \lambda \mathop \sum \limits_{j = 1}^{p} \left| {\beta_{j} } \right| = SSR + \lambda \mathop \sum \limits_{j = 1}^{p} \left| {\beta_{j} } \right|\)^{41,42,44,45}. The LASSO uses an \({L}_{1}\) penalty instead of an \({L}_{2}\) penalty and shrinks the coefficient estimates towards zero, setting some exactly to zero^{28}.

### Elastic net regression (ENR)

The ENR is an extension of the LASSO that is robust to high correlations among the explanatory variables and was introduced to address the instability of variable selection in the LASSO regression model^{41,46,47}. The penalties of ridge and LASSO are combined to obtain the best of both. The aim is to minimize the following loss function^{38,39,40,41,42,44,45,48,49}:

$$L^{ENR}\left(\beta\right) = SSR + \lambda \left(\alpha \sum_{j=1}^{p}\left|{\beta }_{j}\right| + \left(1-\alpha \right)\sum_{j=1}^{p}{\beta }_{j}^{2}\right)$$

where \(\alpha\) is the mixing parameter between LASSO and ridge.
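The contrast between the \(L_1\) and mixed penalties can be sketched with scikit-learn's `Lasso` and `ElasticNet` (whose `l1_ratio` plays the role of the mixing parameter \(\alpha\)); the data are synthetic, with only one truly active variable:

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

# Sketch: an L1 (or mixed) penalty can set coefficients exactly to zero,
# which the pure ridge penalty cannot do.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=100)  # one active variable, 19 noise variables

lasso = Lasso(alpha=0.5).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)

# Both keep the active variable and zero out most noise variables.
print(np.sum(lasso.coef_ != 0), np.sum(enet.coef_ != 0))
```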

Since the presence of outliers influences the performance of ordinary least squares, robust regression methods that perform better and are not easily influenced by outliers have been developed^{50}. These methods include MM estimation, M estimation, Least Trimmed Squares (LTS), S estimation, Least Median of Squares (LMS), and the Least Absolute Value method (LAV)^{51,52,53,54,55,56}. This study focuses on M, MM and S estimation.

### S estimation

S estimators were proposed by Rousseeuw and Yohai^{57}. The S estimator is a regression estimator based on M-scales, built on the residual scale of M estimation, and it uses a robust residual scale to overcome a weakness of M estimation: because M estimation uses only the median for the weighting, it does not account for the data distribution and is not a function of all the data^{58}. According to Salibian-Barrera and Yohai^{59}, the S-estimator can be defined as \({\widehat{\beta }}_{S}={min}_{\beta }{\widehat{\sigma }}_{sd}\left({e}_{1},{e}_{2},{e}_{3},\dots ,{e}_{n}\right)\) with a robust scale estimator satisfying Eq. (6).

where \({\widehat{\sigma }}_{sd}=\sqrt{\frac{1}{nK}\sum_{i=1}^{n}{w}_{i}{e}_{i}^{2}}\), \(K=0.199\), \({w}_{i}={w}_{\sigma}\left({u}_{i}\right)=\frac{\rho \left({u}_{i}\right)}{{u}_{i}^{2}}\), and the initial scale estimate is \(\hat{\sigma }_{sd} = \frac{{median\left| {e_{i} - median\left( {e_{i} } \right)} \right|}}{0.6745}\). The solution is obtained by differentiating with respect to \(\beta\).

where \(\Psi\) is the function defined as the derivative of \(\rho\):
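The initial robust scale above, the normalized median absolute deviation, can be sketched on contaminated data to show why it is preferred to the raw standard deviation (synthetic residuals, illustrative only):

```python
import numpy as np

# Sketch of the initial robust scale used in S estimation:
# sigma_hat = median(|e_i - median(e_i)|) / 0.6745 (the normalized MAD).
rng = np.random.default_rng(4)
e = rng.normal(size=1000)
e[:20] += 15.0                               # plant a few gross outliers

mad_scale = np.median(np.abs(e - np.median(e))) / 0.6745
print(mad_scale, e.std())                    # MAD stays near 1; the raw SD is inflated
```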

### MM estimation

MM estimation combines S and M estimation: the regression parameters are first estimated by S estimation, the scale of the residuals from this fit is obtained, and the procedure then proceeds to M estimation. The purpose of MM estimation is to obtain estimates that are more efficient and have a high breakdown point. The breakdown point is the percentage of outliers the data can contain before the observations affect the model^{58,60}. The solution of the MM estimator is given as:

where \(SD_{MM}\) represents the standard deviation of the residuals from the S estimation and \(\rho\) is Tukey's biweight function, defined as:

$$\rho \left(u\right)=\begin{cases} \frac{{c}^{2}}{6}\left\{1-{\left[1-{\left(\frac{u}{c}\right)}^{2}\right]}^{3}\right\}, & \left|u\right|\le c \\ \frac{{c}^{2}}{6}, & \left|u\right|>c \end{cases}$$

### Evaluation metric

The suitability and accuracy of the models were evaluated using the mean absolute percentage error (MAPE), mean squared error (MSE) and coefficient of determination \({R}^{2}\) shown in Table 1, where *n* is the number of observations, \({y}_{i}\) is the actual value and \({\widehat{y}}_{i}\) is the forecast value.
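The three validation metrics can be written as short functions (the toy values below are illustrative, not study results):

```python
import numpy as np

# Minimal sketch of the three validation metrics in Table 1;
# y is the actual value and y_hat the forecast value.
def mape(y, y_hat):
    return np.mean(np.abs((y - y_hat) / y)) * 100

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

y = np.array([10.0, 12.0, 15.0, 20.0])
y_hat = np.array([11.0, 12.0, 14.0, 19.0])
print(mape(y, y_hat), mse(y, y_hat), r2(y, y_hat))
```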

## Results and discussion

The variability of the 29 main drying parameters is shown in Fig. 2, and the heterogeneity among the seaweed drying parameters is identified using the proposed method. The assumption of linearity between the dependent and independent parameters is verified, and no linear relationship is found between them. The independence of observations is verified using the Durbin-Watson test. The results show that the p-value of 0 is less than the level of significance α = 0.05, which reveals that the residuals are autocorrelated and the observations are dependent. Furthermore, normality is verified using the Kolmogorov–Smirnov test; with a p-value of 0, less than the level of significance α = 0.05, there is sufficient evidence that the residuals do not follow a normal distribution.
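The two assumption checks described above can be sketched on illustrative residuals using standard library routines (the residuals here are simulated, so both tests pass; the study's residuals failed them):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

# Sketch: Durbin-Watson for independence and Kolmogorov-Smirnov for normality,
# applied to simulated well-behaved residuals.
rng = np.random.default_rng(6)
resid = rng.normal(size=500)

dw = durbin_watson(resid)                     # near 2 => little autocorrelation
ks_stat, ks_p = stats.kstest(resid, "norm")   # large p => no evidence against normality
print(dw, ks_stat, ks_p)
```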

The MAPE, MSE, and \({R}^{2}\) are shown in Table 2 for the 15, 25, 35 and 45 high-ranking variables selected for the before, after and modified heterogeneity sparse regression models. The accuracy of the proposed models differs according to the MAPE. The elastic net BH has an MAPE comparable to the LASSO BH, and LASSO BH outperforms the other models (that is, ridge BH, elastic net BH, ridge AH, LASSO AH, elastic net AH, ridge MH, LASSO MH, and elastic net MH). Generally, the higher the number of high-ranking variables selected, the better the predictive accuracy. This is comparable to^{65,66}, where random LASSO performed better. With these results, LASSO BH with 45 high-ranking variables is the best for determining the moisture content of the seaweed. The MAPE value (8.149872) denotes the average percentage error between the moisture content removal of the seaweed predicted by the model and the real value. MAPE, MSE, and SSE are commonly used in model validation to assess model accuracy: they quantify the agreement between the observations and predictions and can be used to select the best among competing models^{67}. In particular, they quantify the variability of the error between the actual data and the predicted values. The R-square value (0.8845778) implies that 88.45778% of the variance in the dependent variable, moisture content, can be explained by the selected drying parameters. Figure 3 displays the boxplots of the MAPEs for the nine models using only the 45 high-ranking variables. The data include outliers, and this is reflected in the outlying estimates for the MAPEs. The elastic net BH has the lowest average MAPE, where the averages are indicated by red circles in Fig. 3.

To observe and detect the outliers, the sigma limits and standardized residual plots are observed for each model in Figs. 4, 5 and 6 which show the standardized residual plots of ridge, LASSO, and elastic net respectively, for 45 high ranking variables. The percentage of outliers is calculated using the number of observations outside the 2-sigma limit.
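The 2-sigma rule can be sketched directly: standardize the residuals and count the observations outside the ±2 limits (the residuals below are simulated with the study's sample size of 1914; the counts are illustrative, not the study's results):

```python
import numpy as np

# Sketch of the 2-sigma outlier count on standardized residuals.
rng = np.random.default_rng(7)
resid = rng.normal(size=1914)                 # same sample size as the study

std_resid = (resid - resid.mean()) / resid.std()
n_outliers = np.sum(np.abs(std_resid) > 2)
print(n_outliers, 100 * n_outliers / resid.size)  # roughly 4-5% under normality
```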

Table 3 reveals the number of outliers with their respective percentages using the 2-sigma limits. Data can have outliers because of many factors that cannot be explained, and the outliers can affect the predictive accuracy^{5,6,68,69}. Data with outliers make the least squares estimator inefficient, unstable, and unreliable^{70}. In the area of agriculture, data with outliers are frequent^{58,71}.

For the 15 high-ranking variables, ridge BH has the fewest outliers, with 77 observations (4.02% of the total). For the 25 high-ranking variables, ridge BH has the fewest outliers, with 90 observations (4.70%). For the 35 high-ranking variables, ridge MH has the fewest outliers, with 91 observations (4.75%). For the 45 high-ranking variables, ridge MH has the fewest outliers, with 90 observations (4.70%).

Table 4 shows the number of outliers and their percentages for the hybrid modified sparse regression models with robust estimators, for the 45 high-ranking variables. Before heterogeneity, the hybrid model of ridge with the M Hampel estimator has the smallest number of outliers (45) among the hybrid models, a reduction of 77% compared to the original model. After heterogeneity, the hybrid model of ridge with the M Bi-Square estimator has the smallest number of outliers (57), a reduction of 66% compared to the original model. For modified heterogeneity, the hybrid model of LASSO with the M Bi-Square estimator has the smallest number of outliers (25), a reduction of 46% compared to the original model. All the best hybrid models (ridge with M Hampel before heterogeneity, ridge with M Bi-Square after heterogeneity, and LASSO with M Bi-Square for modified heterogeneity) show that the hybrid approach yields significantly better performance.

Based on the results, the hybrid models of sparse regression for before, after, and modified heterogeneity robust regression with the 45 high-ranking variables and a 2-sigma limit can be used efficiently and effectively to reduce the outliers.

## Conclusion and future work

This paper proposes a modified sparse regression model that solves the problem of heterogeneity using seaweed big data. The proposed modified sparse method achieves significantly better estimation accuracy than the other methods once the heterogeneity problems are identified and their impact investigated. According to^{72,73}, if the MAPE is less than 10, prediction accuracy is high. For high prediction accuracy, the error, which serves as the loss function for the regression model in machine learning^{74}, should be small. The lower the SSE and MSE, the better the predictive ability of the model, and the smaller the MAPE, the more precise the prediction^{72,75}. In addition, for the hybrid model tested for the presence of outliers, LASSO with the M Bi-Square estimator achieves significantly better estimation accuracy than the other methods. The importance of the main effects of the drying parameters was also justified in the modified model. In conclusion, the current study proposes LASSO with the M Bi-Square estimator for determining the moisture content of the seaweed.

For future studies, the impact of heterogeneity using a hybrid model with imbalanced data or missing values can be investigated. To develop such a hybrid model, resampling, the synthetic minority oversampling technique (SMOTE) to oversample the minority class, a balanced bagging classifier, threshold moving with ROC curves or precision-recall curves, or a grid search method can be used.

## Data availability

All data are included in this article.

## References

1. Gujarati, D. N. & Porter, D. N. *Basic Econometrics* 4th edn. (The McGraw-Hill Companies, 2004).
2. Obadina, O. G., Adedotun, A. F. & Odusanya, O. A. Ridge estimation’s effectiveness for multiple linear regression with multicollinearity: An investigation using Monte-Carlo simulations. *J. Niger. Soc. Phys. Sci.* **3**(4), 278–281. https://doi.org/10.46481/jnsps.2021.304 (2021).
3. Yusuf, A. B., Dima, R. M. & Aina, S. K. Optimized breast cancer classification using feature selection and outliers detection. *J. Niger. Soc. Phys. Sci.* **3**(4), 298–307. https://doi.org/10.46481/jnsps.2021.331 (2021).
4. Ibidoja, O. J., Shan, F. P., Sulaiman, J. & Ali, M. K. M. Robust M-estimators and machine learning algorithms for improving the predictive accuracy of seaweed contaminated big data. *J. Nig. Soc. Phys. Sci.* **5**, 1137. https://doi.org/10.46481/jnsps.2022.1137 (2023).
5. Rajarathinam, A. & Vinoth, B. Outlier detection in simple linear regression models and robust regression—A case study on wheat production data. *Int. J. Sci. Res.* **3**(2), 531–536 (2014).
6. Lim, H. Y., Fam, P. S., Javaid, A. & Ali, M. K. M. Ridge regression as efficient model selection and forecasting of fish drying using v-groove hybrid solar drier. *Pertanika J. Sci. Technol.* **28**(4), 1179–1202. https://doi.org/10.47836/pjst.28.4.04 (2020).
7. Khezrimotlagh, D., Cook, W. D. & Zhu, J. A nonparametric framework to detect outliers in estimating production frontiers. *Eur. J. Oper. Res.* **286**(1), 375–388. https://doi.org/10.1016/j.ejor.2020.03.014 (2020).
8. Kepplinger, D. Robust variable selection and estimation via adaptive elastic net S-estimators for linear regression. *Comput. Stat. Data Anal.* **183**, 107730. https://doi.org/10.1016/j.csda.2023.107730 (2023).
9. Mukhtar, M. K., Ali, M., Javaid, A., Ismail, M. T. & Fudholi, A. Accurate and hybrid regularization—Robust regression model in handling multicollinearity and outlier using 8SC for big data. *Math. Model. Eng. Probl.* **8**(4), 547–556. https://doi.org/10.18280/mmep.080407 (2021).
10. Mukhtar, M. *et al.* Hybrid model in machine learning–robust regression applied for sustainability agriculture and food security. *Int. J. Electr. Comput. Eng.* **12**(4), 4457–4468. https://doi.org/10.11591/ijece.v12i4.pp4457-4468 (2022).
11. Javaid, A., Ismail, M. T. & Ali, M. K. M. Comparison of sparse and robust regression techniques in efficient model selection for moisture ratio removal of seaweed using solar drier. *Pertanika J. Sci. Technol.* **28**(2), 609–625 (2020).
12. Muthukrishnan, R., Reka, R. & Boobalan, E. D. Robust regression procedure for model fitting with application to image analysis. *Int. J. Stat. Syst.* **12**(1), 79 (2017).
13. Collins, R. E., Carpenter, S. D. & Deming, J. W. Spatial heterogeneity and temporal dynamics of particles, bacteria, and pEPS in Arctic winter sea ice. *J. Mar. Syst.* **74**(3–4), 902–917. https://doi.org/10.1016/j.jmarsys.2007.09.005 (2008).
14. Rowe, S. J., White, I. M. S., Avendaño, S. & Hill, W. G. Genetic heterogeneity of residual variance in broiler chickens. *Genet. Sel. Evolut.* **38**(6), 617–635. https://doi.org/10.1051/gse:2006025 (2006).
15. Ibidoja, O. J., Shan, F. P., Sulaiman, J. & Ali, M. K. M. Detecting heterogeneity parameters and hybrid models for precision farming. *J. Big Data* https://doi.org/10.1186/s40537-023-00810-8 (2023).
16. Ranjbar, S., Salvati, N. & Pacini, B. Estimating heterogeneous causal effects in observational studies using small area predictors. *Comput. Stat. Data Anal.* https://doi.org/10.1016/j.csda.2023.107742 (2023).
17. Feczko, E. & Fair, D. A. Methods and challenges for assessing heterogeneity. *Biol. Psychiatry* **88**(1), 9–17. https://doi.org/10.1016/j.biopsych.2020.02.015 (2020).
18. Feczko, E. *et al.* The heterogeneity problem: Approaches to identify psychiatric subtypes. *Trends Cognit. Sci.* **23**(7), 584–601. https://doi.org/10.1016/j.tics.2019.03.009 (2019).
19. Assunção, J., Burity, P. & Medeiros, M. C. Unobserved heterogeneity in regression models: A semiparametric approach based on nonlinear sieves. *Braz. Rev. Econom.* **35**(1), 47–63 (2015).
20. Gormley, T. A. & Matsa, D. A. Common errors: How to (and not to) control for unobserved heterogeneity. *Rev. Financ. Stud.* **27**(2), 617–661. https://doi.org/10.1093/rfs/hht047 (2014).
21. Ahrens, A., Hansen, C. B. & Schaffer, M. E. lassopack: Model selection and prediction with regularized regression in Stata. *Stata J.* **20**(1), 176–235. https://doi.org/10.1177/1536867X20909697 (2020).
22. Ma, S., Fildes, R. & Huang, T. Demand forecasting with high dimensional data: The case of SKU retail sales forecasting with intra- and inter-category promotional information. *Eur. J. Oper. Res.* **249**(1), 245–257. https://doi.org/10.1016/j.ejor.2015.08.029 (2016).
23. Zhang, Y., Zhu, R., Chen, Z., Gao, J. & Xia, D. Evaluating and selecting features via information theoretic lower bounds of feature inner correlations for high-dimensional data. *Eur. J. Oper. Res.* **290**(1), 235–247. https://doi.org/10.1016/j.ejor.2020.09.028 (2021).
24. Pun, C. S. & Wong, H. Y. A linear programming model for selection of sparse high-dimensional multiperiod portfolios. *Eur. J. Oper. Res.* **273**(2), 754–771. https://doi.org/10.1016/j.ejor.2018.08.025 (2019).
25. Vincent, M. & Hansen, N. R. Sparse group lasso and high dimensional multinomial classification. *Comput. Stat. Data Anal.* **71**, 771–786. https://doi.org/10.1016/j.csda.2013.06.004 (2014).
26. Belloni, A. & Chernozhukov, V. *High Dimensional Sparse Econometric Models: An Introduction* (Springer, 2011).
27. Wang, Q. & Yin, X. A nonlinear multi-dimensional variable selection method for high dimensional data: Sparse MAVE. *Comput. Stat. Data Anal.* **52**(9), 4512–4520. https://doi.org/10.1016/j.csda.2008.03.003 (2008).
28. Algamal, Z. Y., Lee, M. H. & Al-Fakih, A. M. High-dimensional quantitative structure-activity relationship modeling of influenza neuraminidase a/PR/8/34 (H1N1) inhibitors based on a two-stage adaptive penalized rank regression. *J. Chemom.* **30**(2), 50–57. https://doi.org/10.1002/cem.2766 (2016).
29. Arif, A., Alghamdi, T. A., Khan, Z. A. & Javaid, N. Towards efficient energy utilization using big data analytics in smart cities for electricity theft detection. *Big Data Res.* https://doi.org/10.1016/j.bdr.2021.100285 (2022).
30. Drobnič, F., Kos, A. & Pustišek, M. On the interpretability of machine learning models and experimental feature selection in case of multicollinear data. *Electronics* https://doi.org/10.3390/electronics9050761 (2020).
31. Chowdhury, M. Z. I. & Turin, T. C. Variable selection strategies and its importance in clinical prediction modelling. *Fam. Med. Community Health* https://doi.org/10.1136/fmch-2019-000262 (2020).
32. Kaneko, H. Examining variable selection methods for the predictive performance of regression models and the proportion of selected variables and selected random variables. *Heliyon* **7**(6), 1–12. https://doi.org/10.1016/j.heliyon.2021.e07356 (2021).
33. Whisman, M. A. & McClelland, G. H. Designing, testing, and interpreting interactions and moderator effects in family research. *J. Fam. Psychol.* **19**(1), 111–120. https://doi.org/10.1037/0893-3200.19.1.111 (2005).
34. Aiken, L. S., West, S. G. & Reno, R. R. *Multiple Regression: Testing and Interpreting Interactions* (Sage, 1991).
35. Hao, N. & Zhang, H. H. A note on high dimensional linear regression with interactions. *Am. Stat.* **71**(4), 291–297 (2017).
36. Cheng, J., Sun, J., Yao, K., Xu, M. & Cao, Y. A variable selection method based on mutual information and variance inflation factor. *Spectrochim. Acta A Mol. Biomol. Spectrosc.* https://doi.org/10.1016/j.saa.2021.120652 (2022).
37. Hoerl, A. E. & Kennard, R. W. Ridge regression: Biased estimation for nonorthogonal problems. *Technometrics* **42**, 80 (1970).
38. Yildirim, H. & Revan Özkale, M. The performance of ELM based ridge regression via the regularization parameters. *Expert Syst. Appl.* **134**, 225–233. https://doi.org/10.1016/j.eswa.2019.05.039 (2019).
39. Moreno-Salinas, D., Moreno, R., Pereira, A., Aranda, J. & de la Cruz, J. M. Modelling of a surface marine vehicle with kernel ridge regression confidence machine. *Appl. Soft Comput. J.* **76**, 237–250. https://doi.org/10.1016/j.asoc.2018.12.002 (2019).
40. Melkumova, L. E. & Shatskikh, S. Y. Comparing Ridge and LASSO estimators for data analysis. In *Procedia Engineering*, 746–755 (Elsevier Ltd, 2017). https://doi.org/10.1016/j.proeng.2017.09.615.
41. García-Nieto, P. J., García-Gonzalo, E. & Paredes-Sánchez, J. P. Prediction of the critical temperature of a superconductor by using the WOA/MARS, Ridge, Lasso and Elastic-net machine learning techniques. *Neural Comput. Appl.* **33**(24), 17131–17145. https://doi.org/10.1007/s00521-021-06304-z (2021).
42. Hastie, T., Tibshirani, R. & Friedman, J. *The Elements of Statistical Learning* (Springer, 2011).
43. Exterkate, P., Groenen, P. J. F., Heij, C. & van Dijk, D. Nonlinear forecasting with many predictors using kernel ridge regression. *Int. J. Forecast.* **32**(3), 736–753. https://doi.org/10.1016/j.ijforecast.2015.11.017 (2016).
44. Melkumova, L. E. & Shatskikh, S. Y. Comparing Ridge and LASSO estimators for data analysis. In *Procedia Engineering*, 746–755 (Elsevier Ltd, 2017). https://doi.org/10.1016/j.proeng.2017.09.615.
45. Spencer, B., Alfandi, O. & Al-Obeidat, F. A refinement of Lasso regression applied to temperature forecasting. In *Procedia Computer Science*, 728–735 (Elsevier B.V., 2018). https://doi.org/10.1016/j.procs.2018.04.127.
46. Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. *J. Stat. Soft.* https://doi.org/10.18637/jss.v033.i01 (2010).
47. Ogutu, J. O., Schulz-Streeck, T. & Piepho, H. P. Genomic selection using regularized linear regression models: Ridge regression, lasso, elastic net and their extensions. *BMC Proc.* https://doi.org/10.1186/1753-6561-6-S2-S10 (2012).
48. Wang, S., Ji, B., Zhao, J., Liu, W. & Xu, T. Predicting ship fuel consumption based on LASSO regression. *Transp. Res. D Transp. Environ.* **65**, 817–824. https://doi.org/10.1016/j.trd.2017.09.014 (2018).
49. Al-Obeidat, F., Spencer, B. & Alfandi, O. Consistently accurate forecasts of temperature within buildings from sensor data using ridge and lasso regression. *Future Gener. Comput. Syst.* **110**, 382–392. https://doi.org/10.1016/j.future.2018.02.035 (2020).
50. Jegede, S. L., Lukman, A. F., Ayinde, K. & Odeniyi, K. A. Jackknife Kibria-Lukman M-estimator: Simulation and application. *J. Niger. Soc. Phys. Sci.* **4**(2), 251–264. https://doi.org/10.46481/jnsps.2022.664 (2022).
51. Rousseeuw, P. J. *Robust Estimation and Identifying Outliers* (Edegem, 1990).
52. Berk, R. A. A primer on robust regression. In *Modern Methods of Data Analysis*, 292–323 (Sage Publications, Newbury Park, 1990).
53. Almetwally, E. & Almongy, H. Comparison between M-estimation, S-estimation, and MM estimation methods of robust estimation with application and simulation. *Int. J. Math. Arch.* **9**(11), 55 (2018).
54. Mohamed, A. E., Almongy, H. M. & Mohamed, A. H. Comparison between M-estimation, S-estimation, and MM estimation methods of robust estimation with application and simulation. *Int. J. Math. Arch.* **9**(11), 55 (2018).
55. Alma, Ö. G. Comparison of robust regression methods in linear regression. *Int. J. Contemp. Math. Sci.* **6**(9), 409–421 (2011).
56. Begashaw, G. B. & Yohannes, Y. B. Review of outlier detection and identifying using robust regression model. *Int. J. Syst. Sci. Appl. Math.* **5**(1), 4–11. https://doi.org/10.11648/j.ijssam.20200501.12 (2020).
57. Rousseeuw, P. J. & Yohai, V. J. Robust regression by means of S-estimators. In *Robust and Nonlinear Time Series Analysis*, 256–274 (New York, 1984).
58. Susanti, Y., Pratiwi, H., Sulistijowati, H. & Liana, T. M estimation, S estimation, and MM estimation in robust regression.

*Int. J. Pure Appl. Math.***91**(3), 349–360. https://doi.org/10.12732/ijpam.v91i3.7 (2014).Salibian-Barrera, M. & Yohai, V. J. A fast algorithm for S-regression estimates.

*J. Comput. Gr. Stat.***15**(2), 414–427. https://doi.org/10.1198/106186006X113629 (2006).Chen, C., & Morgan, J. P. Robust regression and outlier detection with the ROBUSTREG. In

*Paper 265–27 Robust regression and outlier detection with Proceedings of the Twenty-Seventh Annual SAS Users Group International Conference*(2002).Kim, S. & Kim, H. A new metric of absolute percentage error for intermittent demand forecasts.

*Int. J. Forecast.***32**(3), 669–679. https://doi.org/10.1016/J.IJFORECAST.2015.12.003 (2016).Chicco, D., Warrens, M. J. & Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation.

*PeerJ Comput. Sci.***7**, 1–24. https://doi.org/10.7717/PEERJ-CS.623 (2021).Gouda, S. G., Hussein, Z., Luo, S. & Yuan, Q. Model selection for accurate daily global solar radiation prediction in China.

*J. Clean. Prod.***221**, 132–144. https://doi.org/10.1016/j.jclepro.2019.02.211 (2019).Ibidoja, O. J., Ajare, E. O. & Jolayemi, E. T. Reliability measures of academic performance.

*IJSGS***2**(4), 59–64 (2016).Kumar, S., Attri, S. D. & Singh, K. K. Comparison of Lasso and stepwise regression technique for wheat yield prediction.

*J. Agrometeorol.***21**(2), 188 (2019).Hammami, D., Lee, T. S., Ouarda, T. B. M. J. & Le, J. Predictor selection for downscaling GCM data with LASSO.

*J. Geophys. Res. Atmos.*https://doi.org/10.1029/2012JD017864 (2012).Liu, Y., Chen, W., Arendt, P. & Huang, H. Z. Toward a better understanding of model validation metrics.

*J. Mech. Des. Trans. ASME*https://doi.org/10.1115/1.4004223 (2011).Al-Dabbagh, Z. T. & Algamal, Z. Y. A robust quantitative structure–activity relationship modelling of influenza neuraminidase a/PR/8/34 (H1N1) inhibitors based on the rank-bridge estimator.

*SAR QSAR Environ. Res.***30**(6), 417–428. https://doi.org/10.1080/1062936X.2019.1613261 (2019).Al-Dabbagh, Z. T. & Algamal, Z. Y. Least absolute deviation estimator-bridge variable selection and estimation for quantitative structure–activity relationship model.

*J. Chemom.*https://doi.org/10.1002/cem.3139 (2019).Dawoud, I. & Abonazel, M. R. Robust Dawoud-Kibria estimator for handling multicollinearity and outliers in the linear regression model.

*J. Stat. Comput. Simul.***91**(17), 3678–3692. https://doi.org/10.1080/00949655.2021.1945063 (2021).Susanti, Y. & Pratiwi, D. Modeling of Soybean production in Indonesia using robust regression.

*Bionatura***14**(2), 148–155 (2012).Sumari, A. D. W., Charlinawati, D. S., & Ariyanto, Y. A simple approach using statistical-based machine learning to predict the weapon system operational readiness. In

*The 1st International Conference on Data Science and Official Statistics*343–351 (2021).Ibidoja, O. J., Shan, F. P., Suheri, M. E., Sulaiman, J. & Ali, M. K. M. Intelligence system via machine learning algorithms in detecting the moisture content removal parameters of seaweed big data.

*Pertanika J. Sci. Technol.***31**(6), 2783–2803. https://doi.org/10.47836/pjst.31.6.09 (2023).Jierula, A., Wang, S., Oh, T. M. & Wang, P. Study on accuracy metrics for evaluating the predictions of damage locations in deep piles using artificial neural networks with acoustic emission data.

*Appl. Sci.***11**(5), 1–21. https://doi.org/10.3390/app11052314 (2021).Lu, H. & Ma, X. Hybrid decision tree-based machine learning models for short-term water quality prediction.

*Chemosphere*https://doi.org/10.1016/j.chemosphere.2020.126169 (2020).

## Acknowledgements

The authors are grateful to the Ministry of Higher Education Malaysia for supporting this research through the Fundamental Research Grant Scheme (Project Code: FRGS/1/2022/STG06/USM/02/13).

## Author information

### Authors and Affiliations

### Contributions

O.J.I.: Conceptualization; data curation; formal analysis; methodology; writing—original draft. F.P.S.: Supervision; writing—review & editing. M.K.M.A.: Data curation; funding acquisition; supervision; writing—review & editing.

### Corresponding authors

## Ethics declarations

### Competing interests

The authors declare no competing interests.

## Additional information

### Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Supplementary Information

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

## About this article

### Cite this article

Ibidoja, O.J., Shan, F.P. & Ali, M.K.M. Modified sparse regression to solve heterogeneity and hybrid models for increasing the prediction accuracy of seaweed big data with outliers.
*Sci Rep* **14**, 17599 (2024). https://doi.org/10.1038/s41598-024-60612-7


