Introduction

A variety of models are used in statistical modeling. Often the focus is to identify the single best model, one which describes the data well while being parsimonious. The model selection procedure involves fitting a set of competing models and then selecting the best one by comparing their goodness-of-fit statistics, their prediction loss, or both. Several studies have noted that standard model selection procedures rely on maximum likelihood-type or least squares approaches1,2,3,4,5,6 and are therefore susceptible to outlying observations in the data. Robust model selection methods aim to work well when some of the observations are outliers and/or the error distribution is not normal. To cope with these problems, several robust model selection procedures have been proposed in the literature. Some are based on robust modifications of well-known standard criteria, such as the Akaike information criterion or Mallows’ \(C_{p}\) criterion, while others rely on resampling techniques such as the bootstrap or cross-validation7,8,9,10,11,12,13,14,15,16,17,18. The main objective of this work is to propose a modified version of the criterion of19 for model selection in the presence of outliers.

Suppose that we have a column vector of n responses \(y = (y_{1}, y_{2}, \ldots, y_{n})^{T}\), and let \(X\) be an \(n \times p\) design matrix. Let \(\alpha\) denote any subset of size \(p_{\alpha}\) of \(\{1, 2, \ldots, p\}\), and let \(X_{\alpha}\) be the corresponding \(n \times p_{\alpha}\) matrix. Let \(x_{\alpha i}^{T}\) denote the \(i\)th row vector of the design matrix \(X_{\alpha}\). Then the linear regression model corresponding to model \(\alpha\) is given by

$$y_{i} = x_{\alpha i}^{T} \beta_{\alpha} + \varepsilon_{\alpha i}, \quad i = 1, 2, \ldots, n$$
(1)

where \(X_{\alpha}\) and \(\varepsilon_{\alpha} = (\varepsilon_{\alpha 1}, \varepsilon_{\alpha 2}, \ldots, \varepsilon_{\alpha n})^{T}\) are independent, the errors \(\varepsilon_{\alpha i}\) are assumed to have location 0 and scale 1, and \(\beta_{\alpha}\) is an unknown \(p_{\alpha}\)-vector of regression coefficients. Let \(\mathrm{A}\) represent a collection of candidate models. The interest here is to select a model \(\alpha\) from \(\mathrm{A}\) based on specified properties of the corresponding fit. To fit the linear regression model, the MM-estimator of20 is adopted, which combines excellent robustness properties with high efficiency in the absence of outliers. Model selection generally involves three steps: specifying an estimator, fitting each candidate model with that estimator, and finally comparing the fitted models. The approach can also be applied with various other estimators, such as the LS-estimator or the M-estimator. The models are indexed by \(\alpha \in \mathrm{A}\), and \(\beta_{\alpha}\) is estimated by the estimator \(\hat{\beta}_{\alpha}\).

The following two minimal requirements for a good model are discussed by19:

  (i) it has the capability to fit the sample data y and X reasonably well, and

  (ii) it has the ability to predict future observations with great accuracy.

The ability of a model to fit the sample data y and X is measured by a penalized loss function, and the expected prediction loss is used to measure its ability to predict future observations. It has been noted in the literature that bootstrapping a robust estimator encounters difficulties in the presence of outliers. For robust regression, an m-out-of-n paired bootstrap approach was proposed by12. Their findings revealed that applying the bootstrap procedure directly to a data set containing outliers generally fails for two reasons: (1) the use of \(\rho (x) = x^{2}\), which is not robust against outliers, and (2) the bootstrap samples may contain a higher proportion of outliers than the original data set. Müller and Welsh19 addressed both of these issues by using a stratified bootstrap with an appropriate choice of \(\rho (\cdot)\). Their approach ensures that the bootstrap samples resemble the sample data: bootstrap samples are constructed so that the distribution of the residuals in each bootstrap sample reflects roughly the same residual distribution observed in the original data. This strategy appears to solve the issue well in practice.

Our objective in this paper is to pursue the investigation in19 and make some refinements by utilizing the concept of the out-of-bag bootstrap to develop a robust model selection criterion that deals with outliers and heavy-tailed error distributions. The out-of-bag (OOB) observations are those which are not part of the bootstrap sample. These OOB observations can be used for estimating the prediction error, yielding the so-called OOB error. This type of error is often claimed to be an unbiased estimator of the true error rate21,22.
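To fix the OOB idea, the sketch below draws a single bootstrap resample, fits a model on it, and evaluates the squared prediction error on the observations that were never drawn. The simulated data, the LS fit, and the variable names are our own illustrative choices, not part of the proposed procedure.

    set.seed(1)
    n <- 40
    x <- runif(n, -1, 1)
    y <- 2 + 2 * x + rnorm(n)

    boot_idx <- sample(seq_len(n), size = n, replace = TRUE)   # bootstrap resample
    oob_idx  <- setdiff(seq_len(n), boot_idx)                  # out-of-bag observations (never drawn)

    fit  <- lm(y[boot_idx] ~ x[boot_idx])                      # fit on the bootstrap sample
    pred <- cbind(1, x[oob_idx]) %*% coef(fit)                 # predict the OOB observations
    oob_error <- mean((y[oob_idx] - pred)^2)                   # OOB estimate of the prediction error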

The rest of the paper is organized as follows: We discuss the existing robust model selection criteria in “Robust model selection criteria” section. Section “The proposed robust model selection criterion” describes the proposed robust model selection criterion. We show the performance of our modified robust criterion via simulation studies in “Simulation studies” section. We present a data example in “Data example (Stack loss data)” section and conclude with a short discussion in “Conclusion” section.

Robust model selection criteria

In this section, we discuss the existing robust model selection criteria based on robust expected prediction loss. Consider a vector of n responses \(y = (y_{1}, y_{2}, \ldots, y_{n})^{T}\) and the design matrix \(X = (x_{1}, x_{2}, \ldots, x_{n})^{T}\). The conditional expected prediction loss of a model \(\alpha\) for a given non-negative loss function \(\rho(\cdot)\) is calculated by

$$M^{PE}(\alpha) = \frac{\sigma^{2}}{n}\, E\left[ \sum_{i=1}^{n} \rho\left\{ \frac{z_{i} - x_{\alpha i}^{T} \hat{\beta}_{\alpha}}{\sigma} \right\} \;\middle|\; y, X \right]$$
(2)

where \(\hat{\beta}_{\alpha}\) is the estimator of \(\beta_{\alpha}\), \(z = (z_{1}, z_{2}, \ldots, z_{n})^{T}\) is a vector of future responses at X, independent of y, and \(\sigma\) is a measure of spread for the given data. This type of prediction loss was first introduced by5 as a model selection criterion, using the loss function \(\rho (x) = \frac{x^{2}}{2}\) in least squares regression.

To select a model \(\alpha\) from a set \({\rm A}\),19 proposed the following criterion function

$$M_{n}(\alpha) = \frac{\sigma^{2}}{n}\left[ \sum_{i=1}^{n} \rho\left\{ \frac{y_{i} - x_{\alpha i}^{T} \hat{\beta}_{\alpha}}{\sigma} \right\} + \delta(n)\,p_{\alpha} + E\left( \sum_{i=1}^{n} \rho\left\{ \frac{z_{i} - x_{\alpha i}^{T} \hat{\beta}_{\alpha}}{\sigma} \right\} \;\middle|\; y, X \right) \right]$$
(3)

Following5, Müller and Welsh19 estimated the unknown distribution of the data using an m-out-of-n stratified bootstrap procedure, whereas the penalized in-sample term in (3) is estimated directly. The estimated selection criterion functions are given by

$$M_{m,n}^{PE}(\alpha) = \frac{\hat{\sigma}^{2}}{n}\, E_{*}\left[ \sum_{i=1}^{n} \rho\left\{ \frac{y_{i} - x_{\alpha i}^{T} \hat{\beta}_{\alpha,m}^{*}}{\hat{\sigma}} \right\} \right]$$
(4)
$$M_{m,n}^{PPE}(\alpha) = \frac{\hat{\sigma}^{2}}{n}\left[ \sum_{i=1}^{n} \rho\left\{ \frac{y_{i} - x_{\alpha i}^{T} \hat{\beta}_{\alpha}}{\hat{\sigma}} \right\} + \delta(n)\,p_{\alpha} + E_{*} \sum_{i=1}^{n} \rho\left\{ \frac{y_{i} - x_{\alpha i}^{T} \hat{\beta}_{\alpha,m}^{*}}{\hat{\sigma}} \right\} \right]$$
(5)

where \(\hat{\beta}_{\alpha,m}^{*}\) is the bootstrap estimate of \(\hat{\beta}_{\alpha}\), \(E_{*}\) denotes expectation with respect to the bootstrap distribution, and m is the number of distinct observations in the bootstrap sample, which satisfies the conditions

$$m \to \infty \;{\text{and}}\;\frac{m}{\sqrt n } \to 0\;{\text{as}}\;n \to \infty .$$

The criterion function in (4) was modified by18 using the following steps:

  (i) calculate and order the residuals,

  (ii) set the number of strata S between 3 and 8, depending on the sample size n,

  (iii) set the stratum boundaries of the residuals,

  (iv) allocate observations to strata so that observations in the extreme tails are placed in the lower or upper tail strata and the remaining observations fill the other strata,

  (v) sample rows of (y, X) independently with replacement from each stratum so that the total bootstrap sample is of size \(m\,( \le n)\),

  (vi) construct the estimator \(\hat{\beta}_{\alpha,m}^{*}\) from the data obtained in step (v),

  (vii) calculate the criterion function \(M_{m,n}^{PE*}(\alpha)\) from the remaining n − m observations, i.e., the m observations used to obtain \(\hat{\beta}_{\alpha,m}^{*}\) are not included when calculating \(M_{m,n}^{PE*}(\alpha)\),

  (viii) repeat steps (vi) and (vii) K independent times and then estimate the modified robust expected prediction loss by

$$M_{m,n}^{PE*}(\alpha) = \frac{\hat{\sigma}^{2}}{n}\left[ E_{*} \sum_{i=1}^{n-m} \rho\left\{ \frac{y_{i[-m]} - x_{\alpha i[-m]}^{T} \hat{\beta}_{\alpha,m}^{*}}{\hat{\sigma}} \right\} \right]$$
(6)

where \(\hat{\beta}_{\alpha,m}^{*}\) is the bootstrap estimate of \(\hat{\beta}_{\alpha}\), \(E_{*}\) denotes expectation with respect to the bootstrap distribution, m is the number of distinct observations in the bootstrap sample used to obtain \(\hat{\beta}_{\alpha,m}^{*}\), and [-m] indicates that these m observations are excluded when calculating \(M_{m,n}^{PE*}(\alpha)\); a sketch of the stratified bootstrap construction in steps (i)–(v) is given after Eqs. (7)–(9) below. The focus is on the model \(\alpha \in \mathrm{A}\) that minimizes \(M_{m,n}^{PE}(\alpha)\), \(M_{m,n}^{PPE}(\alpha)\) or \(M_{m,n}^{PE*}(\alpha)\), i.e.

$$\overline{\alpha}_{m,n} = \mathop{\arg\min}\limits_{\alpha \in \mathrm{A}}\, M_{m,n}^{PE}(\alpha)$$
(7)

$$\tilde{\alpha}_{m,n} = \mathop{\arg\min}\limits_{\alpha \in \mathrm{A}}\, M_{m,n}^{PPE}(\alpha)$$
(8)

$$\hat{\alpha}_{m,n} = \mathop{\arg\min}\limits_{\alpha \in \mathrm{A}}\, M_{m,n}^{PE*}(\alpha)$$
(9)
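To make steps (i)–(v) concrete, the following sketch illustrates one way a stratified m-out-of-n bootstrap sample could be drawn in R: residuals from a fitted model are ranked, cut into S roughly equal-sized strata, and rows are resampled with replacement within each stratum in proportion to stratum size. This is a minimal illustration under our own assumptions (LS residuals for the stratification, S = 4, proportional allocation), not the authors' implementation.

    stratified_boot_sample <- function(y, X, m, S = 4) {
      n <- length(y)
      res <- residuals(lm(y ~ X - 1))                        # residuals of the fitted model (LS here, for simplicity)
      strata <- cut(rank(res, ties.method = "first"),
                    breaks = S, labels = FALSE)               # S roughly equal-sized strata of ordered residuals
      sizes <- round(m * tabulate(strata, nbins = S) / n)     # draws per stratum, proportional to stratum size
      idx <- unlist(lapply(seq_len(S), function(s) {
        pool <- which(strata == s)
        pool[sample.int(length(pool), sizes[s], replace = TRUE)]
      }))
      idx                                                     # row indices of the (approximately) m-out-of-n sample
    }

The returned indices define the bootstrap sample used to compute \(\hat{\beta}_{\alpha,m}^{*}\); the remaining rows are the out-of-bag observations used in (6).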

The proposed robust model selection criterion

In this section, we propose a robust model selection procedure based on two components: a robust penalized loss function and a modified robust expected prediction loss.

We estimate the penalized in-sample term in the criterion function by

$$M_{n}^{P}(\alpha) = \frac{\hat{\sigma}^{2}}{n}\left[ \sum_{i=1}^{n} \rho\left\{ \frac{y_{i} - x_{\alpha i}^{T} \hat{\beta}_{\alpha}}{\hat{\sigma}} \right\} + \delta(n)\,p_{\alpha} \right]$$
(10)

where \(\delta(n)\) denotes a function of the sample size n. The two restrictions on \(\delta(n)\) are that \(\delta(n) \to \infty\) and \(\frac{\delta(n)}{n} \to 0\) as \(n \to \infty\). These restrictions are imposed to penalize complexity, expressing a preference for smaller and simpler models, and are satisfied by the choice \(\delta(n) = \log(n)\). We combine (6) and (10) to estimate the robust criterion function by

$$M_{m,n}^{PPE*}(\alpha) = \frac{\hat{\sigma}^{2}}{n}\left[ \sum_{i=1}^{n} \rho\left\{ \frac{y_{i} - x_{\alpha i}^{T} \hat{\beta}_{\alpha}}{\hat{\sigma}} \right\} + \delta(n)\,p_{\alpha} + E_{*} \sum_{i=1}^{n-m} \rho\left\{ \frac{y_{i[-m]} - x_{\alpha i[-m]}^{T} \hat{\beta}_{\alpha,m}^{*}}{\hat{\sigma}} \right\} \right]$$
(11)

where \(\hat{\beta}_{\alpha,m}^{*}\) is the bootstrap estimate of \(\hat{\beta}_{\alpha}\), \(E_{*}\) denotes expectation with respect to the bootstrap distribution, and m is the number of distinct observations in the bootstrap sample. An important practical issue is how large the number of bootstrap replications K should be in our proposed criterion. There is no hard and fast rule for the number of bootstrap replications; however, for estimating a standard error it is usually in the range of 25–25023. The first term in the criterion function (11) measures how well the model fits the observed sample data y and X; the second term penalizes complexity (i.e., expresses a preference for smaller models); and the last term measures the ability to predict future observations. To use (11), we have to specify \(\rho(\cdot)\) and \(\sigma\). The robustness viewpoint is adopted for the purpose of fitting and predicting the core of the data, rather than fitting and predicting the tails containing atypical observations, so a bounded \(\rho\) function is selected. Here, trimming is preferred, so that \(\rho(x)\) is constant for sufficiently large |x|.
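As an illustration of how (11) could be evaluated, the sketch below combines the penalized in-sample term with an out-of-bag estimate of the prediction term, averaging over K bootstrap replications. It reuses the hypothetical stratified_boot_sample() helper sketched above; the LS refit inside the loop and the function names are our own simplifications (the paper itself uses the MM-estimator), and sigma_hat is the MAD-based scale estimate defined later in this section.

    criterion_PPE_star <- function(y, X_alpha, m, K = 50, sigma_hat,
                                   rho = function(x, b = 2) pmin(x^2, b^2),
                                   delta_n = log(length(y))) {
      n <- length(y)
      p_alpha <- ncol(X_alpha)
      beta_hat <- coef(lm(y ~ X_alpha - 1))                 # fit of model alpha on the full sample
      in_sample <- sum(rho((y - X_alpha %*% beta_hat) / sigma_hat))

      oob_term <- mean(sapply(seq_len(K), function(k) {
        idx <- stratified_boot_sample(y, X_alpha, m)        # in-bag rows of the stratified bootstrap sample
        oob <- setdiff(seq_len(n), idx)                     # out-of-bag rows, excluded from the fit
        b_star <- coef(lm(y[idx] ~ X_alpha[idx, , drop = FALSE] - 1))
        sum(rho((y[oob] - X_alpha[oob, , drop = FALSE] %*% b_star) / sigma_hat))
      }))

      (sigma_hat^2 / n) * (in_sample + delta_n * p_alpha + oob_term)
    }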

As in11,14,18,19, the simplest \(\rho\) function is given by

$$\rho(x) = \min(x^{2}, b^{2})$$
(12)

which is quadratic near the origin and constant away from it. As in19, we use b = 2. To measure the spread \(\sigma\), we use the full model \(\alpha_{f}\), because a large model can produce a valid measure of residual spread. For simplicity, we measure \(\sigma\) by the median absolute deviation (MAD) from the median multiplied by 1.483, given by

$$\hat{\sigma} = 1.483\, \mathop{\mathrm{med}}\limits_{1 \le i \le n} \left| e_{i} - \mathop{\mathrm{med}}\limits_{1 \le j \le n} (e_{j}) \right|$$

where \(e_{i} = y_{i} - x_{\alpha_{f} i}^{T} \hat{\beta}_{\alpha_{f}}\), \(e_{j} = y_{j} - x_{\alpha_{f} j}^{T} \hat{\beta}_{\alpha_{f}}\), and \(\hat{\beta}_{\alpha_{f}}\) is the estimator of \(\beta_{\alpha_{f}}\) for the full model.
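For reference, the bounded loss in (12) and the MAD-based scale estimate translate directly into R; the residuals e are assumed to come from a full-model fit, here illustrated with rlm() from the MASS package (which the paper uses for MM-estimation) and placeholder variable names y, X1, ..., X4.

    library(MASS)

    rho <- function(x, b = 2) pmin(x^2, b^2)                 # bounded loss of Eq. (12) with b = 2

    full_fit <- rlm(y ~ X1 + X2 + X3 + X4, method = "MM")    # full-model MM-fit (placeholder variables)
    e <- residuals(full_fit)

    sigma_hat <- 1.483 * median(abs(e - median(e)))          # MAD from the median, scaled by 1.483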

Among the models being considered, we select a model \(\alpha \in \mathrm{A}\) that minimizes \(M_{m,n}^{PPE*}(\alpha)\), i.e.

$$\hat{\alpha}_{m,n}^{*} = \mathop{\arg\min}\limits_{\alpha \in \mathrm{A}}\, M_{m,n}^{PPE*}(\alpha)$$
(13)

The optimal m depends on the true model. As in14,19, one should use n/4 ≤ m ≤ n/2 for moderate n (50 ≤ n ≤ 200). If n is small, m is small and the parameter estimators do not converge for some bootstrap samples; if n is large, m may be smaller than a quarter of n. The number of strata S is chosen between 3 and 8, depending on the sample size n24.
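To make the selection rule (13) concrete, the sketch below enumerates all candidate subsets that contain the intercept, evaluates the hypothetical criterion_PPE_star() helper sketched above on each, and returns the minimizer. It is illustrative only; the helper names and the exhaustive enumeration are our own choices.

    select_model <- function(y, X, m, K = 50, sigma_hat) {
      p <- ncol(X)                                      # column 1 of X is the intercept
      slope_sets <- unlist(lapply(1:(p - 1), function(k)
        combn(2:p, k, simplify = FALSE)), recursive = FALSE)
      candidates <- c(list(1L), lapply(slope_sets, function(s) c(1L, s)))  # always keep the intercept
      crit <- sapply(candidates, function(cols)
        criterion_PPE_star(y, X[, cols, drop = FALSE], m, K, sigma_hat))
      candidates[[which.min(crit)]]                     # columns of the selected model
    }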

The penalized loss function in the proposed criterion, given in (10), resembles the robust version of AIC proposed by25,26; the main difference lies in the \(\rho\) function and the estimator used in our criterion. The penalized in-sample term in (11) is similar to the robust version of3. Furthermore, for \(\rho(x) = x^{2}\), the penalized in-sample term reduces to the criterion of3.

Simulation studies

To assess and compare the finite-sample performance of our proposed method with existing model selection methods, we carried out two simulation studies: one for contamination-free data sets (simulation setting 1) and the other for contaminated data sets (simulation setting 2).

Simulation setting 1

The finite-sample performance of our proposed criterion is compared with that of existing model selection procedures using a real data set and a simulated data set.

The Gunst and Mason data

To compare the finite-sample performance of our proposed method with existing model selection methods on a real data set, we use the following regression model

$$Y_{i} = \beta_{0} X_{i0} + \beta_{1} X_{i1} + \beta_{2} X_{i2} + \beta_{3} X_{i3} + \beta_{4} X_{i4} + u_{i}, \quad i = 1, 2, \ldots, 40$$

where the \(u_{i}\) are iid standard normal errors, \(X_{0}\) is a column of 1’s, and the values of \(X_{1}, X_{2}, X_{3}\) and \(X_{4}\) are taken from the solid waste data of27, as in5,12,13,18,19. We compare the estimator \(\hat{\alpha}^{*}_{m,n}\) [expressed in (13)] with \(\overline{\alpha}_{m,n}\) [expressed in (7)], \(\tilde{\alpha}_{m,n}\) [expressed in (8)], \(\hat{\alpha}_{m,n}\) [expressed in (9)] and the robust BIC \(\breve{\alpha}_{n}\) [expressed in (14)],

$$\breve{\alpha}_{n} = \mathop{\arg\min}\limits_{\alpha \in \mathrm{A}}\, M_{n}^{P}(\alpha)$$
(14)

In the zero-contamination case, the least squares estimator is used to fit the regression models. The penalty term \(\delta(n) = \log(n)\) is used in all simulations. The estimated selection probabilities for \(\hat{\alpha}_{m,n}^{*}\), \(\overline{\alpha}_{m,n}\), \(\tilde{\alpha}_{m,n}\) and \(\breve{\alpha}_{n}\) based on the LS estimator and \(\rho(x) = x^{2}\) are reported in Table 1, whereas the estimated selection probabilities based on the LS estimator and \(\rho(x) = \min(x^{2}, b^{2})\) are given in Table 2. The results in Tables 1 and 2 are based on L = 1000 simulations and K = 100 bootstrap samples for m = 15, 20, 25.

Table 1 Estimated selection probabilities of \(\overline{\alpha}_{m,n}\), \(\tilde{\alpha}_{m,n}\), \(\hat{\alpha}_{m,n}\) and \(\hat{\alpha}_{m,n}^{*}\) based on the least squares estimator and \(\rho(x) = x^{2}\).
Table 2 Estimated selection probabilities of \(\overline{\alpha}_{m,n}\), \(\tilde{\alpha}_{m,n}\), \(\hat{\alpha}_{m,n}\) and \(\hat{\alpha}_{m,n}^{*}\) based on the least squares estimator and \(\rho(x) = \min(x^{2}, b^{2})\).

The simulation results presented in Tables 1 and 2 are summarized as follows:

  • The performance of the modified model selection procedure using the least squares estimator is comparable to that of the existing methods \(\overline{\alpha}_{m,n}\), \(\tilde{\alpha}_{m,n}\), \(\hat{\alpha}_{m,n}\) and the robust BIC (\(\breve{\alpha}_{n}\)).

  • The proposed selection criterion outperforms the existing procedures in both cases, i.e., whether the squared loss function \(\rho (x) = x^{2}\) or the robust loss function \(\rho (x) = \min (x^{2} ,b^{2} )\) is used.

  • For the full model, as the bootstrap sample size m increases, the estimated selection probabilities also increase. For example, for m = 15 the percentage of correct selections is 93.9%, whereas for m = 25 it is 99.1%.

  • Moreover, model selection based on the robust loss function is superior to that based on the squared loss function. For instance, when the optimal model contains all the predictors, the modified procedure \(\hat{\alpha}_{15,40}^{*}\) with the squared loss function selects the optimal model 93.9% of the time, which is less than the 99.2% obtained with the robust loss function.

  • Furthermore, the modified selection criterion \(\hat{\alpha}_{m,n}^{*}\) is less dependent on the bootstrap sample size m than the existing criteria \(\overline{\alpha}_{m,n}\) and \(\tilde{\alpha}_{m,n}\).

Simulated data and model selection consistency

To show model selection consistency and performance of the proposed criterion on simulated data, the following regression model with p = 5 is considered.

$$y_{i} = x_{i}^{T} \beta + \varepsilon_{i} ,\quad i = 1,2,...,n$$
(15)

where \(\varepsilon_{i}\) is generated from the standard normal distribution, the regressor variables are generated from \({\rm N}(0,1)\), and an intercept column of 1’s is added to produce the design matrix \(X\). The response variable \(y_{i}\) is then generated from Eq. (15); a small sketch of this setup is given after the list of true models below.

The true data generating models are:

  • \(\beta_{1} = (1,0,0,1,0)\), i.e. the model had one nonzero variable,

  • \(\beta_{2} = (1,0,0,1,1)\), i.e. the model had two nonzero variables, and

  • \(\beta_{3} = (1,1,0,1,1)\), i.e. the model had three nonzero variables.
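A minimal sketch of this data-generating step, using the second true model \(\beta_{2}\) as an example, is given below; the seed and the sample size n = 80 are arbitrary choices for illustration.

    set.seed(123)
    n <- 80
    beta <- c(1, 0, 0, 1, 1)                       # true coefficients, intercept first (model beta_2)
    X <- cbind(1, matrix(rnorm(n * 4), n, 4))      # intercept column plus four N(0,1) regressors
    y <- as.vector(X %*% beta + rnorm(n))          # responses from Eq. (15) with standard normal errors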

The estimated selection probabilities for \(\hat{\alpha}_{m,n}^{*}\), \(\tilde{\alpha}_{m,n}\), \(\hat{\alpha}_{m,n}\) and \(\overline{\alpha}_{m,n}\) are calculated for m = 24 and n = 40, 80, 120, 160, based on L = 1000 simulations with K = 50 bootstrap replications, and are tabulated in Table 3.

Table 3 Selection probabilities of \(\hat{\alpha}_{m,n}^{*}\), \(\tilde{\alpha}_{m,n}\), \(\hat{\alpha}_{m,n}\) and \(\overline{\alpha}_{m,n}\) based on the LS-estimator and \(\rho (x) = \min (x^{2} ,b^{2} )\).

From the simulation results presented in Table 3, we see that our proposed criterion is a comparatively consistent procedure for model selection in linear regression problems.

Simulation setting 2

Simulated data from uniform distribution

In this subsection, the finite-sample performance of our modified criterion is compared with that of existing model selection procedures in the presence of outliers. The sample data are generated from the following model

$$y_{i} = 2 + 2x_{i1} + 0x_{i2} + \varepsilon_{i}, \quad i = 1, 2, \ldots, 64$$

where the design matrix X has columns generated as uniform on [− 1, 1]. The following six error distributions are considered (a sketch of these generators is given after the list):

  (i) \(\varepsilon_{1}\) has [3/8] outliers (i.e., [5/8] of the errors come from a standard normal and [3/8] from a normal with \(\mu = 30 - 2 - 2x_{1}\) and \(\sigma = 1\));

  (ii) \(\varepsilon_{2}\) has [1/4] outliers (i.e., [3/4] from a standard normal and [1/4] from a normal with \(\mu = 30 - 2 - 2x_{1}\) and \(\sigma = 1\));

  (iii) \(\varepsilon_{3}\) has [1/8] outliers (i.e., [7/8] from a standard normal and [1/8] from a normal with \(\mu = 30 - 2 - 2x_{1}\) and \(\sigma = 1\));

  (iv) \(\varepsilon_{4}\) is Gaussian with \(\mu = 0\) and \(\sigma = 1\);

  (v) \(\varepsilon_{5}\) is Cauchy;

  (vi) \(\varepsilon_{6}\) is slash (i.e., \(\varepsilon_{6} \sim Z/U\), where \(Z \sim N(0,1)\) and \(U \sim U(0,1)\)).
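The sketch below shows one way such error vectors could be generated in R, using the contaminated-normal case \(\varepsilon_{1}\) and the slash case \(\varepsilon_{6}\) as examples; the rounding of 3n/8 to an integer and the random placement of the outlying errors are our own choices.

    set.seed(7)
    n  <- 64
    x1 <- runif(n, -1, 1)

    # (i) contaminated normal: 3/8 of the errors shifted to mean 30 - 2 - 2*x1
    n_out <- round(3 * n / 8)
    out   <- sample(seq_len(n), n_out)             # positions of the outlying errors
    eps1  <- rnorm(n)                              # start from standard normal errors
    eps1[out] <- rnorm(n_out, mean = 30 - 2 - 2 * x1[out], sd = 1)

    # (vi) slash errors: standard normal divided by an independent uniform
    eps6 <- rnorm(n) / runif(n)

    y1 <- 2 + 2 * x1 + eps1                        # responses under the contaminated errors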

In Table 4, the following possible models are considered:

  • Model (1) means a model with intercept only;

  • Model (1, 2) means a model having intercept and X1;

  • Model (1, 3) means a model having intercept and X2;

  • Model (1, 2, 3) means the full model.

Table 4 Estimated selection probabilities of \(\overline{\alpha }_{24,64}\), \(\hat{\alpha }_{24,64}\), \(\tilde{\alpha }_{24,64}\) and \(\hat{\alpha }_{24,64}^{*}\) based on MM-estimator and LS-estimator.

Following19, the MM-estimator of20 is used to fit the robust regression models. For this purpose, the rlm() function in R is used to estimate the regression parameters. Furthermore, the LS estimates are computed for comparison with the MM-estimates. As noted by28, when the proportion of extreme observations in some of the bootstrap samples is higher than in the original sample, the bootstrap distribution may provide a very poor estimate of the distribution of the MM-estimates. To deal with this numerical instability, we use the stratified bootstrap with equal-sized strata, in which bootstrap samples are constructed so that the distribution of the residuals in each bootstrap sample reflects the one observed in the original data set. The selection probabilities based on L = 1000 simulations with K = 100 bootstrap replications are given in Table 4.
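For reference, calls of the kind used to obtain the MM- and LS-estimates might look as follows; the data-frame name sim_data and the variable names are placeholders.

    library(MASS)

    fit_mm <- rlm(y ~ x1, data = sim_data, method = "MM")   # MM-estimation for the model with intercept and x1
    fit_ls <- lm(y ~ x1, data = sim_data)                   # least squares fit of the same model, for comparison
    coef(fit_mm)
    coef(fit_ls)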

From the simulation results presented in Table 4, it is clear that the modified selection procedure using the robust \(\rho(\cdot)\) function and the MM-estimator is robust in the presence of highly contaminated data. For example, under the contaminated-normal situation \(\varepsilon_{1}\), the percentage of correct selections is 73.8% for the un-stratified bootstrap, whereas it is 99.7% for the stratified bootstrap. For all error distributions, the modified robust criterion outperforms the existing criteria. The simulation results suggest that when the errors are non-normal, robust regression is superior to LS, but for normal errors robust regression is inferior to LS. Furthermore, in the presence of outliers and heavy-tailed error distributions, the modified robust criterion using the MM-estimator outperforms the LS-estimator by a large margin. For example, under the \(\varepsilon_{5}\) error distribution, the percentage correct is 96.9% for the MM-estimator, whereas it is 13.4% for the LS-estimator. These results demonstrate that the modified robust procedure has good robustness properties under contaminated normal and heavy-tailed distributions, whereas the LS procedure performs very poorly in both cases, which clearly illustrates the lack of robustness of the LS procedure in the presence of outliers and heavy-tailed distributions. As observed in this simulation study, a considerable improvement in the bootstrap model selection procedure is obtained by using the combined criterion.

Modified solid waste data of Gunst and Mason

To evaluate the performance of our proposed robust model selection method, we modified the Gunst and Mason data by planting 10% and 20% outliers. The response vector is generated as

$$Y_{i} = \beta_{0} X_{i0} + \beta_{1} X_{i1} + \beta_{2} X_{i2} + \beta_{3} X_{i3} + \beta_{4} X_{i4} + \varepsilon_{i}, \quad i = 1, 2, \ldots, 40$$

where \(X_{0}\) is the column of 1’s and the values of \(X_{1}, X_{2}, X_{3}\) and \(X_{4}\) are taken from the solid waste data of26. To create high-leverage points, we replace the first four to eight values of each regressor variable by 20. The true generating model has two non-zero predictors, i.e., \(\beta^{T} = (2, 0, 0, 4, 8)\), and we choose the following five error distributions to represent various deviations from normality:

  (i) \(\varepsilon_{1}\) is 10% wild (i.e., 90% from a standard normal and 10% from a normal with \(\mu = 0\) and \(\sigma = 0.7\));

  (ii) \(\varepsilon_{2}\) is 20% wild (i.e., 80% from a standard normal and 20% from a normal with \(\mu = 0\) and \(\sigma = 5\));

  (iii) \(\varepsilon_{3}\) is t(3) (i.e., a t-distribution with 3 degrees of freedom);

  (iv) \(\varepsilon_{4}\) is standard normal;

  (v) \(\varepsilon_{5}\) is Cauchy with location = 0 and scale = 1.

The selection probabilities of \(\overline{\alpha}_{m,n}\), \(\hat{\alpha}_{m,n}\), \(\tilde{\alpha}_{m,n}\) and \(\hat{\alpha}_{m,n}^{*}\) based on the stratified bootstrap with the MM-estimator are computed. The selection probabilities based on L = 1000 simulations with K = 50 bootstrap replications are given in Table 5.

Table 5 Estimated selection probabilities of \(\overline{\alpha}_{m,n}\), \(\tilde{\alpha}_{m,n}\), \(\hat{\alpha}_{m,n}\) and \(\hat{\alpha}_{m,n}^{*}\) based on the MM-estimator.

Table 5 presents the simulation results with 10% and 20% of outliers in the covariates and the five error distributions discussed above. The performance of our robust procedure is very good for \(\varepsilon_{4}\) among all error distributions, while it does not perform very well for \(\varepsilon_{5}\) in the presence of x-outliers. The selection probabilities for error distribution \(\varepsilon_{3}\) are similar to those for \(\varepsilon_{4}\). Furthermore, the selection probabilities are better for distribution \(\varepsilon_{1}\) (10% symmetric wild case) than for contamination type \(\varepsilon_{2}\) (20% symmetric wild case). Overall, the selection probabilities of all criteria decrease as the percentage of both x- and y-outliers increases. Moreover, in the presence of response outliers and covariate outliers, the performance of our proposed model selection criterion based on MM-estimation remains comparable to the existing criteria even when the contamination level increases from 10% to 20%.

Data example (Stack loss data)

In this section, we analyze the stack loss data presented by29. This data set consists of three explanatory variables and contains four outliers, namely observations 1, 3, 4, and 21. The response is the stack loss (y) observed on n = 21 observations. The explanatory variables are the flow of cooling air (X1), the cooling water temperature (X2), and the concentration of acid (X3). We applied our robust method \(\hat{\alpha}_{m,n}^{*}\), the existing robust methods, and the traditional methods to the data. Table 6 presents a summary of the selected best models.
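The stack loss data are available in R as the built-in data set stackloss, so a full-model MM-fit of the kind underlying the robust analysis can be reproduced along the following lines; the criterion evaluation itself is sketched in the previous section.

    library(MASS)

    data(stackloss)                                        # built-in Brownlee stack loss data, n = 21
    fit_full <- rlm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                    data = stackloss, method = "MM")       # full-model MM-fit
    summary(fit_full)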

Table 6 shows that the classical methods select the full model, whereas the robust criteria agree on the importance of the two variables X1 and X2. The best model according to our criterion includes X1 and X2.

Table 6 Selected best model for the stack loss data using a range of model selection procedures.

Conclusion

In this article, we have presented a novel procedure for robust model selection in linear regression. The criterion is a modification of the bootstrap model selection method based on a robust estimator proposed by19. The simulation results reveal that the performance of model selection is improved by using the OOB error. Moreover, the undue influence of outliers is controlled by using both a robust MM-estimator and a bounded loss function in the proposed criterion. The proposed model selection criterion maintains its robustness properties in the presence of response outliers and covariate outliers. The proposed criterion is compared with other robust model selection criteria described in the literature.

We observed that, in the presence of outliers and heavy-tailed error distributions, the MM-estimator outperformed the least squares estimator by a large margin. This clearly demonstrates the lack of robustness of the least squares procedure in the presence of outliers and heavy-tailed distributions. Furthermore, when the errors are non-normal, robust regression is found to be superior to least squares, while for normal errors robust regression is inferior to least squares.

From the simulation-based and real-data results, we conclude that our modified robust model selection procedure is consistent and works well in situations where outliers are present in the data. As observed in our simulation study, a considerable improvement is gained by minimizing the combined criterion, rather than minimizing the penalized loss function or the modified conditional expected prediction loss separately. Furthermore, our robust model selection criterion performs better when the data generating model is small.