Abstract
This paper considers the estimation of the population total under a probability proportional to size (PPS) sampling scheme when complete data are not available because of missing observations or non-response. The suggested estimators are developed with the calibration approach, using multi-auxiliary information available for the response group and the non-response group. The variances of the suggested estimators are derived up to the first order of approximation. A simulation study on a real dataset, carried out in R, supports the performance of the suggested estimators. The empirical percentage absolute relative biases (%ARB) and percentage relative root mean squared errors (%RRMSE) are computed for the suggested estimators. The developed estimators are compared with the design-based Horvitz and Thompson (HT) estimator and the calibration HT-type estimator computed on the available complete-response units, as well as with the design-based Hansen and Hurwitz (HH) estimator in the presence of non-response.
Introduction
In sample surveys, the main point of interest is the estimation of an unknown parameter of the study variable, such as the population total, mean or proportion, for a finite population. An unbiased estimator of the population total under probability proportional to size (PPS) sampling from a finite population was first developed by Horvitz and Thompson1. Because of non-response in some parts of a sample, the surveyor is sometimes unable to obtain complete information from the sample. The responses of non-respondents may differ from those of respondents, which can alter inferences about the population parameter. Non-response is among the most frequent challenges faced in sample surveys. Even after several reminders or repeated efforts, it is difficult to eliminate non-response, and the resulting incomplete data may lead to biased estimation. Hansen and Hurwitz2 considered a subsampling approach to deal with the problem of non-response. The methodology involves (i) selection of a sample from the population, (ii) identification of the non-respondents in the sample, and (iii) selection of a subsample of non-respondents.
Following the subsampling method of Hansen and Hurwitz2, Cochran3 discussed the ratio and regression estimators of the population mean of the study variable. To address the issue of non-response, several researchers, including Rao4,5, Khare and Srivastava6,7, Tripathi and Khare8, Okafor and Lee9, Chhikara et al.10, Singh and Kumar11,12,13,14,15,16, Khare and Sinha17,18 and many more, have concentrated on the technique of subsampling non-respondents introduced by Hansen and Hurwitz2.
Deville and Särndal19 suggested a calibration estimation technique in which auxiliary information is used to find new weights corresponding to the design weights in the Horvitz and Thompson estimator. Moreover, the calibration approach is widely used to deal with the problem of non-response in the estimation of parameters when information on the auxiliary variable is known. Estevao and Särndal20 discussed domain estimation in one-phase sampling, and estimation in two-stage sampling with integrated weighting. Ozgul21 developed a calibration estimator using two auxiliary variables in stratified sampling. Several authors, such as Lundström and Särndal22, Kott23, Chang and Kott24,25, Raman et al.26 and Audu et al.27, developed various estimators of population parameters using the calibration approach based on the Hansen and Hurwitz2 technique.
In this paper, calibration estimators of the population total are developed by following the calibration approach of Deville and Särndal19 and the two-step calibration addressed by Singh and Sedory28. We consider two cases in which multi-auxiliary information is available on all sampling units in (i) the response and non-response groups and (ii) the response group, but only on a subsample in the non-response group. Furthermore, expressions for the variances of the proposed estimators, along with estimators of these variances, are derived. The performance of the developed estimators is validated by a simulation study in R.
Conceptual advancements
Suppose we have a finite population \(U\) of \(N\) units. Let the study variable and the auxiliary variable be denoted by \(Y\) and \(X\), respectively. A sample \(s\) of size \(n\) is drawn without replacement (WOR) from \(U\) according to the design \(P(\cdot)\), with first- and second-order inclusion probabilities \(\pi_{k} = \Pr (k \in s)\) and \(\pi_{jk} = \Pr (j,k \in s)\) for including the \(k\)th unit and the pair of \((j, k)\)th units (where \(j\) and \(k\) are not equal and both belong to \(U\)), respectively, in the sample. Now, let us define \(\Delta_{jk} = \pi_{jk} - \pi_{j} \pi_{k};\; j \ne k \in U\).
The Horvitz and Thompson1 estimator for the population total \(Y = \sum_{k \in U} y_{k}\) in the case of complete response is given as:
\[\hat{Y}_{HT} = \sum_{k \in s} d_{k} y_{k} \qquad (1)\]
where \(d_{k} = 1/\pi_{k}\) is the design weight of the kth unit.
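As a minimal sketch of the HT estimator described above, the Python snippet below sums the sampled values weighted by \(d_{k} = 1/\pi_{k}\). The helper `pps_inclusion_probabilities`, which sets \(\pi_{k} = n z_{k} / \sum_{U} z_{k}\) from a size measure z, is an illustrative assumption (a common choice for fixed-size PPS designs, usable only when every resulting value is at most 1). The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
def pps_inclusion_probabilities(z, n):
    """Assumed first-order inclusion probabilities pi_k = n * z_k / sum(z)."""
    total = float(sum(z))
    pi = [n * zk / total for zk in z]
    if any(p <= 0 or p > 1 for p in pi):
        raise ValueError("each pi_k must lie in (0, 1]")
    return pi


def horvitz_thompson_total(y, pi):
    """HT estimate of the total: sum of d_k * y_k with d_k = 1 / pi_k."""
    if len(y) != len(pi):
        raise ValueError("y and pi must have the same length")
    return sum(yk / pk for yk, pk in zip(y, pi))


# Toy population of four units with size measure z (hypothetical values).
z = [1, 2, 3, 4]
pi = pps_inclusion_probabilities(z, n=2)

# Suppose units 1 and 3 (0-based) were sampled with these y values.
sample = [1, 3]
y_sample = [20.0, 40.0]
estimate = horvitz_thompson_total(y_sample, [pi[k] for k in sample])
```

Units with a small inclusion probability receive a large weight, which is why the estimator is sensitive to the choice of the size measure.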
The variance of the HT estimator \((\hat{Y}_{HT} )\) is given as:
An unbiased estimator for the variance of \(\hat{Y}_{HT}\) is
As discussed in section "Introduction", non-response may occur even after trying to get responses from all units of the sample ‘s’ of size n. On the other hand, auxiliary information may be available for all elements of the sample.
Let us consider the Hansen and Hurwitz2 subsampling procedure, in which a first-phase sample ‘s’ of size n is selected with design \(P(\cdot)\). When non-response occurs, the response is assumed to be stochastic in nature. The sample ‘s’ can be divided into two parts, the response and non-response subsets, denoted by \(s_{r}\) and \(s_{nr}\) and of sizes \(n_{r}\) and \(n_{nr}\), respectively. Note that the composition of these subsets may vary from one survey to another.
We then select a sufficiently large subsample \(s'_{nr}\) with design \(P(\cdot \mid s_{nr})\) from the non-response subset \(s_{nr}\) and take all required steps to ensure a response from each element of the subsample, \(s'_{nr} \subseteq s_{nr}\). The first- and second-order positive inclusion probabilities are \(\pi_{k|s_{nr}}\) and \(\pi_{jk|s_{nr}};\; j \ne k \in s_{nr}\) for including the \(k\)th unit and the pair of \((j, k)\)th units, respectively, in the subsample \(s'_{nr}\). Let us define \(\Delta_{jk|s_{nr}} = \pi_{jk|s_{nr}} - \pi_{j|s_{nr}} \pi_{k|s_{nr}};\; j \ne k \in s_{nr}\). The population total is estimated by combining the data from the original respondents with those from the respondents obtained in the non-response subsample. In this way, the final sample is \(s' = s_{r} \cup s'_{nr}\).
Let us assume that the population is divided into two groups, response and non-response, of sizes \(N_{r}\) and \(N_{nr}\), respectively, such that \(N_{r} + N_{nr} = N\). In practice, \(N_{r}\) and \(N_{nr}\) are usually unknown, and some authors have estimated them by \(\hat{N}_{r} = \frac{n_{r}}{n}N\) and \(\hat{N}_{nr} = \frac{n_{nr}}{n}N\), respectively.
The conventional design-based estimator of the population total \(Y = \sum_{k \in U} y_{k}\) in the presence of non-response is given as:
\[\hat{Y}_{\pi .nr1} = \sum_{k \in s'} \frac{y_{k}}{\pi'_{k}} \qquad (4)\]
where \(\pi'_{k} = \begin{cases} \pi_{k} & \text{if } k \in s_{r} \\ \pi_{k} \pi_{k|s_{nr}} & \text{if } k \in s'_{nr} \end{cases}\)
This can be rewritten as:
\[\hat{Y}_{\pi .nr1} = \sum_{k \in s_{r}} d_{k} y_{k} + \sum_{k \in s'_{nr}} d_{k} d_{k|s_{nr}} y_{k} \qquad (5)\]
where \(d_{k|s_{nr}} = 1/\pi_{k|s_{nr}}\).
The variance of the estimator \(\hat{Y}_{\pi .nr1}\) is given as:
An unbiased estimator of \(V\left( \hat{Y}_{\pi .nr1} \right)\) is written as:
where \(\pi'_{jk} = \begin{cases} \pi_{jk} \pi_{jk|s_{nr}} & \text{if } j,k \in s_{nr} \\ \pi_{jk} \pi_{j|s_{nr}} & \text{if } j \in s_{nr},\; k \in s_{r} \\ \pi_{jk} \pi_{k|s_{nr}} & \text{if } j \in s_{r},\; k \in s_{nr} \\ \pi_{jk} & \text{if } j,k \in s_{r} \end{cases}\)
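The Hansen and Hurwitz type estimator described above weights each original respondent by \(1/\pi_{k}\) and each followed-up non-respondent by \(1/(\pi_{k} \pi_{k|s_{nr}})\). The Python sketch below illustrates that weighting under these stated definitions; the function name, argument names and toy inputs are hypothetical.

```python
def hansen_hurwitz_total(y_r, pi_r, y_sub, pi_sub, pi_cond):
    """Estimate the total from respondents plus a followed-up subsample.

    y_r, pi_r   : study values and pi_k for units in s_r
    y_sub       : study values for units in the subsample s'_nr
    pi_sub      : first-phase pi_k for the units in s'_nr
    pi_cond     : conditional probabilities pi_{k|s_nr} for the units in s'_nr
    """
    if len(y_r) != len(pi_r):
        raise ValueError("respondent inputs must have equal length")
    if not (len(y_sub) == len(pi_sub) == len(pi_cond)):
        raise ValueError("subsample inputs must have equal length")
    # Respondents: weight d_k = 1 / pi_k.
    respondents = sum(y / p for y, p in zip(y_r, pi_r))
    # Followed-up units: weight d_k * d_{k|s_nr} = 1 / (pi_k * pi_{k|s_nr}).
    follow_up = sum(y / (p * c) for y, p, c in zip(y_sub, pi_sub, pi_cond))
    return respondents + follow_up


# Toy example: two respondents and one followed-up non-respondent.
total = hansen_hurwitz_total(
    y_r=[10.0, 30.0], pi_r=[0.5, 0.5],
    y_sub=[6.0], pi_sub=[0.5], pi_cond=[0.5],
)
```

A smaller follow-up fraction inflates the weights of the subsampled non-respondents, which is one reason the variance of this estimator grows with the non-response rate.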
Gautam et al.29 discussed the scenario in which information on the auxiliary variable is known only for the sampled units and suggested an estimator of the population total:
where \(d_{k}\) is the design weight and \(w_{k}\) is the calibrated weight of the kth unit, derived by minimizing the chi-square distance function \(\sum_{k \in s'_{nr}} \frac{\left( w_{k} - d'_{k} \right)^{2}}{d'_{k} q_{k}}\) subject to the calibration constraints \(\sum_{k \in s'_{nr}} w_{k} = \sum_{k \in s'_{nr}} d'_{k}\) and \(\sum_{k \in s'_{nr}} w_{k} x_{k} = \hat{X}_{s_{nr}}\),
with \(d'_{k} = d_{k} d_{k|s_{nr}}\) and \(\hat{X}_{s_{nr}} = \sum_{k \in U_{nr}} x_{k}\).
The final expression of the estimator \(\hat{Y}_{\pi .nr2}\) is given as:
where \(\hat{B} = \frac{\sum_{k \in s'_{nr}} q_{k} d'_{k} \sum_{k \in s'_{nr}} q_{k} d'_{k} x_{k} y_{k} - \left( \sum_{k \in s'_{nr}} q_{k} d'_{k} x_{k} \right)\left( \sum_{k \in s'_{nr}} q_{k} d'_{k} y_{k} \right)}{\sum_{k \in s'_{nr}} q_{k} d'_{k} \sum_{k \in s'_{nr}} q_{k} d'_{k} x_{k}^{2} - \left( \sum_{k \in s'_{nr}} q_{k} d'_{k} x_{k} \right)^{2}}\).
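The chi-square calibration step described above has a closed-form solution when there are two linear constraints (the weights sum to the design-weight total, and the weighted auxiliary total matches a benchmark). The Python sketch below solves the resulting 2×2 system for the Lagrange multipliers; it is a generic illustration of linear calibration under those two constraints, with hypothetical function and variable names, and is not claimed to reproduce the exact estimator of Gautam et al.

```python
def calibrate_weights(d, x, x_total, q=None):
    """Minimize sum((w - d)^2 / (d * q)) subject to
    sum(w) = sum(d) and sum(w * x) = x_total.

    The minimizer has the form w_k = d_k + d_k * q_k * (lam1 + lam2 * x_k).
    """
    if q is None:
        q = [1.0] * len(d)
    a = sum(dk * qk for dk, qk in zip(d, q))                      # sum q d
    b = sum(dk * qk * xk for dk, qk, xk in zip(d, q, x))          # sum q d x
    c = sum(dk * qk * xk * xk for dk, qk, xk in zip(d, q, x))     # sum q d x^2
    det = a * c - b * b
    if det == 0:
        raise ValueError("x must vary across units for a unique solution")
    gap = x_total - sum(dk * xk for dk, xk in zip(d, x))          # benchmark gap
    lam1 = -b * gap / det
    lam2 = a * gap / det
    return [dk + dk * qk * (lam1 + lam2 * xk) for dk, qk, xk in zip(d, q, x)]


# Toy example with three units (hypothetical numbers).
d = [2.0, 2.0, 2.0]
x = [1.0, 2.0, 3.0]
y = [3.0, 5.0, 8.0]
w = calibrate_weights(d, x, x_total=14.0)
calibrated_total = sum(wk * yk for wk, yk in zip(w, y))
```

Substituting these weights into \(\sum w_{k} y_{k}\) gives the design-weighted total plus \(\hat{B}\) times the benchmark gap, with \(\hat{B}\) of the same form as the regression-type coefficient given above.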
Developed class of estimators
We now propose a class of calibrated estimators for the population total in the presence of non-response under the PPS sampling scheme. Let us now apply the calibration approach to the response group as well as the non-response group which will enhance the efficiency of the estimators. Here we consider the cases in the following subsections:
When auxiliary information is known for both response and non-response groups
Let us presume that information is available on g auxiliary variables, denoted by \(X_{1}, X_{2}, \ldots, X_{g}\), for the entire population. The population totals \(T_{i.r}\) and \(T_{i.nr}\) of the ith auxiliary variable are also assumed to be known for the response (\(s_{r}\)) and non-response (\(s_{nr}\)) groups, respectively. We now define a linear combination of population totals of the auxiliary variables corresponding to the response and non-response groups as:
where \(T_{i.r} = \sum_{k \in U_{r}} X_{ik}\), \(T_{i.nr} = \sum_{k \in U_{nr}} X_{ik}\), \(0 \le \alpha_{i}, \beta_{i} \le 1\) for i = 1, ..., g and \(\sum_{i = 1}^{g} \alpha_{i} = \sum_{i = 1}^{g} \beta_{i} = 1\).
The sample estimates of \(T_{x.r}\) and \(T_{x.nr}\) can be defined as:
where \(\hat{T}_{i.r} = \sum\limits_{{k \in s_{r} }} {d_{k} x_{ik} }\)
where \(\hat{T}_{i.nr} = \sum\limits_{{k \in s^{\prime}_{nr} }} {d_{k}{\prime} x_{ik} }\) and \(d_{k}{\prime} = d_{k} d_{{k|\,s_{nr} }}\).
Following Deville and Särndal19, we propose the calibrated estimator of the population total corresponding to the Hansen and Hurwitz estimator of the population total given in (4) as:
where \(\psi_{k}\) and \(\psi_{k}{\prime}\) are the calibration weights corresponding to response and non-response groups, respectively.
For applying the calibration approach, we modify the given design weights \(d_{k}\) and \(d_{k}{\prime}\) utilizing the available information on g auxiliary variables. Let us state the following calibration constraints for the response and non-response groups:
and
and
where \(\delta_{k}\) and \(\gamma_{k}\) are suitably chosen constants, \(\hat{T}_{ci.r} = \sum_{k \in s_{r}} \psi_{k} x_{ik}\) and \(\hat{T}_{ci.nr} = \sum_{k \in s'_{nr}} \psi'_{k} x_{ik}\).
With the help of the calibration procedure, we adjust the design weights \(d_{k}\) and \(d_{k} d_{k|s_{nr}}\) by minimizing the chi-square distance functions \(\sum_{k \in s_{r}} \frac{\left( \psi_{k} - d_{k} \right)^{2}}{d_{k} q_{k}}\) and \(\sum_{k \in s'_{nr}} \frac{\left( \psi'_{k} - d_{k} d_{k|s_{nr}} \right)^{2}}{d_{k} d_{k|s_{nr}} q_{k}}\) subject to the calibration constraints defined in (15), (16) for the response group and (17), (18) for the non-response group, respectively.
To obtain the proposed calibrated weights \(\psi_{k} ;\,k \in s_{r}\) and \(\psi_{k}{\prime} ;\,k \in s^{\prime}_{nr}\) for the response and non-response groups, respectively, the Lagrange function is expressed as:
After minimization of the Lagrange function with respect to \(\psi_{k}\) and \(\psi_{k}{\prime}\), the calibrated weights for the response and non-response groups are determined as follows:
Now, substituting the value of the calibrated weights \(\psi_{k}\) in the calibration constraints of the response group given in (15), we obtain:
From (16), we obtain:
Similarly, we obtain the following equations after substituting the value of the calibrated weights \(\psi_{k}{\prime}\) for the non-response group in the calibration constraints given in (17) and (18):
Solving (22) and (23), the values of \(\lambda_{11}\) and \(\lambda_{12}\) are obtained as:
where
The final calibrated weights \(\psi_{k}\) for response group can be obtained by substituting the values of \(\lambda_{11}\) and \(\lambda_{12}\) into (20) as:
Solving (24) and (25), the values of \(\lambda_{21}\) and \(\lambda_{22}\) are obtained as:
where
The final expression of calibrated weights \(\psi_{k}{\prime}\) for non-response group after substituting the values of \(\lambda_{21}\) and \(\lambda_{22}\) into (21) is determined as:
The proposed calibration estimator of the population total using updated set of calibrated weights is obtained as:
Finally, the proposed calibration estimator using multiple auxiliary variables in the presence of non-response can also be expressed as:
where,
Special cases:
(i) For one auxiliary variable, i.e., g = 1, and \(\delta_{k} = \gamma_{k} = 1\), the proposed estimator given in (30) will reduce to the estimator given as follows:
where,
(ii) In case of two auxiliary variables, i.e., g = 2 and \(\delta_{k} = \gamma_{k} = 1\), the suggested estimator mentioned in (30) will take the following form:
where,
Remarks
(i) Different values can be chosen for the two-step parameters \(\delta_{k}\) and \(\gamma_{k}\). Their values can be determined by minimizing the mean squared errors of the suggested estimators. As described in Singh and Sedory28, the optimal values of \(\delta_{k}\) and \(\gamma_{k}\), which minimize the variance of the estimator, lie in the interval (0.98, 0.996). Hence, we take \(\delta_{k} = \gamma_{k} = 1\) as an approximation in the simulation study carried out in section "Simulation study".
(ii) Similarly, the optimum values of \(\alpha_{i}\) and \(\beta_{i}\) for i = 1, 2, …, g can be determined by minimizing the variance. However, this becomes complicated as g increases beyond 2. For the sake of simplicity, we may either choose different weights for each auxiliary variable that satisfy \(\sum_{i = 1}^{g} \alpha_{i} = 1\) and \(\sum_{i = 1}^{g} \beta_{i} = 1\), or assign equal weights to each auxiliary variable, i.e. \(\alpha_{i} = \beta_{i} = \frac{1}{g}\) for i = 1, 2, …, g.
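To make the role of the weights \(\alpha_{i}\) (or \(\beta_{i}\)) concrete, the Python sketch below forms the combined auxiliary value \(\alpha_{1} x_{1k} + \ldots + \alpha_{g} x_{gk}\) for each unit, defaulting to the equal weights 1/g mentioned in remark (ii). The function name and the data layout (one list of g auxiliary values per unit) are assumptions made for illustration only.

```python
def composite_auxiliary(x_rows, alpha=None):
    """Return t_k = sum_i alpha_i * x_ik for every unit k.

    x_rows[k] holds the g auxiliary values (x_1k, ..., x_gk) of unit k.
    If alpha is omitted, equal weights 1/g are used.
    """
    g = len(x_rows[0])
    if alpha is None:
        alpha = [1.0 / g] * g
    if len(alpha) != g:
        raise ValueError("one weight is needed per auxiliary variable")
    if any(a < 0 or a > 1 for a in alpha):
        raise ValueError("each weight must lie in [0, 1]")
    if abs(sum(alpha) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return [sum(a * x for a, x in zip(alpha, row)) for row in x_rows]


# Two units, two auxiliary variables (hypothetical values).
x_rows = [[1.0, 2.0], [3.0, 4.0]]
t_equal = composite_auxiliary(x_rows)                 # equal weights 1/2
t_custom = composite_auxiliary(x_rows, [0.25, 0.75])  # user-chosen weights
```

Collapsing the g auxiliary variables into one composite value keeps the calibration system at two constraints per group regardless of g, at the cost of having to choose the weights.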
The proposed calibration estimator given in (30) can be reduced for \(\delta_{k} = \gamma_{k} = 1\) as:
The variance of the proposed estimator, following Särndal et al.30, is derived to the first order of approximation as:
where \(\varepsilon_{1k} = \left( y_{k} - \xi_{2.r} t_{kx.r} \right)\), \(\varepsilon_{2k} = \left( y_{k} - \xi_{2.nr} t_{kx.nr} \right)\) and \(\varepsilon_{3k} = \left( y_{j} - \hat{\xi}^{\prime\prime\prime}_{2.nr} t_{kx.nr} \right)\).
\(t_{kx.r} = \left( {\alpha_{1} x_{1k} + \alpha_{2} x_{2k} + ... + \alpha_{g} x_{gk} } \right)\) and \(t_{kx.nr} = \left( {\beta_{1} x_{1k} + \beta_{2} x_{2k} + ... + \beta_{g} x_{gk} } \right)\)
An estimator for variance of the proposed estimator is given as:
where \(e_{1k} = \left( y_{k} - \hat{\xi}_{2.r} t_{kx.r} \right)\), \(e_{2k} = \left( y_{k} - \hat{\xi}^{\prime\prime\prime}_{2.nr} t_{kx.nr} \right)\), \(e_{3k} = \left( y_{j} - \hat{\xi}_{2.nr} t_{kx.nr} \right)\).
and
When auxiliary information is not known for both groups
In Section "When auxiliary information is known for both response and non-response groups", we assumed that the population totals of the multi-auxiliary variables (X1, X2, …, Xg) are known for both the response and non-response groups. Here we consider the case in which these population totals are unknown for both groups. In this case, we adopt a two-phase sampling scheme and, in the first phase, draw a sufficiently large sample \(s^{*}\) of size \(n^{*}\) from the population of size N with sampling design \(P(\cdot)\). Let us again define the first- and second-order inclusion probabilities \(\pi_{k}^{*} = P(k \in s^{*})\) and \(\pi_{jk}^{*} = P(j, k \in s^{*})\) in the first phase. Let \(s_{r}^{*}\) be the sample subset of size \(n_{r}^{*}\) and \(s_{nr}^{*}\) be the sample subset of size \(n_{nr}^{*}\) for the response and non-response groups, respectively. In the first-phase sample, information on the multi-auxiliary variables is available for the response and non-response groups. We now obtain the estimated population total of the ith auxiliary variable from the first-phase sample for both groups as:
where \(d_{k}^{*} = \frac{1}{\pi_{k}^{*}}\) and i = 1, 2, …, g.
Then we select a second-phase sample s of size n from the first-phase sample \(s^{*}\) with sampling design \(P\left( \cdot \mid s^{*} \right)\). The first- and second-order inclusion probabilities for the second-phase sample are \(\pi_{k|s^{*}} = P(k \in s \mid s^{*})\) and \(\pi_{jk|s^{*}} = P(j, k \in s \mid s^{*})\), respectively. Here \(s_{r}\) is the responding subset and \(s_{nr}\) is the non-responding subset, of sizes \(n_{r}\) and \(n_{nr}\), respectively. Because of item non-response for the study variable, we further select a subset \(s'_{nr}\) of size \(n'_{nr}\) from the non-responding subset \(s_{nr}\). In this way, the final sample is \(s'' = s_{r} \cup s'_{nr}\).
Since the population totals Ti.r and Ti.nr of the ith auxiliary variable are unknown and estimated from the first-phase sample, we define linear combinations of population totals of the auxiliary variables for response and non-response groups on the basis of first-phase sample as:
where 0 ≤ αi, βi ≤ 1 for i = 1, 2,. . . , g and \(\sum\limits_{i = 1}^{g} {\alpha_{i} } = \sum\limits_{i = 1}^{g} {\beta_{i} } = 1\).
The HH type estimator in case of unknown population totals of the auxiliary variables is defined as:
When population totals of auxiliary variables are not known for the entire population, the suggested calibration estimator of population total can be defined as:
where \(\psi_{k}^{*}\) and \(\psi_{k}^{**}\) are the calibration weights.
The calibration constraints for response and non-response groups are stated as:
Response group:
and
Non-response group:
and
where \(\delta_{k}^{*}\) and \(\gamma_{k}^{*}\) are suitably chosen constants whereas \(\hat{T}_{ci.r}^{*} = \sum\limits_{{k \in s_{r} }} {\psi_{k}^{*} x_{ik} }\) and \(\hat{T}_{ci.nr}^{*} = \sum\limits_{{k \in s^{\prime}_{nr} }} {\psi_{k}^{**} x_{ik} }\).
To minimize the chi-square distance functions subject to the calibration constraints defined in (40), (41) and (42), (43), we formulate the Lagrange function as:
After differentiating the Lagrange function with respect to \(\psi_{k}^{*}\) and \(\psi_{k}^{**}\) and equating to zero, we substitute the calibrated weights into the calibration constraints. The values of the Lagrange multipliers are obtained as:
where
The final expressions of the calibrated weights \(\psi_{k}^{*}\) and \(\psi_{k}^{**}\) for the response and non-response groups are determined as:
The proposed calibrated estimator of the population total is expressed as:
where,
The variance of \(\hat{Y}_{c\pi .dg}\), up to the first order approximation is given by:
where \(\varepsilon_{1k} = \left( y_{k} - \xi_{2.r} t_{kx.r} \right)\), \(\varepsilon_{2k} = \left( y_{k} - \xi_{2.nr} t_{kx.nr} \right)\), \(\varepsilon_{1j}^{*} = \left( y_{k} - \hat{\xi}_{2.r}^{\prime *} t_{kx.r} \right)\), \(\varepsilon_{2j}^{*} = \left( y_{k} - \hat{\xi}_{2.nr}^{\prime *} t_{kx.nr} \right)\) and \(\varepsilon_{3k} = \left( y_{j} - \hat{\xi}_{2.r}^{\prime\prime *} t_{kx.nr} \right)\), \(\Delta_{jk}^{*} = \pi_{jk}^{*} - \pi_{j}^{*} \pi_{k}^{*}\) and \(\Delta_{jk|s^{*}}^{*} = \pi_{jk|s^{*}} - \pi_{j|s^{*}} \pi_{k|s^{*}}\).
The estimator of the variance \(V\left( {\hat{Y}_{c\pi .dg} } \right)\) up to the first-order approximation is:
where \(e_{1k} = \left( y_{k} - \hat{\xi}_{2.r}^{\prime *} t_{kx.r} \right)\), \(e_{2k} = \left( y_{k} - \hat{\xi}_{2.nr}^{\prime *} t_{kx.nr} \right)\), \(e_{1k}^{*} = \left( y_{k} - \hat{\xi}_{2.r}^{*} t_{kx.r} \right)\), \(e_{2k}^{*} = \left( y_{k} - \hat{\xi}_{2.nr}^{\prime\prime *} t_{kx.nr} \right)\), \(e_{2k}^{\prime *} = \left( y_{k} - \hat{\xi}_{2.nr}^{*} t_{kx.nr} \right)\) and \(\pi_{jk}^{*} = \begin{cases} \pi_{jk|s^{*}} \pi_{jk|s_{nr}} & \text{if } j,k \in s_{nr} \\ \pi_{jk|s^{*}} \pi_{j|s_{nr}} & \text{if } j \in s_{nr},\; k \in s_{r} \\ \pi_{jk|s^{*}} \pi_{k|s_{nr}} & \text{if } j \in s_{r},\; k \in s_{nr} \\ \pi_{jk|s^{*}} & \text{if } j,k \in s_{r} \end{cases}\).
Simulation study
In this section, we evaluate the performance of the developed estimators with a simulation study carried out in R. The simulation study is conducted on the real population MU284, given in Särndal et al.30:
In the study, we considered RMT85 as the study variable (Y) and P75 and ME84 as the auxiliary variables X1 and X2, respectively. The variable SS82 is taken as the Z variable used to compute the inclusion probabilities for drawing the required samples.
We considered non-response rates (%NR) of 10, 20 and 30 percent in the data, together with different numbers of responses (4, 6, 8) obtained from the non-response group after a follow-up. We generated R = 20,000 samples.
The empirical percentage absolute relative bias (%ARB) and percentage relative root mean squared error (%RRMSE) are computed for the estimators of population total using the following formulae:
The percent relative efficiency of the estimator \(\hat{Y}_{i\alpha }\) with respect to the estimator \(\hat{Y}_{\pi .nr1}\) can be obtained as:
We now compare the suggested calibrated estimators \(\hat{Y}_{c\pi .1}\) and \(\hat{Y}_{c\pi .2}\), the HH estimator \(\hat{Y}_{\pi .nr1}\) and the Gautam et al.29 estimator \(\hat{Y}_{\pi .nr2}\) by computing %ARB, %RRMSE and %RE over all selected samples for different values of the % non-response, with \(\delta_{k} = \gamma_{k} = 1\) and \(\alpha_{i} = \beta_{i} = \frac{1}{g}\).
The %ARB, %RRMSE and %RE are computed for all scenarios discussed above. The values of %ARB are given in Table 1, %RRMSE in Table 2 whereas %RE in Table 3 for known population totals of the auxiliary variables. The values of %ARB are given in Table 4, %RRMSE in Table 5 whereas %RE in Table 6 for unknown population totals of the auxiliary variables.
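The performance measures used above can be sketched in Python as follows. The formulas are assumptions based on the usual definitions: %ARB as the absolute difference between the mean of the simulated estimates and the true total, relative to the true total, and %RRMSE as the root mean squared error relative to the true total, both times 100. Taking %RE as the ratio of the reference %RRMSE to the estimator's %RRMSE is likewise an assumption, although it appears consistent with the reported figures (for example, 24.4250 / 8.9803 × 100 ≈ 271.98). All function names are hypothetical.

```python
import math


def pct_arb(estimates, true_total):
    """Percentage absolute relative bias over the simulated estimates."""
    mean_est = sum(estimates) / len(estimates)
    return abs(mean_est - true_total) / true_total * 100.0


def pct_rrmse(estimates, true_total):
    """Percentage relative root mean squared error over the simulated estimates."""
    mse = sum((e - true_total) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / true_total * 100.0


def pct_re(rrmse_reference, rrmse_estimator):
    """Percent relative efficiency, assumed here to be a ratio of %RRMSE values."""
    return rrmse_reference / rrmse_estimator * 100.0


# Toy check with two simulated estimates of a true total of 100.
toy_estimates = [90.0, 110.0]
toy_arb = pct_arb(toy_estimates, 100.0)
toy_rrmse = pct_rrmse(toy_estimates, 100.0)
```

In a full simulation, `estimates` would hold one value per replicate (R = 20,000 in the study), computed separately for each estimator and each non-response setting.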
Results and discussion
When the population totals of the auxiliary variables are known and accessible for both response and non-response groups, Tables 1, 2 and 3 illustrate the following:
- The percentage absolute relative biases (%ARB) shown in Table 1 vary from 0.2091% to 0.3810% for the HH estimator (\(\hat{Y}_{\pi .nr1}\)), from 0.1843% to 0.1938% for the Gautam et al.29 estimator (\(\hat{Y}_{\pi .nr2}\)), and from 0.0738% to 0.0813% and 0.0120% to 0.0247% for the proposed estimators \(\hat{Y}_{c\pi .1}\) and \(\hat{Y}_{c\pi .2}\), respectively.
- The percentage relative root mean squared errors (%RRMSE) shown in Table 2 vary from 24.4250% to 40.06712% for the HH estimator (\(\hat{Y}_{\pi .nr1}\)), from 21.3285% to 22.2373% for the Gautam et al.29 estimator (\(\hat{Y}_{\pi .nr2}\)), and from 8.9803% to 10.1187% and 1.5767% to 3.7224% for the proposed estimators \(\hat{Y}_{c\pi .1}\) and \(\hat{Y}_{c\pi .2}\), respectively.
- Table 3 reports the percent relative efficiency (%RE). The values of %RE vary from 271.98% to 395.96% and from 1029.10% to 1805.81% for the suggested estimators \(\hat{Y}_{c\pi .1}\) and \(\hat{Y}_{c\pi .2}\), respectively, for one and two auxiliary variables.
When the population totals of the auxiliary variables are unknown and are estimated from the first-phase samples for both response and non-response groups, Tables 4, 5 and 6 show the following:
- The percentage absolute relative biases (%ARB) shown in Table 4 vary from 0.2106% to 0.2785% for the HH estimator (\(\hat{Y}_{\pi .nr1}\)), from 0.1964% to 0.2% for the Gautam et al.29 estimator (\(\hat{Y}_{\pi .nr2}\)), and from 0.1425% to 0.1425% and 0.1278% to 0.1290% for the proposed estimators \(\hat{Y}_{c\pi .d1}\) and \(\hat{Y}_{c\pi .d2}\), respectively.
- The percentage relative root mean squared errors (%RRMSE) shown in Table 5 vary from 25.5774% to 32.7749% for the HH estimator (\(\hat{Y}_{\pi .nr1}\)), from 23.6388% to 24.0998% for the Gautam et al.29 estimator (\(\hat{Y}_{\pi .nr2}\)), and from 17.5062% to 17.8050% and 15.5658% to 15.7624% for the proposed estimators \(\hat{Y}_{c\pi .d1}\) and \(\hat{Y}_{c\pi .d2}\), respectively.
- Table 6 reports the percent relative efficiency (%RE). The values of %RE vary from 145.52% to 184.14% and from 164.31% to 207.92% for the suggested estimators \(\hat{Y}_{c\pi .d1}\) and \(\hat{Y}_{c\pi .d2}\), respectively, for one and two auxiliary variables.
The findings of Tables 1 and 4 demonstrate that the suggested calibrated estimators, both for the scenario in which auxiliary totals are available for the response and non-response groups and for the scenario in which they are not available for either group, have lower %ARB than all other estimators for all three sets of sample sizes. Additionally, the %RRMSE values reported in Tables 2 and 5 for the proposed estimators are the lowest among all estimators considered. The proposed estimators using multi-auxiliary information therefore outperform the existing estimators in terms of reduced % absolute relative bias and % relative root mean squared error in both cases in the presence of non-response. The same patterns are also evident from Figs. 1 and 2. The percentage relative efficiencies in Tables 3 and 6, for the known and unknown cases respectively, confirm these findings in the presence of non-response.
Conclusion
We developed generalized calibration estimators for finite population totals in the presence of non-response, employing available information on multi-auxiliary variables. Specifically, we focused on two scenarios in this paper. In the first scenario, the auxiliary information is assumed to be known for all population units, whereas in the second scenario, it is unknown for all population units. Expressions for the estimators of the population total and their variances are derived for these two scenarios. A simulation study has also been carried out to validate the theoretical conclusions.
The proposed estimators are compared with the existing estimators using real data in the simulation study. The values of the percentage absolute relative bias (%ARB), percentage relative root mean squared error (%RRMSE) and percent relative efficiency (%RE) are documented in Tables 1, 2, 3, 4, 5 and 6. The findings of the simulation study show that the suggested calibrated estimators give lower absolute relative biases and relative root mean squared errors. That is, the proposed calibration estimators outperform the estimators of Hansen and Hurwitz2 and Gautam et al.29 in both scenarios. Moreover, the improvement of the suggested estimators (in terms of reduced %ARB and %RRMSE) is apparent from their application to the actual data.
Data availability
All data generated or analysed during this study are included in this published article.
References
Horvitz, D. G. & Thompson, D. J. A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 47(260), 663–685 (1952).
Hansen, M. H. & Hurwitz, W. N. The problem of the non-response in sample surveys. J. Am. Stat. Assoc. 41, 517–529 (1946).
Cochran, W. G. Sampling Techniques 3rd edn. (John Wiley & Sons, 1977).
Rao, P. S. R. S. Ratio and regression estimates with sub sampling the non respondents. In Paper presented at a special contributed session of the International Statistical Association Meeting, Tokyo, Japan, Sept. pp. 2–16 (1987).
Rao, P. S. R. S. Ratio estimation with sub sampling the non-respondents. Surv. Methodol. 12, 217–230 (1986).
Khare, B. B. & Srivastava, S. Estimation of population mean using auxiliary character in presence of non-response. Natl. Acad. Sci. Lett. India 16, 111–114 (1993).
Khare, B. B. & Srivastava, S. Study of conventional and alternative two phase sampling ratio, product and regression estimators in presence of non-response. Proc. Indian Natl. Sci. Acad. 65, 195–203 (1995).
Tripathi, T. P. & Khare, B. B. Estimation of mean vector in presence of non-response. Commun. Stat. Theory Methods 26(9), 2255–2269 (1997).
Okafor, F. C. & Lee, H. Double sampling for ratio and regression estimation with sub-sampling the non-respondents. Surv. Methodol. 26(2), 183–188 (2000).
Chhikara, R. S. & Sud, U. C. Estimation of population and domain totals under two-phase sampling in the presence of non-response. J. Indian Soc. Agric. Stat. 63(3), 297–304 (2009).
Singh, H. P. & Kumar, S. A general class of estimators of the population mean in survey sampling using auxiliary information with sub sampling the non-respondents. Kor. J. Appl. Statist. 22, 387–402 (2009).
Singh, H. P. & Kumar, S. A general family of estimators of the finite population ratio, product and mean using two phase sampling scheme in the presence of non-response. J. Stat. Theory Pract. 2(4), 677–692 (2008).
Singh, H. P. & Kumar, S. A general procedure of estimating the population mean in the presence of non-response under double sampling using auxiliary information. SORT 33, 71–84 (2009).
Singh, H. P. & Kumar, S. A regression approach to the estimation of the finite population mean in the presence of non-response. Aust. N. Z. J. Stat. 50(4), 395–402 (2008).
Singh, H. P. & Kumar, S. Estimation of mean in the presence of non-response using two phase sampling scheme. Stat. Pap. 51, 559–582 (2010).
Singh, H. P. & Kumar, S. Improved estimation of population mean under double sampling with sub-sampling the non-respondents. J. Stat. Plan. Infer. 140(9), 2536–2550 (2010).
Khare, B. B. & Sinha, R. R. Estimation of population mean using multiauxiliary characters with sampling the non-respondents. Stat. Transit. New Ser. 12(1), 45–56 (2011).
Khare, B. B. & Sinha, R. R. Estimation of the ratio of the two population means using multiauxiliary characters in the presence of non-response. In Statistical Techniques in Life Testing, Reliability, Sampling Theory and Quality Control (ed. Pandey, B. N.) 163–171 (Narosa Publishing House, New Delhi, India, 2007).
Deville, J. C. & Särndal, C. E. Calibration estimators in survey sampling. J. Am. Stat. Assoc. 87(418), 376–382 (1992).
Estevao, V. M. & Särndal, C. E. Survey estimates by calibration on complete auxiliary information. Int. Stat. Rev. 74, 127–147 (2006).
Ozgul, N. New calibration estimator based on two auxiliary variables in stratified sampling. Commun. Stat. Theory Methods 48(6), 1481–1492 (2018).
Lundström, S. & Särndal, C. E. Calibration as a standard method for the treatment of nonresponse. J. Off. Stat. 15(2), 305–327 (1999).
Kott, P. S. Using calibration weighting to adjust for nonresponse and coverage errors. Surv. Methodol. 32(2), 133–142 (2006).
Chang, T. & Kott, P. S. Using calibration weighting to adjust for nonresponse under a plausible model. Biometrika 95(3), 555–571 (2008).
Kott, P. S. & Chang, T. Using calibration weighting to adjust for nonignorable unit nonresponse. J. Am. Stat. Assoc. 105(491), 1265–1275 (2010).
Raman, R. K., Sud, U. C. & Chandra, H. Calibration approach for estimating population total with subsampling of non-respondents under single-and two-phase sampling. Commun. Stat. Theory Methods 45(10), 2842–2856 (2016).
Audu, A. et al. On the estimation of finite population variance for a mail survey design in the presence of non-response using new conventional and calibrated estimators. Commun. Stat. Theory Methods 53(3), 848–864 (2024).
Singh, S. & Sedory, S. A. Two-step calibration of design weights in survey sampling. Commun. Stat. Theory Methods 45(12), 3510–3523 (2016).
Gautam, A. K., Sharma, M. K. & Sisodia, B. V. S. Development of calibration estimator of population mean under non-response. Int. J. Commun. Soc. 8(5), 1360–1371 (2020).
Särndal, C. E., Swensson, B. & Wretman, J. Model Assisted Survey Sampling (Springer-Verlag, New York, USA, 2003).
Acknowledgements
The authors are thankful to the Chief Editor: Dr. Rafal Marszalek, and learned referees for their valuable suggestions regarding improvement of the article.
Author information
Contributions
N.G. conceived the idea for the estimators; N.G. and A.P. developed the theoretical framework of the article. A.P. and M.P. carried out the simulation studies and drafted the article. All the authors read and approved the final article.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Garg, N., Patel, A. & Pachori, M. Calibration estimation of population total using multi-auxiliary information in the presence of non-response. Sci Rep 14, 17247 (2024). https://doi.org/10.1038/s41598-024-68203-2