Design based synthetic imputation methods for domain mean

Bhushan, Shashi; Kumar, Anoop; Pokhrel, Rohini; Bakr, M. E.; Mekiso, Getachew Tekle

doi:10.1038/s41598-024-53909-0

Published: 21 February 2024


Scientific Reports volume 14, Article number: 4250 (2024)


Abstract

In real life, situations may arise in which the available data are insufficient to provide accurate estimates for a domain; in such cases, small area estimation (SAE) techniques are used to obtain accurate estimates of the variable under study. Missing data are a serious problem in sample surveys in general, and small area estimates are especially prone to it. This paper is a basic effort that suggests design based synthetic imputation methods for domain mean estimation under simple random sampling in order to address the issue of missing data under SAE. The expressions of the mean square error for the proposed imputation methods are obtained up to the first order of approximation. The efficiency conditions are determined and a thorough simulation study is carried out using artificially generated data sets. An application to real data is included that further supports this study.


Introduction

The majority of surveys are only intended to offer statistically valid, design-based estimates at the national and/or state/territory geographic levels. Implementing sample surveys that would produce accurate estimates at levels smaller than state/territory would be extremely difficult and expensive, both in terms of the larger sample sizes needed and the increased burden on survey respondents. Small area estimation (SAE) techniques are used to overcome the issue of small sample sizes and to improve on the accuracy of direct survey estimates derived from the sample in each small area. Direct, synthetic, and other indirect estimators are among the techniques used for SAE. Direct estimators employ information solely from the specified area under study. They are mostly unbiased but very unstable, with large variances. Indirect and composite estimators are more accurate because they additionally include information from related variables or nearby areas.

The direct estimators have been shown to produce unacceptably large standard errors as a result of the small samples available from the relevant small area. In reality, there may be circumstances in which no sample units are selected from some of the small domains. It is therefore necessary to find indirect (synthetic) estimators that effectively increase the sample size and thereby reduce the standard error of the estimator in order to achieve adequate statistical accuracy. According to Gonzalez¹ “an estimator is called a synthetic estimator if a reliable direct estimator for a large area, covering several small domains, is used to derive an indirect estimate for a small domain, under the assumption that the small areas have the same characteristics as the large area”. Developing indirect estimators for small areas is necessary because sample data are insufficient in small geographic areas. Numerous researchers, particularly in the fields of health, agriculture, and poverty, have developed synthetic estimators. According to recent research by Tikkiwal and Ghiya², Pandey and Tikkiwal³, Tikkiwal et al.⁴, Ashutosh et al.⁵,⁶, and Bhushan et al.⁷, small area estimators based on auxiliary information outperform those that exclude it.

The issue of missing data is persistent in sample surveys and necessitates prompt action to protect the validity of any conclusions drawn from such data. Properties of the estimators such as unbiasedness and efficiency may both be compromised by missing data. Imputation is the preferred and most often used method for dealing with missing data. Rubin⁸ proposed three fundamental concepts in his landmark work: missing at random (MAR), observed at random (OAR), and parameter distribution (PD). A distinction between missing at random (MAR) and missing completely at random (MCAR) was provided by Heitjan and Basu⁹. Many authors have addressed the issue of missing data, and different imputation approaches have been used to fill in the gaps. The availability of adequate auxiliary information is critical for the creation of effective imputation schemes. Numerous prominent researchers, including Rueda et al.¹⁰, Toutenburg and Srivastava¹¹, Toutenburg et al.¹², Singh and Horn¹³, Prasad¹⁴, Singh and Deo¹⁵, Singh¹⁶, Ahmed et al.¹⁷, Bhushan and Pandey¹⁸,¹⁹, Bhushan et al.²⁰,²¹, Prasad²², Prasad and Yadav²³, and Bhushan and Kumar²⁴, have worked in this field and developed imputation methods and the corresponding estimators for missing data utilizing auxiliary information. In this study, we assume the MCAR mechanism throughout when imputing missing data.

Further, no imputation method is available in the literature to address the issue of missing data under SAE. Therefore, the objectives of this article are:

(i) to propose some fundamental imputation methods, namely, mean, ratio, and logarithmic type, for estimating the domain mean;
(ii) to propose Searls type logarithmic imputation methods for estimating the domain mean;
(iii) to compare the fundamental imputation methods with our Searls type logarithmic imputation methods.

Note that while imputing the missing observations, we do not modify the original responses. The methodology and notations used in this study are discussed below.

Methodology and notations

Consider a specified population $\Phi =\{1, 2,\ldots , N\}$ of size N from which a simple random sample s of size n is drawn without replacement. In order to estimate the mean of domain d, we use the information collected in the sample. Further, let $r_d$ and r be the numbers of responding units among the chosen $n_d$ and n units, and let $R_d$ and R be the sets of responding units in the domain d and the total population, respectively. Also, ${\bar{R}}_d$ and ${\bar{R}}$ denote the sets of non-responding units in the domain d and the total population, respectively. For all units $i\in R$, the quantity $y_i$ is observed, but for the units $i \in {\bar{R}}$, the quantities are missing and imputed values must be obtained to complete the sample data set. Suppose the imputation is carried out using additional auxiliary information, X, so that $X_i$, the value of X for unit i, is available and positive for all $i \in s$ and the data $\mathbf {X_s}=\left\{ X_i;~i \in s\right\}$ are available.

To derive the mean square error (MSE) of the consequent synthetic estimators of the proposed synthetic imputation methods, we use the following notation: ${\bar{y}}_{r}={\bar{Y}}(1+\varepsilon _0)$, ${\bar{x}}_{r}={\bar{X}}(1+\varepsilon _1)$, and ${\bar{x}}_{n}={\bar{X}}(1+\varepsilon _2)$, where the $\varepsilon$'s are error terms such that $E(\varepsilon _k)=0,~k=0,1,2$ and $E(\varepsilon _0^2)=f_{r}C_{y}^2$, $E(\varepsilon _1^2)=f_{r}C_{x}^2$, $E(\varepsilon _2^2)=f_{n}C_{x}^2$, $E(\varepsilon _0\varepsilon _1)=f_{r}\rho _{yx}C_{y}C_{x}$, $E(\varepsilon _0\varepsilon _2)=f_{n}\rho _{yx}C_{y}C_{x}$, $E(\varepsilon _1\varepsilon _2)=f_{n}C_{x}^2$. Here $f_{r}=\left( \frac{1}{r}-\frac{1}{N}\right)$ and $f_{n}=\left( \frac{1}{n}-\frac{1}{N}\right)$, $C_{y}$ and $C_{x}$ are the coefficients of variation of the study and auxiliary variables, respectively, and $\rho _{yx}$ is the correlation coefficient between the study and auxiliary variables.
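As a rough illustration of this notation, the following Python sketch computes $f_r$, $f_n$, $C_y$, $C_x$ and $\rho_{yx}$ from population values. The function name `design_quantities` and the toy arrays are hypothetical, and the use of the $N-1$ divisor for the standard deviations is an assumption, since the text does not state which divisor is intended.

```python
import numpy as np

def design_quantities(y, x, n, r):
    """Return (f_r, f_n, C_y, C_x, rho_yx) for a finite population (y, x)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    N = y.size
    f_r = 1.0 / r - 1.0 / N              # finite population term for r respondents
    f_n = 1.0 / n - 1.0 / N              # finite population term for n sampled units
    C_y = y.std(ddof=1) / y.mean()       # coefficient of variation (assumed N-1 divisor)
    C_x = x.std(ddof=1) / x.mean()
    rho = np.corrcoef(y, x)[0, 1]        # correlation between study and auxiliary variable
    return f_r, f_n, C_y, C_x, rho

# Toy population (hypothetical values): y is exactly proportional to x
x_pop = np.arange(1.0, 11.0)
y_pop = 2.0 * x_pop
f_r, f_n, C_y, C_x, rho = design_quantities(y_pop, x_pop, n=5, r=4)
```

Because $r \le n$, the sketch always gives $f_r \ge f_n$, which is what drives the differences between schemes I and II in the MSE expressions that follow.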

The remainder of the article is organized as follows. In “Adapted imputation methods” and “Proposed synthetic Searls type logarithmic imputation methods”, respectively, the adapted and proposed imputation methods are presented together with formulae for the mean square error (MSE). In “Efficiency conditions”, a comparison of the various imputation strategies is given. In “Simulation study”, a comprehensive simulation analysis using a few artificial populations is provided, and the main simulation results are discussed. In “Real data application”, an application to real data is provided. In “Conclusions”, some concluding remarks are given.

Adapted imputation methods

Since the literature contains no imputation methods for estimating the mean of domain d in the presence of missing data, we adapt some conventional imputation methods for the estimation of the domain mean.

Conventional mean imputation method

When information on the auxiliary variable is not available, the conventional mean imputation method is the obvious choice. When the ith sample unit in domain d is missing and requires imputation, we suggest mean imputation for the domain mean by extending the notation of Lee et al.²⁵ for unit value imputation. The synthetic mean imputation technique for the domain mean is given by

$$\begin{aligned} y_{{.i_m}}= {\left\{ \begin{array}{ll} y_i&{}\text {if }i \in R\\ {\bar{y}}_{r} &{}\text {if }i \in {\bar{R}} \end{array}\right. } \end{aligned}$$

The consequent synthetic estimator is

$$\begin{aligned} t_{m}&={\bar{y}}_{r} \end{aligned}$$

The MSE of the consequent synthetic mean estimator is

$$\begin{aligned} MSE(t_m)&=({\bar{Y}}-{\bar{Y}}_d)^2+{\bar{Y}}^2f_{r}C_{y}^2 \end{aligned}$$

(1)

The imputation approaches are distinguished into two schemes when additional auxiliary information is taken into account.

Scheme I: When ${\bar{X}}_d$ is known and ${\bar{x}}_{n,d}$ is used.

Scheme II: When ${\bar{X}}_d$ is known and ${\bar{x}}_{r,d}$ is used.

Synthetic ratio imputation methods

The ratio imputation method provides efficient results when the study and auxiliary variables are positively correlated. The classical synthetic ratio imputation methods under schemes I and II are defined as

Scheme I

$$\begin{aligned} y_{{.i}_{r_1}}&= {\left\{ \begin{array}{ll} y_i&{}\text {if }i \in R\\ \frac{1}{n-r}\left[ n{\bar{y}}_{r}\left( \frac{{\bar{X}}_d}{{\bar{x}}_{n}}\right) -r{\bar{y}}_{r}\right] &{}\text {if }i \in {\bar{R}} \end{array}\right. } \end{aligned}$$

Scheme II

$$\begin{aligned} y_{{.i}_{r_2}}&= {\left\{ \begin{array}{ll} y_i&{}\text {if }i \in R\\ \frac{1}{n-r}\left[ n{\bar{y}}_{r}\left( \frac{{\bar{X}}_d}{{\bar{x}}_{r}}\right) -r{\bar{y}}_{r}\right] &{}\text {if }i \in {\bar{R}} \end{array}\right. } \end{aligned}$$

The consequent synthetic ratio estimators under above schemes are

$$\begin{aligned} t_{r_1}&={\bar{y}}_{r}\left( \frac{{\bar{X}}_d}{{\bar{x}}_{n}}\right) \\ t_{r_2}&={\bar{y}}_{r}\left( \frac{{\bar{X}}_d}{{\bar{x}}_{r}}\right) \end{aligned}$$
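The imputed value in each scheme is constructed so that the mean of the completed sample (observed plus imputed values) equals the corresponding estimator. The following Python sketch checks this numerically for scheme I; the function name `ratio_impute` and all numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

def ratio_impute(y_resp, n, Xbar_d, xbar):
    """Complete a sample of size n from r observed responses by ratio imputation.

    xbar is the sample mean of x over all n units (scheme I)
    or over the r respondents only (scheme II).
    """
    y_resp = np.asarray(y_resp, dtype=float)
    r = y_resp.size                      # number of respondents, must satisfy r < n
    ybar_r = y_resp.mean()
    # Common value assigned to each of the n - r non-respondents
    imputed = (n * ybar_r * (Xbar_d / xbar) - r * ybar_r) / (n - r)
    return np.concatenate([y_resp, np.full(n - r, imputed)])

# Hypothetical toy numbers
y_resp = np.array([12.0, 15.0, 9.0, 14.0])   # r = 4 observed responses
n = 6                                        # sample size, so 2 values are imputed
Xbar_d = 20.0                                # known domain mean of x
xbar_n = 18.0                                # sample mean of x over all n units

completed = ratio_impute(y_resp, n, Xbar_d, xbar_n)
t_r1 = y_resp.mean() * Xbar_d / xbar_n       # scheme I estimator
print(completed.mean(), t_r1)                # the two values agree
```

Passing the respondent mean ${\bar{x}}_{r}$ instead of ${\bar{x}}_{n}$ as `xbar` gives the scheme II version in the same way. Observed responses are left unchanged; only the non-respondents receive the imputed value.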

Theorem 2.1

The MSE of the consequent synthetic ratio estimators $t_{r_j},~j=1,2$ of the synthetic ratio imputation methods $y_{{.i}_{r_j}}$ under schemes I and II is given by

$$\begin{aligned} MSE(t_{r_1})&={\bar{Y}}_d^2\left( f_{r}C_{y}^2+f_{n}C_{x}^2-2f_{n}\rho _{yx}C_{y}C_{x}\right) \end{aligned}$$

(2)

$$\begin{aligned} MSE(t_{r_2})&={\bar{Y}}_d^2f_{r}\left( C_{y}^2+C_{x}^2-2\rho _{yx}C_{y}C_{x}\right) \end{aligned}$$

(3)

Synthetic logarithmic imputation methods

The proposed synthetic logarithmic imputation methods under schemes I and II are given below.

Scheme I

$$\begin{aligned} y_{{.i}_{l_1}}&= {\left\{ \begin{array}{ll} y_i&{}\text {if }i \in R\\ \frac{1}{n-r}\left[ n{\bar{y}}_{r}\left\{ 1+\theta _{1}\log \left( \frac{{\bar{x}}_{n}}{{\bar{X}}_d}\right) \right\} -r{\bar{y}}_{r}\right] &{}\text {if }i \in {\bar{R}} \end{array}\right. } \end{aligned}$$

Scheme II

$$\begin{aligned} y_{{.i}_{l_2}}&= {\left\{ \begin{array}{ll} y_i&{}\text {if }i \in R\\ \frac{1}{n-r}\left[ n{\bar{y}}_{r}\left\{ 1+\theta _{2}\log \left( \frac{{\bar{x}}_{r}}{{\bar{X}}_d}\right) \right\} -r{\bar{y}}_{r}\right] &{}\text {if }i \in {\bar{R}} \end{array}\right. } \end{aligned}$$

The resulting estimators are calculated under the schemes described above as

$$\begin{aligned} t_{l_1}&={\bar{y}}_{r}\left\{ 1+\theta _1\log \bigg (\frac{{\bar{x}}_{n}}{{\bar{X}}_d}\bigg )\right\} \\ t_{l_2}&={\bar{y}}_{r}\left\{ 1+\theta _2\log \bigg (\frac{{\bar{x}}_{r}}{{\bar{X}}_d}\bigg )\right\} \end{aligned}$$

where $\theta _j$; $j=1,2$ are the suitably chosen scalars.

Theorem 2.2

The MSE and minimum MSE of the consequent synthetic estimators $t_{l_j},~j=1,2$ of the proposed synthetic imputation methods $y_{{.i}_{l_j}}$ under schemes I and II are given by

$$\begin{aligned} MSE(t_{l_1})&={\bar{Y}}_d^2(f_{r}C_{y}^2+\theta _{1}^2f_{n}C_{x}^2-2\theta _1f_{n}\rho _{yx}C_{y}C_{x})\\ MSE(t_{l_2})&={\bar{Y}}_d^2f_{r}(C_{y}^2+\theta _{2}^2C_{x}^2-2\theta _2\rho _{yx}C_{y}C_{x})\\ minMSE(t_{l_1})&={\bar{Y}}^2_dC_{y}^2\left( f_{r}-f_{n}\rho _{yx}^2\right) \\ minMSE(t_{l_2})&={\bar{Y}}^2_dC_{y}^2f_{r}\left( 1-\rho _{yx}^2\right) \end{aligned}$$
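The minimum MSE expressions follow from minimizing the MSE over $\theta _j$, a step the theorem leaves implicit. A sketch for scheme I, where the label $\theta _{1(opt)}$ is introduced here only for illustration:

$$\begin{aligned} \frac{\partial MSE(t_{l_1})}{\partial \theta _1}&=2{\bar{Y}}_d^2f_{n}C_{x}\left( \theta _1C_{x}-\rho _{yx}C_{y}\right) =0\implies \theta _{1(opt)}=\rho _{yx}\frac{C_{y}}{C_{x}}\\ minMSE(t_{l_1})&={\bar{Y}}_d^2\left( f_{r}C_{y}^2+f_{n}\rho _{yx}^2C_{y}^2-2f_{n}\rho _{yx}^2C_{y}^2\right) ={\bar{Y}}^2_dC_{y}^2\left( f_{r}-f_{n}\rho _{yx}^2\right) \end{aligned}$$

The same steps with $f_n$ replaced by $f_r$ give $\theta _{2(opt)}=\rho _{yx}C_{y}/C_{x}$ and the scheme II expression.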

Proposed synthetic Searls type logarithmic imputation methods

In order to increase the efficiency of estimators, Searls²⁶ developed a transformation that multiplies the estimator by a tuning parameter. Therefore, to improve on the above methods, we introduce a tuning parameter $\delta _j,~j=1,2$ in the synthetic logarithmic imputation methods $y_{{.i}_{l_j}}$ and propose synthetic Searls type logarithmic imputation methods for the mean of domain d utilizing auxiliary information under simple random sampling (SRS).

The proposed synthetic Searls type logarithmic imputation methods under schemes I and II are given below.

Scheme I

$$\begin{aligned} y_{{.i}_{s_1}}&= {\left\{ \begin{array}{ll} y_i&{}\text {if }i \in R\\ \frac{1}{n-r}\left[ n\delta _1{\bar{y}}_{r}\left\{ 1+\theta _{1}\log \left( \frac{{\bar{x}}_{n}}{{\bar{X}}_d}\right) \right\} -r{\bar{y}}_{r}\right] &{}\text {if }i \in {\bar{R}} \end{array}\right. } \end{aligned}$$

Scheme II

$$\begin{aligned} y_{{.i}_{s_2}}&= {\left\{ \begin{array}{ll} y_i&{}\text {if }i \in R\\ \frac{1}{n-r}\left[ n\delta _2{\bar{y}}_{r}\left\{ 1+\theta _{2}\log \left( \frac{{\bar{x}}_{r}}{{\bar{X}}_d}\right) \right\} -r{\bar{y}}_{r}\right] &{}\text {if }i \in {\bar{R}} \end{array}\right. } \end{aligned}$$

where $\delta _j$, $j=1,2$ are the suitably chosen scalars. The resulting synthetic estimators are calculated under the schemes described above as

$$\begin{aligned} t_{s_1}&=\delta _1{\bar{y}}_{r}\left\{ 1+\theta _1\log \bigg (\frac{{\bar{x}}_{n}}{{\bar{X}}_d}\bigg )\right\} \\ t_{s_2}&=\delta _2{\bar{y}}_{r}\left\{ 1+\theta _2\log \bigg (\frac{{\bar{x}}_{r}}{{\bar{X}}_d}\bigg )\right\} \end{aligned}$$

Special case

When $\delta _j=1,~j=1,2$, then under schemes I and II, the proposed synthetic Searls type logarithmic imputation methods $y_{{.i}_{s_j}}$ and the corresponding synthetic Searls type logarithmic estimators $t_{s_j}$ reduce to the synthetic logarithmic imputation methods $y_{{.i}_{l_j}}$ and the corresponding synthetic logarithmic estimators $t_{l_j}$, respectively.

Theorem 3.1

The MSE and minimum MSE of the consequent synthetic estimators $t_{s_j},~j=1,2$ of the proposed synthetic imputation methods $y_{{.i}_{s_j}}$ under schemes I and II are given by

$$\begin{aligned} MSE(t_{s_1})&=\left[ \begin{array}{l}{\bar{Y}}_{d}^2+\delta _1^2\left\{ {\bar{Y}}^2_d+f_r{\bar{Y}} _d^2C_y^2+f_n\theta _1^2{\bar{Y}}^2C_x^2+4\theta _1{\bar{Y}}{\bar{Y}}_df_n \rho _{xy}C_xC_y-\theta _1{\bar{Y}}{\bar{Y}}_df_nC_x^2\right\} \\ -2\delta _1\left\{ {\bar{Y}}^2 +\theta _1{\bar{Y}}{\bar{Y}}_df_n\left( \rho _{xy}C_xC_y-\frac{C_x^2}{2}\right) \right\} \end{array}\right] \\ MSE(t_{s_2})&=\left[ \begin{array}{l}{\bar{Y}}_{d}^2+\delta _2^2\left\{ {\bar{Y}}^2_d +f_r{\bar{Y}}_d^2C_y^2+f_r\theta _2^2{\bar{Y}}^2C_x^2+4\theta _2{\bar{Y}}{\bar{Y}}_df_ r\rho _{xy}C_xC_y-\theta _2{\bar{Y}}{\bar{Y}}_df_rC_x^2\right\} \\ -2\delta _2\left\{ {\bar{Y}}^2 +\theta _2{\bar{Y}}{\bar{Y}}_df_r\left( \rho _{xy}C_xC_y- \frac{C_x^2}{2}\right) \right\} \end{array}\right] \\ minMSE(t_{s_1})&={\bar{Y}}^2_d- \frac{Q_1^2}{P_1}\\ minMSE(t_{s_2})&={\bar{Y}}^2_d- \frac{Q_2^2}{P_2} \end{aligned}$$

where

$$\begin{aligned} P_1&={\bar{Y}}^2_d+f_r{\bar{Y}}_d^2C_y^2+f_n\theta _1^2{\bar{Y}}^2C_x^2+4\theta _1{\bar{Y}}{\bar{Y}}_df_n \rho _{xy}C_xC_y-\theta _1{\bar{Y}}{\bar{Y}}_df_nC_x^2,\\ Q_1&={\bar{Y}}^2+\theta _1{\bar{Y}}{\bar{Y}}_df_n\left( \rho _{xy}C_xC_y-\frac{C_x^2}{2}\right) ,\\ P_2&={\bar{Y}}^2_d+f_r{\bar{Y}}_d^2C_y^2+f_r\theta _2^2{\bar{Y}}^2C_x^2+4\theta _2{\bar{Y}}{\bar{Y}}_df_r \rho _{xy}C_xC_y-\theta _2{\bar{Y}}{\bar{Y}}_df_rC_x^2,\\ ~~\text {and}~~ Q_2&={\bar{Y}}^2+\theta _2{\bar{Y}}{\bar{Y}}_df_r\left( \rho _{xy}C_xC_y-\frac{C_x^2}{2}\right) . \end{aligned}$$

Proof

Consider the proposed consequent synthetic estimator $t_{s_1}$ as

$$\begin{aligned} t_{s_1}&=\delta _1{\bar{y}}_{r}\left\{ 1+\theta _1\log \bigg (\frac{{\bar{x}}_{n}}{{\bar{X}}_d}\bigg )\right\} \end{aligned}$$

We can express the above estimator using the notations established in the previous section, with $A=\log \left( {\bar{X}}/{\bar{X}}_d\right)$, as

$$\begin{aligned} t_{s_1}&=\delta _1{\bar{Y}}(1+\varepsilon _0)\left[ 1+{\theta _1}\log \bigg \{\frac{{\bar{X}} (1+\varepsilon _2)}{{\bar{X}}_d}\bigg \}\right] \\&=\delta _1{\bar{Y}}(1+\varepsilon _0)\left[ 1+{\theta _1}\bigg \{\log \left( \frac{{\bar{X}}}{{\bar{X}}_d} \right) +\log (1+\varepsilon _2)\bigg \}\right] \\&=\delta _1{\bar{Y}}(1+\varepsilon _0)\left[ 1+{\theta _1}\bigg \{A+ \left( \varepsilon _2-\frac{\varepsilon _2^2}{2}+\cdots \right) \bigg \}\right] \end{aligned}$$

Simplifying the above expression and neglecting the higher order error terms, we get

$$\begin{aligned} t_{s_1}&=\delta _1{\bar{Y}}\left\{ 1+\varepsilon _0+\theta _1A+\theta _1\left( \varepsilon _2-\frac{\varepsilon _2^2}{2} \right) +\theta _1(A\varepsilon _0+\varepsilon _0\varepsilon _2)\right\} \end{aligned}$$

Subtracting ${\bar{Y}}_d$ from both sides of the above expression, we get

$$\begin{aligned} t_{s_1}-{\bar{Y}}_{d}&=\delta _1{\bar{Y}}(1+\theta _1A)-{\bar{Y}}_{d}+\delta _1{\bar{Y}} \left\{ \varepsilon _0+\theta _1\left( \varepsilon _2-\frac{\varepsilon _2^2}{2}\right) +\theta _1(A\varepsilon _0+\varepsilon _0\varepsilon _2)\right\} \end{aligned}$$

(4)

Squaring both sides of (4) and taking expectations, we get the MSE of the estimator $t_{s_1}$ to the first order of approximation as

$$\begin{aligned} MSE(t_{s_1})&=\left[ \begin{array}{l}\{\delta _1{\bar{Y}}(1+\theta _1A)-{\bar{Y}}_{d}\}^2 +2\delta _1\theta _1{\bar{Y}}\{\delta _1{\bar{Y}}(1+\theta _1A)-{\bar{Y}}_{d}\}f_n \left( \rho _{xy}C_xC_y-\frac{C_x^2}{2}\right) \\ +\delta _1^2{\bar{Y}}^2\left\{ (1+\theta _1A)^2f_rC_y^2+\theta _1^2f_nC_x^2+2\theta _1(1+\theta _1A)f_n\rho _{xy}C_xC_y\right\} \end{array}\right] \end{aligned}$$

(5)

Under the assumption of Searls logarithmic synthetic estimation ${\bar{Y}}(1+\theta _1A)={\bar{Y}}_d$, the $MSE(t_{s_1})$ can be expressed as

$$\begin{aligned} MSE(t_{s_1})&=\left[ \begin{array}{l}{\bar{Y}}_{d}^2+\delta _1^2\left\{ {\bar{Y}}^2_d+f_r{\bar{Y}}_d^2C_y^2 +f_n\theta _1^2{\bar{Y}}^2C_x^2+4\theta _1{\bar{Y}}{\bar{Y}}_df_n\rho _{xy}C_xC_y -\theta _1{\bar{Y}}{\bar{Y}}_df_nC_x^2\right\} \\ -2\delta _1\left\{ {\bar{Y}}^2 +\theta _1{\bar{Y}}{\bar{Y}}_df_n\left( \rho _{xy}C_xC_y-\frac{C_x^2}{2}\right) \right\} \end{array}\right] \nonumber \\&={\bar{Y}}_d^2+\delta _1^2P_1-2\delta _1Q_1 \end{aligned}$$

(6)

where

$$\begin{aligned} P_1&={\bar{Y}}^2_d+f_r{\bar{Y}}_d^2C_y^2+f_n\theta _1^2{\bar{Y}}^2C_x^2 +4\theta _1{\bar{Y}}{\bar{Y}}_df_n\rho _{xy}C_xC_y-\theta _1{\bar{Y}}{\bar{Y}}_df_nC_x^2\\ \text {and}~~ Q_1&={\bar{Y}}^2+\theta _1{\bar{Y}}{\bar{Y}}_df_n\left( \rho _{xy}C_xC_y-\frac{C_x^2}{2}\right) . \end{aligned}$$

Partially differentiating (6) with respect to $\delta _1$ and equating to zero, we get the optimum value of $\delta _1$ as

$$\begin{aligned} \delta _{1(opt)}=\frac{Q_1}{P_1} \end{aligned}$$

Substituting the optimum value of $\delta _1$ from the above expression into (6), we get the minimum MSE of the estimator $t_{s_1}$ as

$$\begin{aligned} min.MSE(t_{s_1})&={\bar{Y}}_{d}^2-\frac{Q_1^2}{P_1} \end{aligned}$$

(7)

Similarly, the first order approximated expressions of MSE and minimum MSE of the proposed synthetic estimator $t_{s_2}$ can be obtained. $\square$
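The last step of the proof is the minimization of a quadratic in $\delta _1$. The Python sketch below evaluates the form ${\bar{Y}}_d^2+\delta ^2P-2\delta Q$ from (6) and its optimum; the function names and the numerical values of $P$, $Q$ and ${\bar{Y}}_d$ are hypothetical and are not taken from the paper's tables.

```python
def mse_quadratic(delta, P, Q, Ybar_d):
    """MSE as a function of delta, in the form of equation (6)."""
    return Ybar_d**2 + delta**2 * P - 2.0 * delta * Q

def searls_optimum(P, Q, Ybar_d):
    """Return (delta_opt, min_mse) for the quadratic above, assuming P > 0."""
    delta_opt = Q / P                    # root of d(MSE)/d(delta) = 2*delta*P - 2*Q
    min_mse = Ybar_d**2 - Q**2 / P       # value of the quadratic at delta_opt
    return delta_opt, min_mse

# Hypothetical inputs chosen only to illustrate the calculation
P, Q, Ybar_d = 104.0, 100.5, 10.0
delta_opt, min_mse = searls_optimum(P, Q, Ybar_d)
mse_at_one = mse_quadratic(1.0, P, Q, Ybar_d)   # delta = 1 is the non-Searls special case
print(delta_opt, min_mse, mse_at_one)
```

Since $\delta =1$ is one admissible choice, the minimum over $\delta$ can never exceed the value of the quadratic at $\delta =1$, which is why the Searls type tuning parameter cannot increase the first order MSE.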

Efficiency conditions

In the present section, we compare the minimum MSE of the proposed synthetic imputation methods with the corresponding minimum MSE of the existing synthetic imputation methods under schemes I and II.

Lemma 4.1

The proposed synthetic Searls type logarithmic imputation methods $y_{.i_{s_j}},~j=1,2$ dominate the synthetic mean imputation method $y_{.i_{m}}$, if

$$\begin{aligned} MSE(t_{s_j})&<MSE(t_m)\implies \frac{Q_j^2}{P_j}>1-\frac{({\bar{Y}}-{\bar{Y}}_d)^2}{{\bar{Y}}_d^2}-\frac{{\bar{Y}}^2}{{\bar{Y}}_d^2}f_{r}C_{y}^2 \end{aligned}$$

Lemma 4.2

The proposed synthetic Searls type logarithmic imputation methods $y_{.i_{s_j}},~j=1,2$ dominate the synthetic ratio imputation methods $y_{.i_{r_j}}$ under schemes I and II, if

$$\begin{aligned} MSE(t_{s_j})&<MSE(t_{r_j})\implies \frac{Q_j^2}{P_j}>1-f_{r}C_{y}^2-f_{n}C_{x}^2+2f_{n}\rho _{yx}C_{y}C_{x} \end{aligned}$$

Lemma 4.3

The proposed synthetic Searls type logarithmic imputation methods $y_{.i_{s_j}},~j=1,2$ dominate the synthetic logarithmic imputation methods $y_{.i_{l_j}}$ under schemes I and II, if

$$\begin{aligned} MSE(t_{s_j})&<MSE(t_{l_j})\implies \frac{Q_j^2}{P_j}>1-C_{y}^2(f_{r}-f_{n}\rho _{yx}^2) \end{aligned}$$

The proposed synthetic Searls type logarithmic imputation methods dominate the synthetic mean per unit imputation method, synthetic ratio imputation methods and synthetic logarithmic imputation methods, if the conditions of the aforementioned lemmas are satisfied. The next section examines these conditions through a comprehensive simulation study.
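For a quick numerical feel for the quantities being compared, the following Python sketch evaluates the MSE expressions (1), (2), (3) and the minimum MSE of the logarithmic estimators from Theorem 2.2. The function names and all parameter values are hypothetical; only $N$, $n$ and $r$ echo one of the simulation settings.

```python
def mse_mean(Ybar, Ybar_d, f_r, C_y):
    """Equation (1): squared synthetic bias plus variance of the respondent mean."""
    return (Ybar - Ybar_d)**2 + Ybar**2 * f_r * C_y**2

def mse_ratio(Ybar_d, f_r, f_n, C_y, C_x, rho, scheme):
    """Equations (2) and (3): scheme I uses f_n for the x terms, scheme II uses f_r."""
    f = f_n if scheme == 1 else f_r
    return Ybar_d**2 * (f_r * C_y**2 + f * C_x**2 - 2.0 * f * rho * C_y * C_x)

def min_mse_log(Ybar_d, f_r, f_n, C_y, rho, scheme):
    """Minimum MSE of the synthetic logarithmic estimators (Theorem 2.2)."""
    f = f_n if scheme == 1 else f_r
    return Ybar_d**2 * C_y**2 * (f_r - f * rho**2)

# Hypothetical parameter values
N, n, r = 6000, 200, 170
f_r, f_n = 1.0 / r - 1.0 / N, 1.0 / n - 1.0 / N
Ybar, Ybar_d, C_y, C_x, rho = 20.0, 19.0, 0.4, 0.5, 0.5

results = {
    "mean": mse_mean(Ybar, Ybar_d, f_r, C_y),
    "ratio_I": mse_ratio(Ybar_d, f_r, f_n, C_y, C_x, rho, scheme=1),
    "ratio_II": mse_ratio(Ybar_d, f_r, f_n, C_y, C_x, rho, scheme=2),
    "log_I": min_mse_log(Ybar_d, f_r, f_n, C_y, rho, scheme=1),
    "log_II": min_mse_log(Ybar_d, f_r, f_n, C_y, rho, scheme=2),
}
for name, value in results.items():
    print(name, value)
```

Under these formulas the optimal logarithmic estimator is never worse than the ratio estimator in the same scheme, because the difference of the two expressions is proportional to $(C_x-\rho _{yx}C_y)^2$.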

Simulation study

A simulation study is carried out to assess the effectiveness of the suggested synthetic imputation methods in comparison to the adapted synthetic imputation methods. In the simulation procedure, certain symmetrical and asymmetrical populations are generated in accordance with the models employed by Singh and Horn²⁷. The models used are as follows:

$$\begin{aligned} y&=5.5+\sqrt{(1-\rho _{xy}^2)}~y^*+\rho _{xy}\left( \frac{S_y}{S_x}\right) x^*\\ x&=5.3+x^* \end{aligned}$$

where $x^*$ and $y^*$ are independent variables drawn from the corresponding distributions. Using the above models, we have generated the following populations:

1. A Normal population of size N=6000 using $x^*\sim N(12,35)$ and $y^*\sim N(13,45)$ with varying correlation coefficients $\rho _{xy}$=0.1, 0.5, 0.9.
2. A Gamma population of size N=6000 using $x^*\sim G(0.02,0.006)$ and $y^*\sim G(0.2,0.011)$ with varying correlation coefficients $\rho _{xy}$=0.1, 0.5, 0.9.
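A minimal Python sketch of how the normal population could be generated from the models above. Two points are assumptions, since the text does not specify them: the second argument of $N(\cdot ,\cdot )$ is read as a variance, and $S_y$, $S_x$ are taken as the standard deviations of $y^*$ and $x^*$. The function name `normal_population` and the seed are hypothetical.

```python
import numpy as np

def normal_population(N, rho, seed=0):
    """Generate (y, x) from the normal models, under the stated assumptions."""
    rng = np.random.default_rng(seed)
    # Assumption: N(12, 35) and N(13, 45) are parameterised as (mean, variance)
    x_star = rng.normal(12.0, np.sqrt(35.0), size=N)
    y_star = rng.normal(13.0, np.sqrt(45.0), size=N)
    # Assumption: S_x and S_y are the standard deviations of x* and y*
    S_x = x_star.std(ddof=1)
    S_y = y_star.std(ddof=1)
    y = 5.5 + np.sqrt(1.0 - rho**2) * y_star + rho * (S_y / S_x) * x_star
    x = 5.3 + x_star
    return y, x

y, x = normal_population(6000, rho=0.9)
achieved_rho = np.corrcoef(y, x)[0, 1]
print(achieved_rho)   # close to the target correlation, up to sampling noise
```

The gamma population could be generated the same way once the parameterisation of $G(\cdot ,\cdot )$ (shape and scale, or shape and rate) is fixed; it is left out here because the text does not say which is meant.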

The above populations are divided into 6 equal domains of size 1000. We have drawn random samples of sizes $(n_1,~n_2,~n_3,~n_4,~n_5,~n_6)=(200,~250,~300,~350,~100,~150)$ from the respective domains and considered varying numbers of responding units $r_1=(170,~180)$, $r_2=(230,~240)$, $r_3=(270,~280)$, $r_4=(330,~340)$, $r_5=(80,~90)$, and $r_6=(130,~140)$ in the respective samples. The imputation is then carried out and the MSE of the consequent estimators is computed using 15,000 iterations. The simulation procedure consists of the following steps.

(i) Select a sample s of size n randomly from the population of size N.
(ii) Randomly remove ($n_d$-$r_d$) units from the sample s each time.
(iii) Impute the removed units using the imputation methods under study.
(iv) Compute the required statistics.
(v) Repeat the preceding steps 15,000 times.

We report both the empirical (simulated) mean square error (EMSE) and the theoretical mean square error (TMSE). The TMSE is calculated using the MSE expressions of the respective estimators obtained in “Adapted imputation methods” and “Proposed synthetic Searls type logarithmic imputation methods”, while the EMSE is calculated using the following formula:

$$\begin{aligned} EMSE(t_{*})&=\frac{1}{15,000}\sum _{i=1}^{15,000}(t_{*}-{\bar{Y}}_d)^2 \end{aligned}$$

(8)

where $t_{*}$=$t_{m}$, $t_{r_j},~j=1,2$, $t_{l_j}$, $t_{s_j}$.
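The simulation steps and the EMSE formula (8) can be sketched in Python as follows, for the mean and ratio estimators only. This is a simplified illustration, not the authors' code: the function name `emse`, the toy population, the domain definition and the reduced number of iterations are all assumptions. Because each estimator equals the mean of the completed sample, the sketch computes the estimators directly rather than building the imputed data set.

```python
import numpy as np

def emse(y, x, domain_mask, n, r, iterations=15_000, seed=0):
    """Empirical MSE of t_m, t_r1 and t_r2 about the domain mean, as in (8)."""
    rng = np.random.default_rng(seed)
    Ybar_d = y[domain_mask].mean()       # target: mean of y in domain d
    Xbar_d = x[domain_mask].mean()       # known domain mean of x
    N = y.size
    sq_err = {"t_m": 0.0, "t_r1": 0.0, "t_r2": 0.0}
    for _ in range(iterations):
        s = rng.choice(N, size=n, replace=False)       # step (i): SRSWOR sample
        resp = rng.choice(s, size=r, replace=False)    # step (ii): respondents, MCAR
        ybar_r = y[resp].mean()
        estimates = {                                  # steps (iii)-(iv)
            "t_m": ybar_r,
            "t_r1": ybar_r * Xbar_d / x[s].mean(),     # scheme I uses x-bar over n units
            "t_r2": ybar_r * Xbar_d / x[resp].mean(),  # scheme II uses x-bar over r units
        }
        for name, t in estimates.items():
            sq_err[name] += (t - Ybar_d) ** 2
    return {name: total / iterations for name, total in sq_err.items()}

# Hypothetical toy population with y strongly related to x
rng = np.random.default_rng(1)
x_pop = rng.uniform(10.0, 20.0, size=600)
y_pop = 2.0 * x_pop + rng.normal(0.0, 1.0, size=600)
mask = np.arange(600) < 100                            # first 100 units form the domain
result = emse(y_pop, x_pop, mask, n=60, r=50, iterations=500)
print(result)
```

Setting `iterations=15_000` matches the number of replications stated in the text; the smaller value in the usage example only keeps the run short.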

The results of the consequent synthetic estimators for normal and gamma populations are reported in Tables 1 and 2, respectively.

Key results of simulation study

We summarize the key results of the simulation study reported in Tables 1 and 2 in the following points.

1. The outcomes drawn from the normal population for the consequent synthetic estimators are reported in Table 1. These outcomes show that:
  (a) the EMSE and TMSE of the consequent synthetic ratio estimator $t_{r_1}$ under scheme I decrease as the correlation coefficient $\rho _{xy}$ increases from 0.1 to 0.9. The same tendency can be observed under scheme II for the estimator $t_{r_2}$.
  (b) the EMSE and TMSE of the consequent synthetic logarithmic estimator $t_{l_1}$ under scheme I decrease as the correlation coefficient $\rho _{xy}$ increases from 0.1 to 0.9. The same tendency can be observed under scheme II for the consequent synthetic logarithmic estimator $t_{l_2}$.
  (c) the EMSE and TMSE of the consequent synthetic Searls type logarithmic estimator $t_{s_1}$ under scheme I decrease as the correlation coefficient $\rho _{xy}$ increases from 0.1 to 0.9. The same tendency can be observed under scheme II for the consequent synthetic Searls type logarithmic estimator $t_{s_2}$.
  (d) the EMSE and TMSE of the consequent synthetic ratio estimators, synthetic logarithmic estimators, and synthetic Searls type logarithmic estimators decrease as the number of responding units $r_d$ increases under schemes I and II in each domain.
  (e) the EMSE and TMSE of the consequent synthetic ratio estimators, synthetic logarithmic estimators, and synthetic Searls type logarithmic estimators under both schemes in each domain are observed to be very close to each other.
  (f) the consequent synthetic Searls type logarithmic estimators $t_{s_j},~j=1,2$ perform better than the adapted synthetic mean estimator $t_{m}$, synthetic ratio estimators $t_{r_j}$, and synthetic logarithmic estimators $t_{l_j}$ under schemes I and II.
2. A tendency similar to that observed in Table 1 for the normal population can also be observed in the results of Table 2 for the gamma population.
3. Finally, from the results of Tables 1 and 2, the performance of the synthetic ratio estimators, synthetic logarithmic estimators, and synthetic Searls type logarithmic estimators is better under scheme II than under scheme I.

Table 1 EMSE and TMSE of synthetic estimators under normal population.


Table 2 EMSE and TMSE of synthetic estimators under gamma population.


Real data application

Like most other Indian states, Uttar Pradesh is divided into several districts for the purpose of collecting taxes and conducting other administrative and agricultural work. Each district is further divided into a number of tehsils, and each tehsil is further divided into several blocks. Blocks are referred to as small domains in this study.

Since the area used for cultivation determines the yield of every crop, for the real data application we consider the problem of estimating agricultural output for various blocks in the Agra district of Uttar Pradesh. Six blocks in the Agra district are treated as small domains. The amount of Bajra crop produced (in tonnes) in the agricultural season 2021–2022 is regarded as the study variable y, whilst the area under Bajra crop (in hectares) in the agricultural season 2021–2022 is regarded as the auxiliary variable x. Information regarding the blocks of Agra district is reported in Table 3, whereas, for easy reference, the parameters for each domain are shown in Table 4.

Table 3 Total production and area under Bajra crop in Blocks of Agra district for agricultural season 2021–2022.


Table 4 Population parameters for different domains.


From the domain sizes $(N_1,~N_2,~N_3,~N_4,~N_5,~N_6)=(38,~53,~66,~45,~44,~53)$ given in Table 4, we have selected samples of sizes $(n_1,~n_2,~n_3,~n_4,~n_5,~n_6)=(8,~11,~13,~9,~9,~11)$, respectively. Within these samples, the numbers of responding units are taken as $r_1=(5,~7)$, $r_2=(7,~9)$, $r_3=(9,~11)$, $r_4=(5,~7)$, $r_5=(5,~7)$, and $r_6=(7,~9)$, respectively. Using the domain parameters given in Tables 3 and 4, we have computed the MSE of the proposed synthetic estimators.

The results based on the real data for the synthetic estimators are reported in Table 5. Under both schemes, they show the dominance of the proposed synthetic Searls type logarithmic imputation methods over the corresponding synthetic mean, ratio, and logarithmic type imputation methods. The MSE of the adapted and proposed synthetic estimators decreases as the number of responding units increases under both schemes in each domain. Moreover, the adapted synthetic ratio imputations, synthetic logarithmic imputations and the proposed synthetic Searls type logarithmic imputations perform better under scheme II than under scheme I.

Table 5 MSE of synthetic estimators for real population.


Conclusions

In the current article, we have adapted synthetic mean, ratio, and logarithmic imputation methods, and proposed synthetic Searls type logarithmic imputation methods, for the estimation of the domain mean in the case of missing data under simple random sampling. The algebraic expressions of the MSE for the proposed imputation methods are derived to the first order of approximation. The efficiency conditions are obtained by comparing the MSE expressions of the proposed and adapted imputation methods. Furthermore, a comprehensive simulation is carried out using artificially generated normal (symmetric) and gamma (asymmetric) populations in order to assess the performance of the suggested imputation approaches. The EMSE and TMSE obtained in the simulation study show that, for varying values of the correlation coefficient as well as of the number of responding units in each domain, the suggested synthetic Searls type logarithmic imputation methods outperform the adapted synthetic mean, ratio, and logarithmic imputation methods. Further, from the results of Tables 1 and 2, the EMSE and TMSE of the adapted and suggested estimators are observed to be very close to each other under both schemes in each domain. In addition, a real data set based on the production of Bajra crops in the Agra district of Uttar Pradesh, India, is used to demonstrate the applicability of the suggested imputation approaches. The results of the real data also favour the suggested imputations over the adapted imputations. Therefore, when missing data are encountered under SAE, survey practitioners may be advised to employ the suggested imputation procedures.

Data availability

All data generated or analysed during this study are included in this published article.

References

Gonzalez, M. E. Use and evaluation of synthetic estimates. In Proceedings of the Social Statistics Section American Statistical Association, 33–36 (1973).
Tikkiwal, G. C. & Ghiya, A. A generalized class of synthetic estimators with application to crop acreage estimation for small domains. Biom. J. 42(7), 865–876 (2000).
Pandey, K. K. & Tikkiwal, G. C. Generalized class of synthetic estimators for small area under systematic sampling design. Stat. Trans. New Ser. Pol. 11(1), 75–89 (2010).
Tikkiwal, G. C., Rai, P. K. & Ghiya, A. On the performance of generalized regression estimator for small domains. Commun. Stat. Simul. Comput. 42(4), 891–909 (2013).
Ashutosh, A., Shahzad, U., Al-Noor, N. H. & Rai, P. K. Simulation study of small domain with calibration approach. Concurr. Comput. 34(27), e7323. https://doi.org/10.1002/cpe.7323 (2022).
Ashutosh, A., Shahzad, U. & Al-Noor, N. H. Calibration estimation of subpopulation total for direct and indirect situations. Commun. Stat. Theory Methods https://doi.org/10.1080/03610926.2023.2256437 (2023).
Bhushan, S., Kumar, A. & Pokhrel, R. Logarithmic type direct and synthetic estimators using bivariate auxiliary information with an application to real data. J. Ind. Soc. Ag. Stat. 77(1), 133–148 (2023).
Rubin, D. B. Inference and missing data. Biometrika 63(3), 581–592 (1976).
Heitjan, D. F. & Basu, S. Distinguishing ‘Missing at Random’ and ‘Missing Completely at Random’. Am. Stat. 50(3), 207–213 (1996).
Rueda, M., Gonzalez, S. & Arcos, A. Indirect methods of imputation of missing data based on available units. Appl. Math. Comput. 164(1), 249–261 (2005).
Toutenburg, H. & Srivastava, V. K. Estimation of ratio of population means in survey sampling when some observations are missing. Metrika 48, 177–187 (1998).
Toutenburg, H., Srivastava, V. K. & Shalabh. Amputation versus imputation of missing values through ratio method in sample surveys. Stat. Pap. 49, 237–247 (2008).
Singh, S. & Horn, S. Compromised imputation in survey sampling. Metrika 51, 267–276 (2000).
Prasad, S. A study on new methods of ratio exponential type imputation in sample surveys. Hacettepe J. Math. Stat. 47(5), 1281–1301 (2018).
Singh, S. & Deo, B. Imputation by power transformation. Stat. Pap. 44, 555–579 (2003).
Singh, S. A new method of imputation in survey sampling. Stat. J. Theoret. Appl. Stat. 43(5), 499–511 (2009).
Ahmed, M. S., Al-Titi, O., Al-Rawi, Z. & Abu-Dayyeh, W. Estimation of a population mean using different imputation methods. Stat. Transit. 7(6), 1247–1264 (2006).
Bhushan, S. & Pandey, A. P. Optimal imputation of the missing data using multi auxiliary information. Comput. Stat. 36(1), 449–477 (2020).
Bhushan, S. & Pandey, A. P. Optimality of ratio-type imputation methods for estimation of population mean using higher order moment of an auxiliary variable. J. Stat. Theory Pract. 15, 1–35 (2021).
Bhushan, S., Kumar, A., Pandey, A. P. & Singh, S. Estimation of population mean in presence of missing data under simple random sampling. Commun. Stat. Simul. Comput. 52(12), 6048–6069 (2023).
Bhushan, S., Kumar, A., Zaman, T. & Al Mutairi, A. Efficient difference and ratio-type imputation methods under ranked set sampling. Axioms 12(6), 558 (2023).
Prasad, S. Some compromised exponential ratio type imputation methods in simple random sampling. Proc. Natl. Acad. Sci. India Sect. A Phys. Sci. 91, 337–349 (2021).
Prasad, S. & Yadav, V. K. Imputation of missing data through product type exponential methods in sampling theory. Rev. Colomb. Estad. 46(1), 111–127 (2023).
Bhushan, S. & Kumar, A. Imputation of missing data using multi auxiliary information under ranked set sampling. Commun. Stat. Simul. Comput. https://doi.org/10.1080/03610918.2023.2288796 (2023).
Lee, H., Rancourt, E. & Sarndal, C. E. Experiments with variance estimation from survey data with imputed values. J. Off. Stat. 10(3), 231–243 (1994).
Searls, D. T. The utilization of a known coefficient of variation in the estimation procedure. J. Am. Stat. Assoc. 59(308), 1225–1226 (1964).
Singh, H. P. & Horn, S. An alternative estimator for multi-character surveys. Metrika 48, 99–107 (1998).

Acknowledgements

This research project was supported by the Researchers Supporting Project Number (RSPD2024R1004), King Saud University, Riyadh, Saudi Arabia.

Author information

Authors and Affiliations

Department of Statistics, University of Lucknow, Lucknow, 226007, India
Shashi Bhushan
Department of Statistics, Faculty of Basic Science, Central University of Haryana, Mahendergarh, 123031, India
Anoop Kumar
Department of Mathematics and Statistics, Dr. Shakuntala Misra National Rehabilitation University, Lucknow, India
Rohini Pokhrel
Department of Statistics and Operations Research, College of Science, King Saud University, P.O. Box 2455, Riyadh, 11451, Saudi Arabia
M. E. Bakr
Department of Statistics, Wachemo University, Hosaina, Ethiopia
Getachew Tekle Mekiso

Contributions

S.B.: supervision, review and editing; A.K.: writing original manuscript, methodology, simulation study, software, review and editing; R.P.: software and data curation; M.E.B. and G.T.M.: funding and project administration.

Corresponding author

Correspondence to Getachew Tekle Mekiso.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Bhushan, S., Kumar, A., Pokhrel, R. et al. Design based synthetic imputation methods for domain mean. Sci Rep 14, 4250 (2024). https://doi.org/10.1038/s41598-024-53909-0

Received: 04 September 2023
Accepted: 06 February 2024
Published: 21 February 2024
DOI: https://doi.org/10.1038/s41598-024-53909-0
