Design based synthetic imputation methods for domain mean

In practice, the available data are often insufficient to provide accurate estimates for a domain, and small area estimation (SAE) techniques are used to obtain accurate estimates of the variable under study. Missing data are a serious problem in sample surveys in general, and small area estimates are especially prone to them. This paper is a basic effort that suggests design based synthetic imputation methods for domain mean estimation under simple random sampling in order to address the issue of missing data under SAE. The expressions for the mean square error of the proposed imputation methods are obtained up to the first order of approximation. The efficiency conditions are determined, and a thorough simulation study is carried out using artificially generated data sets. An application to real data is included that further supports this study.

The majority of surveys are only intended to offer statistically valid, design-based estimates at the national and/or state/territory geographic levels. Implementing and carrying out sample surveys that would produce accurate estimates at levels smaller than state/territory would be extremely difficult and expensive, both in terms of the larger sample sizes needed and the increased burden on survey respondents. Small area estimation (SAE) techniques are used to get around the issue of small sample sizes and to improve on the accuracy of direct survey estimates derived from the sample in each small region. Direct, synthetic, and other indirect estimators are some of the techniques used for SAE. Direct estimators employ information solely from the specified region under study. They are mostly unbiased but very unstable, with large variation. Indirect and composite estimators are more accurate because they additionally include information from related variables or nearby areas.
Direct estimators have been shown to produce unacceptably large standard errors as a result of asymmetric small samples from the relevant small area. In reality, there may be circumstances in which no sample units can be selected from some of the small domains. Finding indirect (synthetic) estimators that dramatically increase the sample size and consequently reduce the standard error of the estimator is therefore necessary to achieve appropriate statistical accuracy. According to Gonzalez 1, "an estimator is called a synthetic estimator if a reliable direct estimator for a large area, covering several small domains, is used to derive an indirect estimate for a small domain, under the assumption that the small areas have the same characteristics as the large area". Developing indirect estimators for small areas is necessary since sufficient sample data are lacking in small geographic areas. Numerous researchers, particularly in the fields of health, agriculture, and poverty, have developed synthetic estimators. According to recent research by Tikkiwal and Ghiya 2, Pandey and Tikkiwal 3, Tikkiwal et al. 4, Ashutosh et al. 5,6, and Bhushan et al. 7, small area estimators based on auxiliary information outperform those that exclude it.
The issue of missing data is persistent in sample surveys and must be addressed promptly to protect the validity of any conclusions drawn from such data. Properties of the estimators such as unbiasedness and efficiency might both be compromised by missing data. Imputation is the preferred and most often used method for dealing with missing data. Rubin 8 proposed three fundamental concepts in his landmark work: missing at random (MAR), observed at random (OAR), and parameter distribution (PD). A distinction between missing at random (MAR) and missing completely at random (MCAR) was provided by Heitjan and Basu 9. Many authors have addressed the issue of missing data, and different imputation approaches have been used to fill in the gaps. The availability of adequate supplementary information is critical for the creation of effective imputation schemes. Numerous prominent researchers, including Rueda et al. 10, Toutenburg and Srivastava 11, Toutenburg et al. 12, Singh and Horn 13, Prasad 14, Singh and Deo 15, Singh 16, Ahmed et al. 17, Bhushan and Pandey 18,19, Bhushan et al. 20,21, Prasad 22, Prasad and Yadav 23, and Bhushan and Kumar 24, have worked in this field and developed imputation methods and the corresponding estimators for missing data utilizing auxiliary information. In this study, we assume the MCAR mechanism throughout when imputing missing data.
Further, in the literature, no imputation method is available to address the issue of missing data under SAE. Therefore, the objectives of this article are: (i) to propose some fundamental imputation methods, namely the mean, ratio, and logarithmic type, for estimating the domain mean; (ii) to propose Searls type logarithmic imputation methods for estimating the domain mean; (iii) to compare the fundamental imputation methods with our Searls type logarithmic imputation methods.
Note that while imputing the missing observations, we do not modify the original responses. The methodology and notations used in this study are discussed below.

Methodology and notations
Consider a specified population $\Omega = \{1, 2, \ldots, N\}$ of size $N$ from which a simple random sample $s$ of size $n$ is drawn without replacement. In order to estimate the mean of domain $d$, we use the information collected in the sample. Further, let $r_d$ and $r$ be the numbers of responding units among the chosen $n_d$ and $n$ units, and let $R_d$ and $R$ be the sets of responding units in the domain $d$ and in the total population, respectively. Also, $R_d^c$ and $R^c$ symbolize the sets of non-responding units in the domain $d$ and in the total population, respectively. For all units $i \in R$, the quantity $y_i$ is obtained, but for the units $i \in R^c$, the quantities are missing and imputed values must be obtained to complete the sample data set. Suppose the imputation makes use of additional auxiliary information $X$, so that $X_i$, the value of $X$ for unit $i$, is available and positive for all $i \in s$; that is, the data $X_s = \{X_i; i \in s\}$ are available.
To derive the mean square error (MSE) of the consequent synthetic estimators of the proposed synthetic imputation methods, we take the following notations: … where $f_r = \frac{1}{r} - \frac{1}{N}$ and $f_n = \frac{1}{n} - \frac{1}{N}$, $C_y$ and $C_x$ are the coefficients of variation of the study and auxiliary variables, respectively, and $\rho_{yx}$ is the correlation coefficient between the study and auxiliary variables.
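As a minimal sketch of how the quantities $f_r$, $f_n$, $C_y$, $C_x$ and $\rho_{yx}$ might be computed in practice, consider the following Python snippet. The function names, the toy data, and the values $n = 600$ and $r = 450$ are illustrative assumptions and do not come from the paper; the variances use the $N - 1$ divisor, which is a common survey-sampling convention and is also assumed here.

```python
import numpy as np

def finite_population_factors(N, n, r):
    """Return (f_r, f_n) with f_r = 1/r - 1/N and f_n = 1/n - 1/N."""
    return 1.0 / r - 1.0 / N, 1.0 / n - 1.0 / N

def population_summaries(y, x):
    """Coefficients of variation of y and x and their correlation coefficient.

    Assumption: standard deviations use the (N - 1) divisor.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    c_y = y.std(ddof=1) / y.mean()
    c_x = x.std(ddof=1) / x.mean()
    rho_yx = np.corrcoef(y, x)[0, 1]
    return c_y, c_x, rho_yx

# Hypothetical sizes: N matches the simulated populations, n and r are made up.
f_r, f_n = finite_population_factors(N=6000, n=600, r=450)

# Toy data (not from the paper) in which y is exactly proportional to x.
c_y, c_x, rho_yx = population_summaries([2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 3.0, 4.0])
```

Since $r \le n$, the factor $f_r$ is never smaller than $f_n$, which is why terms weighted by $f_r$ contribute at least as much to the MSE as those weighted by $f_n$.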
The remainder of the article is organized as follows. In "Adapted imputation methods" and "Proposed synthetic Searls type logarithmic imputation methods", respectively, the adapted and proposed imputation methods are presented together with formulae for the mean square error (MSE). In "Efficiency conditions", a comparison of the various imputation strategies is given. In "Simulation study", a comprehensive simulation analysis using a few artificial populations is provided, and the main simulation results are explored. In "Real data application", an application to actual data is provided. "Conclusions" closes the article with some final remarks.

Adapted imputation methods
Since the literature contains no imputation methods for estimating the mean of domain $d$ in the presence of missing data, we adapt some conventional imputation methods for the estimation of the domain mean.

Conventional mean imputation method
When information on the auxiliary variables is not available, the conventional mean imputation method is the obvious choice. When the $i$th sample unit in domain $d$ is missing and requires imputation, we suggest the mean imputation of the domain mean by extending the notation of Lee et al. 25 for unit value imputation. The synthetic mean imputation technique for the domain mean is given by …
The consequent synthetic estimator is …
The MSE of the consequent synthetic mean estimator is …
The imputation approaches are distinguished into two schemes when additional auxiliary information is taken into account.
Scheme I: When $\bar{X}_d$ is known and $\bar{x}_{n,d}$ is used.
Scheme II: When $\bar{X}_d$ is known and $\bar{x}_{r,d}$ is used.
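The mean imputation idea described in this subsection can be illustrated with a short, hedged Python sketch. It shows generic respondent-mean imputation, in which observed values are kept and every missing value is replaced by the mean of the responding units; whether the donor mean is taken over the domain or over the whole sample in the paper's synthetic version is not recoverable from the text, so the set of donors here is an assumption. The function name and the toy data are hypothetical.

```python
import numpy as np

def mean_impute(y, responded):
    """Replace non-respondents' values by the respondent mean.

    Observed responses are left unchanged; only missing entries are filled.
    """
    y = np.asarray(y, dtype=float)
    responded = np.asarray(responded, dtype=bool)
    y_bar_r = y[responded].mean()
    completed = np.where(responded, y, y_bar_r)
    return completed, y_bar_r

# Toy sample (not from the paper): NaN marks a non-responding unit.
y = np.array([10.0, 12.0, np.nan, 14.0, np.nan])
responded = ~np.isnan(y)

completed, y_bar_r = mean_impute(y, responded)
point_estimate = completed.mean()
```

After this kind of imputation, the mean of the completed data equals the respondent mean, so the point estimator does not depend on how many values were filled in.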

Synthetic ratio imputation methods
The ratio imputation method provides efficient results when the study and auxiliary variables are positively correlated. The classical synthetic ratio imputation methods under schemes I and II are defined as …
The consequent synthetic ratio estimators under the above schemes are …

Theorem 2.1
The MSE of the consequent synthetic ratio estimators $t_{r_j}$, $j = 1, 2$, of the synthetic ratio imputation methods $y_{.i r_j}$ under schemes I and II is given by …
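For orientation, the classical (non-synthetic) ratio imputation that this subsection builds on can be sketched as below. This is not the paper's scheme I or scheme II method, whose expressions additionally involve the known domain mean $\bar{X}_d$ and are not reproduced in this text; it is the standard form in which a missing $y_i$ is replaced by $\hat{b}\,x_i$ with $\hat{b} = \sum_{i \in R} y_i / \sum_{i \in R} x_i$. The function name and data are hypothetical.

```python
import numpy as np

def ratio_impute(y, x, responded):
    """Classical ratio imputation: fill missing y_i with b_hat * x_i."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    responded = np.asarray(responded, dtype=bool)
    # Ratio of respondent totals, equivalently y_bar_r / x_bar_r.
    b_hat = y[responded].sum() / x[responded].sum()
    completed = np.where(responded, y, b_hat * x)
    return completed, b_hat

# Toy sample (not from the paper): x is observed for every sampled unit.
y = np.array([10.0, 12.0, np.nan, 14.0, np.nan])
x = np.array([5.0, 6.0, 9.0, 7.0, 5.0])
responded = ~np.isnan(y)

completed, b_hat = ratio_impute(y, x, responded)
point_estimate = completed.mean()

# Closed form of the resulting estimator: y_bar_r * x_bar_n / x_bar_r.
closed_form = y[responded].mean() * x.mean() / x[responded].mean()
```

The mean of the completed data coincides with the familiar ratio-type point estimator $\bar{y}_r \bar{x}_n / \bar{x}_r$, which is what makes the imputation and the estimator two views of the same procedure.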

Synthetic logarithmic imputation methods
The proposed synthetic logarithmic imputation methods under schemes I and II are given below.
Scheme I

Scheme II
The resulting estimators are calculated under the schemes described above as … where $\theta_j$, $j = 1, 2$, are suitably chosen scalars.

Theorem 2.2
The MSE and minimum MSE of the consequent synthetic estimators $t_{l_j}$, $j = 1, 2$, of the proposed synthetic imputation methods $y_{.i l_j}$ under schemes I and II are given by …

Proposed synthetic Searls type logarithmic imputation methods
In order to increase the efficiency of estimators, Searls 26 developed a transformation that multiplies the estimator by a tuning parameter. Therefore, in order to improve on the above methods, we introduce a tuning parameter $\delta_j$, $j = 1, 2$, into the synthetic logarithmic imputation methods $y_{.i l_j}$ and propose synthetic Searls type logarithmic imputation methods for the mean of domain $d$ utilizing auxiliary information in SRS.
The proposed synthetic Searls type logarithmic imputation methods under schemes I and II are given below.
Scheme I … where $\delta_j$, $j = 1, 2$, are suitably chosen scalars.
The resulting synthetic estimators are calculated under the schemes described above as …

Special case
When $\delta_j = 1$, $j = 1, 2$, the proposed synthetic Searls type logarithmic imputation methods $y_{.i s_j}$ and the corresponding synthetic Searls type logarithmic estimators $t_{s_j}$ under schemes I and II reduce to the synthetic logarithmic imputation methods $y_{.i l_j}$ and the corresponding synthetic logarithmic estimators $t_{l_j}$, respectively.

Theorem 3.1
The MSE and minimum MSE of the consequent synthetic estimators $t_{s_j}$, $j = 1, 2$, of the proposed synthetic imputation methods $y_{.i s_j}$ under schemes I and II are given by … where …
Proof. Consider the proposed consequent synthetic estimator $t_{s_1}$ as …
We can express the above estimator using the notations established in the previous section as …
Simplifying the above expression and neglecting the higher order error terms, we get … and …
Subtracting $\bar{Y}_d$ from both sides of the above expression, we get …
Squaring both sides of (4) and taking expectations, we get the MSE of the estimator $t_{s_1}$ to the first order of approximation as …
Under the assumption of Searls logarithmic synthetic estimation, $\bar{Y}(1 + \theta_1 A) = \bar{Y}_d$, the $\mathrm{MSE}(t_{s_1})$ can be expressed as … where …
Partially differentiating (6) with respect to $\delta_1$ and equating to zero, we get the optimum value of $\delta_1$ as …
Putting the optimum value of $\delta_1$ from the above expression into (6), we get the minimum MSE of the estimator $t_{s_1}$ as …
Similarly, the first order approximated expressions of the MSE and minimum MSE of the proposed synthetic estimator $t_{s_2}$ can be obtained.
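The optimization step in the proof (differentiate the MSE with respect to the tuning parameter and set the derivative to zero) can be illustrated on the simplest Searls type estimator, $\delta \bar{y}$, rather than on the logarithmic estimator of the theorem. For this simpler case, under simple random sampling without replacement with $f = \frac{1}{n} - \frac{1}{N}$, the MSE is $\bar{Y}^2[(\delta - 1)^2 + \delta^2 f C_y^2]$, the optimum is $\delta_{opt} = 1/(1 + f C_y^2)$, and the minimum MSE is $\bar{Y}^2 f C_y^2 / (1 + f C_y^2)$. The Python sketch below checks this numerically; the numeric inputs are hypothetical and the function names are illustrative.

```python
import numpy as np

def searls_mse(delta, y_bar, f, c_y):
    """MSE of delta * (sample mean): Ybar^2 * [(delta - 1)^2 + delta^2 * f * C_y^2]."""
    return y_bar**2 * ((delta - 1.0) ** 2 + delta**2 * f * c_y**2)

def searls_optimum(f, c_y):
    """Value of delta that minimizes searls_mse (root of the derivative in delta)."""
    return 1.0 / (1.0 + f * c_y**2)

# Hypothetical inputs, chosen only for illustration.
y_bar, c_y = 13.0, 0.5
f = 1.0 / 450 - 1.0 / 6000

delta_opt = searls_optimum(f, c_y)
min_mse = searls_mse(delta_opt, y_bar, f, c_y)
mse_without_tuning = searls_mse(1.0, y_bar, f, c_y)  # delta = 1 gives the usual mean

# Numerical check of the optimum on a fine grid of delta values.
grid = np.linspace(0.9, 1.1, 20001)
grid_best = grid[np.argmin(searls_mse(grid, y_bar, f, c_y))]
```

Setting $\delta = 1$ recovers the untuned estimator, so the minimum MSE can never exceed the untuned MSE; the same reasoning underlies the special case stated for $\delta_j = 1$ in the preceding subsection.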

Efficiency conditions
In the present section, we compare the minimum MSE of the proposed synthetic imputation methods with the corresponding minimum MSE of the existing synthetic imputation methods under schemes I and II.

Lemma 4.1
The proposed synthetic Searls type logarithmic imputation methods $y_{.i s_j}$, $j = 1, 2$, dominate the synthetic mean imputation method $y_{.i m}$ if …

Lemma 4.2
The proposed synthetic Searls type logarithmic imputation methods $y_{.i s_j}$, $j = 1, 2$, dominate the synthetic ratio imputation methods $y_{.i r_j}$ under schemes I and II if …

Lemma 4.3
The proposed synthetic Searls type logarithmic imputation methods $y_{.i s_j}$, $j = 1, 2$, dominate the synthetic logarithmic imputation methods $y_{.i l_j}$ under schemes I and II if …
The proposed synthetic Searls type logarithmic imputation methods dominate the synthetic mean per unit imputation method, the synthetic ratio imputation methods and the synthetic logarithmic imputation methods if the …

Simulation study
A simulation study is carried out to assess the effectiveness of the suggested synthetic imputation methods in comparison to the adapted synthetic imputation methods. In the simulation procedure, certain symmetric and asymmetric populations are generated in accordance with the models employed by Singh and Horn 27. The models used are as follows: … where $x^*$ and $y^*$ are independent variables from the corresponding distributions. Using the above models, we have generated the following populations:
1. A normal population of size $N = 6000$ using $x^* \sim N(12, 35)$ and $y^* \sim N(13, 45)$ with varying correlation coefficients $\rho_{xy} = 0.1, 0.5, 0.9$.
2. A gamma population of size $N = 6000$ using $x^* \sim G(0.02, 0.006)$ and $y^* \sim G(0.2, 0.011)$ with varying correlation coefficients $\rho_{xy} = 0.1, 0.5, 0.9$.
The simulation proceeds through the following steps:
(i) Select a sample $s$ of size $n$ randomly from the population of size $N$.
(ii) Randomly remove $(n_d - r_d)$ sample units from the sample $s$ in each iteration.
(iii) Impute the removed units using each of the imputation methods under study.
(iv) Compute the required statistics.
(v) Repeat the preceding steps 15,000 times.
The estimators are compared using the empirical (simulated) mean square error (EMSE) and the theoretical mean square error (TMSE). The TMSE is calculated using the MSE expressions of the respective estimators obtained in "Adapted imputation methods" and "Proposed synthetic Searls type logarithmic imputation methods", while the EMSE is calculated using the following formula: … where $t^* = t_m, t_{r_j}, t_{l_j}, t_{s_j}$, $j = 1, 2$. The results of the consequent synthetic estimators for the normal and gamma populations are reported in Tables 1 and 2, respectively.
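The simulation steps above can be sketched in Python as follows. This is a simplified illustration under several loudly stated assumptions, not the paper's code: the population model, the three randomly assigned domains, and the sizes $n = 600$ and $r = 450$ are hypothetical; non-respondents are removed at random from the whole sample (MCAR) rather than domain by domain; 2,000 replications are used instead of 15,000 to keep the run short; the EMSE is taken as the average squared deviation of the estimate from the true domain mean $\bar{Y}_d$, which is the usual definition; and the two estimators shown (the respondent mean and a ratio-type form $\bar{y}_r \bar{X}_d / \bar{x}_r$) are generic stand-ins for the paper's estimators.

```python
import numpy as np

def simulate_emse(y, x, domain, d, n, r, estimator, reps=2000, seed=0):
    """Monte Carlo EMSE of `estimator` for the mean of domain d."""
    rng = np.random.default_rng(seed)
    N = len(y)
    in_d = domain == d
    y_bar_d, x_bar_d = y[in_d].mean(), x[in_d].mean()
    estimates = np.empty(reps)
    for k in range(reps):
        s = rng.choice(N, size=n, replace=False)      # (i) SRSWOR sample
        resp = rng.choice(s, size=r, replace=False)   # (ii) keep r respondents at random
        # (iii)-(iv) the estimator implied by the imputation method
        estimates[k] = estimator(y[resp], x[resp], x[s], x_bar_d)
    # (v) average squared error over all replications
    return np.mean((estimates - y_bar_d) ** 2)

def mean_estimator(y_r, x_r, x_n, x_bar_d):
    """Respondent mean (ignores the auxiliary variable)."""
    return y_r.mean()

def ratio_type_estimator(y_r, x_r, x_n, x_bar_d):
    """Assumed ratio-type form using the known domain mean of x."""
    return y_r.mean() * x_bar_d / x_r.mean()

# Hypothetical correlated population (not the Singh and Horn models).
rng = np.random.default_rng(1)
N, rho = 6000, 0.9
z1, z2 = rng.standard_normal(N), rng.standard_normal(N)
x = 50.0 + 5.0 * z1
y = 30.0 + 4.0 * (rho * z1 + np.sqrt(1.0 - rho**2) * z2)
domain = rng.integers(0, 3, size=N)   # three hypothetical domains

emse_mean = simulate_emse(y, x, domain, d=0, n=600, r=450, estimator=mean_estimator)
emse_ratio = simulate_emse(y, x, domain, d=0, n=600, r=450, estimator=ratio_type_estimator)
```

Passing the estimator as a function keeps the sampling and deletion steps identical across methods, so differences in EMSE reflect the estimators rather than the random draws.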

Key results of simulation study
We interpret the key results of the simulation study summarized in Tables 1 and 2 in the following points.
1. The outcomes drawn from the normal population for the consequent synthetic estimators are reported in Table 1. These outcomes show that:
(a) the EMSE and TMSE of the consequent synthetic ratio estimator $t_{r_1}$ under scheme I decrease as the correlation coefficient $\rho_{xy}$ increases from 0.1 to 0.9. The same tendency can be observed under scheme II for the estimator $t_{r_2}$.
(b) the EMSE and TMSE of the consequent synthetic logarithmic estimator $t_{l_1}$ under scheme I decrease as the correlation coefficient $\rho_{xy}$ increases from 0.1 to 0.9. The same tendency can be observed under scheme II for the consequent synthetic logarithmic estimator $t_{l_2}$.
(c) the EMSE and TMSE of the consequent synthetic Searls type logarithmic estimator $t_{s_1}$ under scheme I decrease as the correlation coefficient $\rho_{xy}$ increases from 0.1 to 0.9. The same tendency can be observed under scheme II for the consequent synthetic Searls type logarithmic estimator $t_{s_2}$.
(d) the EMSE and TMSE of the consequent synthetic ratio estimators, synthetic logarithmic estimators, and synthetic Searls type logarithmic estimators decrease as the number of responding units $r_d$ increases under schemes I and II in each domain.
(e) the EMSE and TMSE of the consequent synthetic ratio estimators, synthetic logarithmic estimators, and synthetic Searls type logarithmic estimators under both schemes in each domain are observed to be very close to each other.
(f) the consequent synthetic Searls type logarithmic estimators $t_{s_j}$, $j = 1, 2$, perform better than the adapted synthetic mean estimator $t_m$, synthetic ratio estimators $t_{r_j}$, and synthetic logarithmic estimators $t_{l_j}$ under schemes I and II.
2. The tendencies observed in Table 1 for the normal population can also be observed in the results of Table 2 for the gamma population.
3. Finally, from the results of Tables 1 and 2, the performance of the synthetic ratio estimators, synthetic logarithmic estimators, and synthetic Searls type logarithmic estimators is better under scheme II than under scheme I.

Real data application
Like most other Indian states, Uttar Pradesh is divided into several districts for the purposes of collecting taxes and conducting other administrative and agricultural work. Each district is further divided into a number of tehsils, and each tehsil is further divided into several blocks. Blocks are referred to as small domains in this study.
Since the area used for cultivation determines the yield of every crop, for the application using real data we take into account the problem of estimating agricultural output for various blocks in the Agra district of Uttar Pradesh. Six blocks in the Agra district are referred to as small domains. The amount of Bajra crop produced (in tonnes) for the agricultural season 2021-2022 is regarded as the study variable $y$, whilst the area under Bajra crop (in hectares) for the agricultural season 2021-2022 is regarded as the auxiliary variable $x$. Information regarding the blocks of Agra district is reported in Table 3, whereas, for easy reference, the parameters for each domain are shown in Table 4.
The results based on the real data for the synthetic estimators are reported in Table 5. Under both schemes, the proposed synthetic Searls type logarithmic imputation methods outperform the corresponding synthetic mean, ratio, and logarithmic type imputation methods. The MSE of the adapted and proposed synthetic estimators decreases as the number of responding units increases under both schemes in each domain. Moreover, the adapted synthetic ratio imputations, synthetic logarithmic imputations and the proposed synthetic Searls type logarithmic imputations perform better under scheme II than under scheme I.

Conclusions
In the current article, we have adapted synthetic mean, ratio, and logarithmic imputation methods and proposed synthetic Searls type logarithmic imputation methods for the estimation of the domain mean in the case of missing data under simple random sampling. The algebraic expressions of the MSE for the proposed imputation methods are derived to the first order of approximation. The efficiency conditions are obtained by comparing the MSE expressions of the proposed and adapted imputations. Furthermore, a comprehensive simulation is carried out using artificially generated normal (symmetric) and gamma (asymmetric) populations in order to assess the performance of the suggested imputation approaches. The EMSE and TMSE obtained in the simulation study show that, for varying values of the correlation coefficient as well as of the number of responding units in each domain, the suggested synthetic Searls type logarithmic imputation techniques outperform the adapted synthetic mean, ratio, and logarithmic imputation methods. Further, from the results of Tables 1 and 2, the EMSE and TMSE of the adapted and suggested estimators are observed to be very close to each other under both schemes in each domain. In addition, an actual data set based on the production of Bajra crops in the Agra district of Uttar Pradesh, India, is used to demonstrate the applicability of the suggested imputation approaches. The results for the real data also favour the suggested imputations over the adapted imputations. Therefore, when missing data are encountered under SAE, survey practitioners may be advised to employ the suggested imputation procedures.

Table 1. EMSE and TMSE of synthetic estimators under normal population.

Table 2. EMSE and TMSE of synthetic estimators under gamma population.

Table 3. Total production and area under Bajra crop in blocks of Agra district for agricultural season 2021-2022.

Table 4. Population parameters for different domains.