Estimation of Finite Population Variance Using Scrambled Responses in the Presence of Auxiliary Information

In this article, a new estimator for estimating the finite population variance of a sensitive variable based on scrambled responses collected using a randomization device is introduced. The estimator is then improved by using known auxiliary information. The estimators due to Das and Tripathi (1978: Sankhya) and Isaki (1983: JASA) are shown to be special cases of the proposed estimator. Numerical simulations are performed to study the magnitude of the gain in efficiency when using the estimator with auxiliary information with respect to the estimator based only on the scrambled responses. An idea to extend the present work from the SRSWOR design to more complex designs is also given.


Introduction
The collection of data through personal interview surveys on sensitive issues such as induced abortions, drug abuse, and family income poses a serious problem. Examples of sensitive questions include: (a) By how much did you underreport your income on your 2009 tax return? (b) How many abortions have you had? (c) How many children have you molested? (d) Do you use illegal drugs? Randomized response techniques are one way to get people to answer truthfully. Horvitz et al. (1967) and Greenberg et al. (1971) have extended Warner's (1965) model to the case where the responses to the sensitive question are quantitative rather than a simple "yes" or "no". The respondent selects, by means of a randomization device, one of two questions: one being the sensitive question, the other being unrelated. As pointed out in Eichhorn and Hayre (1983), several difficulties arise when using this unrelated question method. The main one is that of choosing the unrelated question. As Greenberg et al. (1971) note, it is essential that the mean and variance of the responses to the unrelated question be close to those of the sensitive question; otherwise, it will often be possible to recognize from the response which question was selected. However, the mean and variance of the responses to the sensitive question are unknown, making it difficult to choose a good unrelated question. A second difficulty is that in some cases the answers to the unrelated question may be more rounded or regular, making it possible to recognize which question was answered. For example, Greenberg et al. (1971) considered the sensitive question: about how much money did the head of this household earn last year? This was paired with the question: about how much money do you think the average head of a household of your size earns in a year?
An answer such as $26,350 is more likely to be in response to the unrelated question, while an answer such as $18,618 is almost certainly in response to the sensitive question. A third difficulty is that some people are hesitant to disclose their answer to the sensitive question even though they know that the interviewer cannot be sure that the sensitive question was selected. For example, some respondents may not want to reveal their income even though they know that the interviewer can only be 0.75 certain, say, that the figure given is the respondent's income. These difficulties are no longer present in the scrambled randomized response method introduced by Eichhorn and Hayre (1983). We summarize this method as follows: each respondent scrambles their response Y by multiplying it by a random variable S and then reveals only the scrambled result Z = YS to the interviewer. Thus, the scrambled randomized response model maintains the privacy of the respondents. The assumptions of the model are: (a) the variable S is called a scrambling variable and its distribution is assumed to be known; (b) the study variable Y and the scrambling variable S are independent; (c) the selection of the sample units and the randomization procedure are carried out independently; (d) the randomization procedure is performed independently on each individual; and (e) in particular, the quantities E(S) = θ and γ_a = E(S − θ)^a for a = 2, 3, 4 are assumed to be known. Diana and Perri (2009, 2010, 2011, 2012) and Perri (2008) have rightly pointed out that in direct question survey techniques, when dealing with nonsensitive questions, it is very common to use auxiliary information to improve estimation strategies. Only a very limited effort has been made to use auxiliary information to improve the estimators of sensitive variables, as one can see by referring to Singh et al. (1996), Strachan et al. (1998), Tracy and Singh (1999), Ryu et al. (2005/2006), Mahajan (2005/2006, 2007), Son et al. (2008), Sidhu and Bansal (2008), Zaizai et al. (2008), and Singh and Kim (2011). An extensive review of the literature on randomized response sampling can be found in a recent monograph by Chaudhuri (2011).
To our knowledge, no one has made any attempt to study an estimator of the finite population variance of the study variable using scrambled responses and making use of an auxiliary variable to improve the estimator.
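The scrambling step summarized in the Introduction can be illustrated with a minimal Python sketch. It assumes, purely for illustration, a Beta(6.5, 0.5) scrambling variable and gamma-distributed sensitive values; the function names `scramble` and `estimate_mean` are hypothetical and not taken from the article. The interviewer sees only Z = YS, yet the mean of Y remains recoverable because E(Z) = θ E(Y) under assumption (b).

```python
import random


def scramble(y_values, alpha, beta, rng):
    """Return scrambled responses Z_i = Y_i * S_i with S_i ~ Beta(alpha, beta)."""
    return [y * rng.betavariate(alpha, beta) for y in y_values]


def estimate_mean(z_values, theta):
    """Rescale the mean of the scrambled responses by theta = E(S)."""
    return sum(z_values) / (len(z_values) * theta)


rng = random.Random(2024)
alpha, beta = 6.5, 0.5            # assumed scrambling distribution, known to the analyst
theta = alpha / (alpha + beta)    # E(S) for a Beta(alpha, beta) variable

# Hypothetical sensitive values; gammavariate takes (shape, scale).
true_y = [rng.gammavariate(2.2, 3.5) for _ in range(20000)]
z = scramble(true_y, alpha, beta, rng)

# The two printed numbers should be close to each other.
print(sum(true_y) / len(true_y), estimate_mean(z, theta))
```

Only `z`, `theta`, and the higher moments of S are needed at the estimation stage; the individual values in `true_y` never leave the respondent.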

Notation
Assume that a simple random sample without replacement (SRSWOR) of size $n$ is drawn from the given population of $N$ units. Let the values of the sensitive study variable, $Y$, and the auxiliary variable, $X$, for the $i$th unit ($i = 1, 2, \ldots, N$) of the population be denoted by $Y_i$ and $X_i$. We define a few parameters of these variables as follows. Let $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$ denote the population mean of the sensitive study variable $Y$ and let $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$ be the population mean of the auxiliary variable $X$. In this article, we consider the problem of estimating the finite population variance (or population mean square error) of the sensitive study variable $Y$, defined by $S_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2$. Similarly, let $S_x^2 = \frac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})^2$ denote the finite population variance of the auxiliary variable $X$. Let the higher-order central moments of the study variable $Y$ and the auxiliary variable $X$ be denoted by $\mu_{ab}$, of order $a$ in $Y$ and order $b$ in $X$, for $a, b = 0, 1, 2, 3, 4$, etc.
Let $Z_i = Y_i S_i$ denote the scrambled response from the $i$th sampled unit for $i = 1, 2, \ldots, n$. Here, we differ from the Das and Tripathi (1978) and Isaki (1983) estimators in that, instead of directly observing responses $Y_i$ on the sensitive study variable, we observe the scrambled responses $Z_i = Y_i S_i$.
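As an illustration of the notation, the following sketch computes joint central moments of order (a, b) for a small population. The helper name `central_moment` and the toy data are hypothetical, and the divisor N − 1 is an assumption, chosen so that the (2, 0) and (0, 2) moments coincide with the population mean squares of Y and X.

```python
def central_moment(y, x, a, b):
    """Return sum((y_i - ybar)^a * (x_i - xbar)^b) / (N - 1).

    The divisor N - 1 is an assumption made for this sketch, so that
    orders (2, 0) and (0, 2) give the population mean squares.
    """
    n_pop = len(y)
    y_bar = sum(y) / n_pop
    x_bar = sum(x) / n_pop
    total = sum((yi - y_bar) ** a * (xi - x_bar) ** b for yi, xi in zip(y, x))
    return total / (n_pop - 1)


# Hypothetical toy population of N = 4 units.
y_pop = [2.0, 4.0, 6.0, 8.0]
x_pop = [1.0, 3.0, 2.0, 6.0]

s2_y = central_moment(y_pop, x_pop, 2, 0)    # population mean square of Y
s2_x = central_moment(y_pop, x_pop, 0, 2)    # population mean square of X
mu_11 = central_moment(y_pop, x_pop, 1, 1)   # covariance-type moment
print(s2_y, s2_x, mu_11)
```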

Naive Estimator of the Finite Population Variance
Following Eichhorn and Hayre (1983), the mean of the response, $\bar{Y}$, can be estimated from a sample of scrambled $Z$ values by using knowledge of the distribution of the scrambling variable $S$. The sample variance of the scrambled responses is given by: Let $E_R$ denote the expected value over the randomization device. Taking the expected value $E_R$ on both sides above, we get: An unbiased estimator of the finite population variance is given by
Theorem 2.1. The variance of the estimator $s_y^{*2}$ is given by
Proof. See Appendix A.
Theorem 2.2. An estimator of the variance of the estimator $s_y^{*2}$ is given by: where, using the method of moments, $\hat{\mu}_{40}$ is a rescaled estimator of $\mu_{40}$ based on scrambled responses, and $\hat{Y}_i^t$, for $t = 1, 2, 3, 4$, is a rescaled estimator of $Y_i^t$ based on scrambled responses. For example, we suggest using $\hat{Y}_i^t = Z_i^t / E(S^t)$. The expected values $E(S^k)$ for $k = 1, 2, 3, 4$ can be obtained from Singh and Chen (2009) by using the concept of higher-order moments of scrambling variables.
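As a hedged illustration, one form of the naive estimator consistent with assumptions (a) to (e) can be derived and checked numerically. Because $E_R(Z_i^2) = (\theta^2 + \gamma_2) Y_i^2$ and $E_R(Z_i Z_j) = \theta^2 Y_i Y_j$ for $i \neq j$, the sample variance $s_z^2$ of the scrambled responses satisfies $E_R(s_z^2) = \theta^2 s_y^2 + (\gamma_2 / n) \sum_{i=1}^{n} Y_i^2$, so that $s_y^{*2} = \theta^{-2}\,[\,s_z^2 - \gamma_2 \sum_{i=1}^{n} Z_i^2 / \{n(\theta^2 + \gamma_2)\}\,]$ has $E_R(s_y^{*2}) = s_y^2$. This arrangement is a reconstruction and may differ from the authors' display; the function names, the Beta(6.5, 0.5) scrambling distribution, and the gamma sample below are assumptions of the sketch.

```python
import random


def sample_variance(values):
    """Usual sample variance with divisor n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def naive_variance_estimator(z, theta, gamma2):
    """Reconstructed naive estimator: remove scrambling noise from s_z^2, then rescale."""
    n = len(z)
    correction = gamma2 * sum(v * v for v in z) / (n * (theta ** 2 + gamma2))
    return (sample_variance(z) - correction) / theta ** 2


rng = random.Random(7)
alpha, beta = 6.5, 0.5                       # assumed Beta scrambling distribution
theta = alpha / (alpha + beta)               # E(S)
gamma2 = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))  # Var(S)
y_sample = [rng.gammavariate(2.2, 3.5) for _ in range(30)]  # hypothetical fixed sample

# Average the estimator over repeated scrambling of the same sample (this mimics E_R).
reps = 5000
total = 0.0
for _ in range(reps):
    z = [y * rng.betavariate(alpha, beta) for y in y_sample]
    total += naive_variance_estimator(z, theta, gamma2)
average_estimate = total / reps

# The two printed numbers should be close to each other.
print(sample_variance(y_sample), average_estimate)
```

Unbiasedness for the population quantity then follows because the sample variance of the unscrambled values is design-unbiased under SRSWOR.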

Difference Type Estimator of the Finite Population Variance
Following Das and Tripathi (1978) and Isaki (1983), we define a new difference-type estimator of the finite population variance as: where $B$ is a constant to be determined such that the variance of the estimator $\hat{\sigma}_{\nu}^{*2}$ is minimum. The proposed estimator $\hat{\sigma}_{\nu}^{*2}$ is an unbiased estimator of the finite population variance $S_y^2$ for any fixed value of $B$ and so, in particular, it is unbiased and has minimum variance for the optimal value of $B$ given by: The minimum variance of the proposed estimator $\hat{\sigma}_{\nu}^{*2}$ then becomes:
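The optimization described in this section can be sketched in generic terms. The difference-type form in the first line is an assumption in the spirit of Das and Tripathi (1978) and Isaki (1983), with $s_x^2$ the sample variance of the auxiliary variable and $S_x^2$ its known population counterpart; the authors' explicit expressions in terms of population moments are not reproduced here.

```latex
\begin{align*}
\hat{\sigma}_{\nu}^{*2} &= s_y^{*2} + B\,(S_x^2 - s_x^2)
  && \text{(assumed difference-type form)} \\
E(\hat{\sigma}_{\nu}^{*2}) &= S_y^2 + B\,(S_x^2 - S_x^2) = S_y^2
  && \text{(unbiased for any fixed } B\text{)} \\
V(\hat{\sigma}_{\nu}^{*2}) &= V(s_y^{*2}) - 2B\,\mathrm{Cov}(s_y^{*2}, s_x^2) + B^2\,V(s_x^2) \\
B_{\mathrm{opt}} &= \frac{\mathrm{Cov}(s_y^{*2}, s_x^2)}{V(s_x^2)}
  && \text{(set } \partial V / \partial B = 0\text{)} \\
\min_B V(\hat{\sigma}_{\nu}^{*2}) &= V(s_y^{*2}) - \frac{\{\mathrm{Cov}(s_y^{*2}, s_x^2)\}^2}{V(s_x^2)}
\end{align*}
```

Since the scrambling is independent of the sampling design and of $X$, the covariance term depends on the scrambled responses only through $E_R(s_y^{*2}) = s_y^2$.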

Regression Type Estimator
In practice the value of $B$ is unknown, so the proposed estimator $\hat{\sigma}_{\nu}^{*2}$ is difficult to implement. Thus, we suggest a linear regression type estimator $\hat{\sigma}_{lr}^{*2}$ given by: The value of $\hat{B}$ depends upon the scrambled responses; thus, it will increase the variance of the linear regression estimator $\hat{\sigma}_{lr}^{*2}$ compared to $\hat{\sigma}_{\nu}^{*2}$. It will also make it difficult to find the value of $V_R(\hat{\sigma}_{lr}^{*2})$, which would be equal to: and in fact it makes it difficult to find the exact variance $V(\hat{\sigma}_{lr}^{*2})$ of the linear regression type estimator. Thus, in Section 6, we compare the linear regression type estimator $\hat{\sigma}_{lr}^{*2}$ to the naive unbiased estimator $s_y^{*2}$ through a simulation study. Remarks. If $\hat{B} = s_y^{*2} / s_x^2$, then the estimator $\hat{\sigma}_{\nu}^{*2}$ becomes the ratio type estimator. For a choice of $\hat{B}$ involving a constant $\alpha$, the estimator $\hat{\sigma}_{\nu}^{*2}$ becomes the Das and Tripathi (1978) type estimator given by:
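A regression-type estimator of this kind can be sketched as follows. Both the form $s_y^{*2} + \hat{B}(S_x^2 - s_x^2)$ and the plug-in rule for $\hat{B}$ are assumptions: the sketch uses the first-order coefficient $(m_{22} - m_{20} m_{02}) / (m_{04} - m_{02}^2)$ with sample moments built from rescaled responses $Z_i/\theta$ and $Z_i^2/(\theta^2 + \gamma_2)$, which need not be the authors' choice. All function names are hypothetical.

```python
def sample_variance(values):
    """Usual sample variance with divisor n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def naive_variance_estimator(z, theta, gamma2):
    """Reconstructed naive estimator based on scrambled responses only."""
    n = len(z)
    correction = gamma2 * sum(v * v for v in z) / (n * (theta ** 2 + gamma2))
    return (sample_variance(z) - correction) / theta ** 2


def plug_in_b(z, x, theta, gamma2):
    """Hypothetical plug-in coefficient from rescaled sample moments."""
    n = len(z)
    second_moment_s = theta ** 2 + gamma2            # E(S^2)
    y1 = [v / theta for v in z]                      # rescaled estimates of Y_i
    y2 = [v * v / second_moment_s for v in z]        # rescaled estimates of Y_i^2
    y_bar = sum(y1) / n
    x_bar = sum(x) / n
    dx2 = [(xi - x_bar) ** 2 for xi in x]
    # Estimates of (Y_i - ybar)^2 assembled from the rescaled pieces.
    dy2 = [a - 2.0 * b * y_bar + y_bar ** 2 for a, b in zip(y2, y1)]
    m20 = sum(dy2) / n
    m02 = sum(dx2) / n
    m04 = sum(d * d for d in dx2) / n
    m22 = sum(p * q for p, q in zip(dy2, dx2)) / n
    return (m22 - m20 * m02) / (m04 - m02 ** 2)


def regression_type_estimator(z, x, pop_var_x, theta, gamma2):
    """Naive estimator adjusted by b_hat * (S_x^2 - s_x^2)."""
    b_hat = plug_in_b(z, x, theta, gamma2)
    adjustment = b_hat * (pop_var_x - sample_variance(x))
    return naive_variance_estimator(z, theta, gamma2) + adjustment


# Sanity check with no scrambling (theta = 1, gamma2 = 0) and Y identical to X:
# the coefficient is 1 and the estimate collapses to the known S_x^2.
x_sample = [1.0, 3.0, 2.0, 6.0, 4.0]
print(plug_in_b(x_sample, x_sample, 1.0, 0.0))
print(regression_type_estimator(x_sample, x_sample, 5.0, 1.0, 0.0))
```

Replacing `b_hat` by the ratio of the naive estimator to the sample variance of X gives the ratio-type special case mentioned in the Remarks.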

Relative Efficiency of the Difference Estimator
The proposed difference type estimator $\hat{\sigma}_{\nu}^{*2}$ will be more efficient than the naive estimator $s_y^{*2}$ if: which is always true. Thus, the proposed estimator $\hat{\sigma}_{\nu}^{*2}$ is always more efficient than the naive estimator $s_y^{*2}$.
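A short worked step shows why the condition always holds. It assumes the difference-type form $\hat{\sigma}_{\nu}^{*2} = s_y^{*2} + B\,(S_x^2 - s_x^2)$ evaluated at its variance-minimizing $B$, which is a reconstruction rather than a formula quoted from the article.

```latex
\begin{align*}
V(s_y^{*2}) - \min_B V(\hat{\sigma}_{\nu}^{*2})
  = \frac{\{\mathrm{Cov}(s_y^{*2}, s_x^2)\}^2}{V(s_x^2)} \;\ge\; 0
\end{align*}
```

The gain is a squared covariance divided by a variance, so it cannot be negative; it is zero only when $s_y^{*2}$ and $s_x^2$ are uncorrelated, in which case the two estimators are equally efficient.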

Simulation Study for the Regression Estimator
In this section, we consider the case where the sensitive variable $Y$ and the auxiliary variable $X$ are related to each other by the linear model defined as: where $e_i \sim N(0, 1)$. The auxiliary variable $X_i \sim G(a, b)$ is generated from the gamma distribution with parameters $a = 2.2$ and $b = 3.5$. We generate a population of size $N = 3{,}000$ units from the model for given values of $g$ and $R$. Then, from the given population of size $N = 3{,}000$, we select an SRSWOR sample of size $n$, and both the study variable $Y_i$ and the auxiliary variable $X_i$ are observed for $i = 1, 2, \ldots, n$. Next, we generate $n$ values of a scrambling variable $S_i$, $i = 1, 2, \ldots, n$, from the beta distribution with a given choice of $\alpha$ and $\beta$. In other words, we assumed the scrambling variable $S \sim B(\alpha, \beta)$. Then, we obtained the scrambled responses on the study variable as $Z_i = Y_i S_i$, $i = 1, 2, \ldots, n$, from a given sample. We repeated this process $T = 5{,}000$ times. By using information from the ordered pairs $(X_i, Z_i)$, $i = 1, 2, \ldots, n$, of the $t$th sample, $t = 1, 2, \ldots, T$, we computed the three estimators $\hat{\theta}_{0|t} = s_{y|t}^{*2}$, $\hat{\theta}_{1|t} = \hat{\sigma}_{\nu|t}^{*2}$, and $\hat{\theta}_{2|t} = \hat{\sigma}_{lr|t}^{*2}$. Note that we used the first four moments of the scrambling variable $S$, given by: Theoretically, the estimators $\hat{\theta}_{0|t}$ and $\hat{\theta}_{1|t}$ are unbiased estimators of the parameter of interest $S_y^2$, while the estimator $\hat{\theta}_{2|t}$ is a biased estimator. In order to see the performance of the proposed estimators, we computed the simulated relative bias in each of the three estimators as follows: We also computed the relative efficiencies of the linear regression and the difference type estimators with respect to the usual estimator as: The results obtained are presented in Table 1. The FORTRAN codes used in this simulation are given in Appendix B.
In the simulation study, we fixed $\alpha = 6.5$, $\beta = 0.5$, and $g = 0.5$; two values of $R$, 0.5 and 1.5, were considered. We also used different values of the sample size $n$, ranging from 10 to 100 in steps of 5 units. Following Cochran (1977), an absolute percent relative bias of less than 10% is regarded as acceptable. The values of $RB(\hat{\theta}_0)$, $RB(\hat{\theta}_1)$, and $RB(\hat{\theta}_2)$ fall within this range in all the cases considered in the simulation study, and so the relative bias is considered negligible. The value of RE(0, 2) is lower than the value of RE(0, 1) in all situations. For $R = 0.5$, the average RE(0, 1) value is 175.51% with a standard deviation of 1.57%, a minimum of 171.70%, a maximum of 178.10%, and a median of 175.70%. For $R = 1.5$, the average RE(0, 1) value is 476.76% with a standard deviation of 4.40%, a minimum of 465.50%, a maximum of 485.10%, and a median of 476.10%. For $R = 0.5$, the average RE(0, 2) value is 157.16% with a standard deviation of 16.35%, a minimum of 105.70%, a maximum of 170.50%, and a median of 163.90%. For $R = 1.5$, the average RE(0, 2) value is 414.15% with a standard deviation of 28.68%, a minimum of 328.40%, a maximum of 436.70%, and a median of 425.90%. Fig. 1 shows that the simulated percent relative efficiency of the difference and regression estimators with respect to the usual unbiased estimator is almost independent of the sample size. Fig. 2 shows that the percent relative biases of the proposed estimators lie between −2% and +2% for the two values $R = 0.5$ and $R = 1.5$ and for sample sizes between 10 and 100.
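The simulation loop described in this section can be sketched in Python (the article's own code is in FORTRAN and is not reproduced here). Several ingredients are assumptions: the model is taken to be $Y_i = R X_i + e_i X_i^g$, the gamma parameters are read as shape and scale, the relative bias is computed as $100\,(\bar{\hat{\theta}}_j - S_y^2)/S_y^2$, the relative efficiency as $100 \sum_t (\hat{\theta}_{0|t} - S_y^2)^2 / \sum_t (\hat{\theta}_{j|t} - S_y^2)^2$, and $B$ is set to a first-order value from population moments. Only the naive and difference estimators are included, with a reduced number of replicates.

```python
import random


def var_n1(values):
    """Variance with divisor len(values) - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def beta_raw_moment(alpha, beta, k):
    """E(S^k) for S ~ Beta(alpha, beta): product of (alpha + r) / (alpha + beta + r)."""
    moment = 1.0
    for r in range(k):
        moment *= (alpha + r) / (alpha + beta + r)
    return moment


def naive(z, theta, gamma2):
    """Reconstructed naive estimator based on scrambled responses."""
    n = len(z)
    correction = gamma2 * sum(v * v for v in z) / (n * (theta ** 2 + gamma2))
    return (var_n1(z) - correction) / theta ** 2


rng = random.Random(11)
N, n, T = 3000, 30, 2000
R, g = 1.5, 0.5
alpha, beta = 6.5, 0.5
theta = beta_raw_moment(alpha, beta, 1)
gamma2 = beta_raw_moment(alpha, beta, 2) - theta ** 2

# Assumed model (not confirmed by the source): Y_i = R * X_i + e_i * X_i ** g.
x_pop = [rng.gammavariate(2.2, 3.5) for _ in range(N)]
y_pop = [R * x + rng.gauss(0.0, 1.0) * x ** g for x in x_pop]
S2_y, S2_x = var_n1(y_pop), var_n1(x_pop)

# First-order value of B from population moments (an approximation to the optimum).
y_bar, x_bar = sum(y_pop) / N, sum(x_pop) / N
dy2 = [(y - y_bar) ** 2 for y in y_pop]
dx2 = [(x - x_bar) ** 2 for x in x_pop]
m20, m02 = sum(dy2) / N, sum(dx2) / N
m22 = sum(p * q for p, q in zip(dy2, dx2)) / N
m04 = sum(q * q for q in dx2) / N
B = (m22 - m20 * m02) / (m04 - m02 ** 2)

sum0 = sum1 = sse0 = sse1 = 0.0
for _ in range(T):
    idx = rng.sample(range(N), n)                              # SRSWOR
    z = [y_pop[i] * rng.betavariate(alpha, beta) for i in idx]  # scrambled responses
    est0 = naive(z, theta, gamma2)
    est1 = est0 + B * (S2_x - var_n1([x_pop[i] for i in idx]))
    sum0 += est0
    sum1 += est1
    sse0 += (est0 - S2_y) ** 2
    sse1 += (est1 - S2_y) ** 2

rb0 = 100.0 * (sum0 / T - S2_y) / S2_y    # percent relative bias, naive
rb1 = 100.0 * (sum1 / T - S2_y) / S2_y    # percent relative bias, difference
re01 = 100.0 * sse0 / sse1                # percent relative efficiency RE(0, 1)
print(round(rb0, 2), round(rb1, 2), round(re01, 1))
```

Because the model and the value of `B` are assumptions, the printed figures are not expected to reproduce Table 1; the sketch only shows how the relative bias and relative efficiency are accumulated across replicates.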
In the next section, we consider a simulation study based on a real dataset given in the book by Rosner (2006).

Application Based on a Real DataSet
We used the dataset FEV.DAT, available on the CD that accompanies the text by Rosner (2006), which contains data on 654 children from the Childhood Respiratory Disease Study conducted in Boston. Among the variables are age and height. While age is not usually a sensitive characteristic, for illustration purposes we treat it as sensitive and take height as the nonsensitive characteristic. It should be noted that any variable could be sensitive depending on the situation being considered, as reported in Singh et al. (2008). Here, we consider the problem of estimating the variance of age in the population, utilizing height at the estimation stage. Although a child's exact date of birth is available in hospital records, parents sometimes do not want to disclose it in the public domain, because once the child grows up his or her entire credit history will be tied to that exact date of birth. It is also common for girls to prefer to hide their age, because they feel that revealing it tells others they are no longer young. The rest of the simulation procedure was kept the same, except that the synthetic data considered in Section 6 were replaced by the real dataset and the sample size ranged from 10 to 70. The results obtained are presented in Table 2. For $n = 10$ and 15, the linear regression estimator shows a percent relative efficiency of less than 100%, but for sample sizes greater than 20 both the linear regression estimator and the difference estimator perform very well in comparison to the proposed naive estimator. The percent relative bias values remain negligible. For $n > 20$, the average percent RE(0, 1) value is 121.50% with a standard deviation of 0.86%; the minimum value is 119.50% and the maximum value is 122.90%, with a median value of 121.70%.
In the same way, for n > 20, the average percent RE(0, 2) value is 114.70% with a standard deviation of 3.70%; the minimum value is 106.40% and maximum value is 118.30% with median value of 115.20%.

Complex Survey Design
Let a sample $s$ of size $n$ be selected from a population by using an arbitrary sampling design $P(s)$ with inclusion probabilities $\pi_i$ for the $i$th unit and $\pi_{ij}$ for the $i$th and $j$th units, $i \neq j = 1, 2, \ldots, N$. Assuming the $\pi_{ij}$'s are positive for all $i \neq j$, an unbiased estimator of the population variance is given by: A Das and Tripathi (1978) and Isaki (1983) type variance estimator for complex survey designs can be obtained as: where the auxiliary-variable part involves terms of the form $X_i X_j / \pi_{ij}$ and $B^*$ is a suitably chosen constant.
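One way to build such a design-unbiased estimator from scrambled responses is sketched below. It is an assumption-based construction, not the authors' display: it uses the identity $S_y^2 = N^{-1}\sum_i Y_i^2 - \{N(N-1)\}^{-1}\sum_{i \neq j} Y_i Y_j$, replaces $Y_i^2$ by $Z_i^2/(\theta^2 + \gamma_2)$ and $Y_i Y_j$ by $Z_i Z_j/\theta^2$, and weights by the inverse inclusion probabilities. The function names are hypothetical. Under SRSWOR it coincides with the reconstructed naive estimator, which the example checks.

```python
from itertools import permutations


def general_design_estimator(z, pi, pi_joint, N, theta, gamma2):
    """Sketch of a design-unbiased variance estimator from scrambled responses.

    z: scrambled responses of the sampled units
    pi[i]: first-order inclusion probability of sampled unit i
    pi_joint[i][j]: joint inclusion probability of sampled units i and j
    """
    n = len(z)
    squares = sum(z[i] ** 2 / ((theta ** 2 + gamma2) * pi[i]) for i in range(n))
    cross = sum(z[i] * z[j] / (theta ** 2 * pi_joint[i][j])
                for i, j in permutations(range(n), 2))   # all ordered pairs i != j
    return squares / N - cross / (N * (N - 1))


def naive_srswor(z, theta, gamma2):
    """Reconstructed naive estimator for SRSWOR, used here only as a cross-check."""
    n = len(z)
    mean = sum(z) / n
    s2_z = sum((v - mean) ** 2 for v in z) / (n - 1)
    return (s2_z - gamma2 * sum(v * v for v in z) / (n * (theta ** 2 + gamma2))) / theta ** 2


# Hypothetical scrambled sample and SRSWOR inclusion probabilities.
N, z = 50, [3.1, 4.7, 2.2, 8.0, 5.5]
n = len(z)
theta, gamma2 = 0.9, 0.01
pi = [n / N] * n
pi_joint = [[n * (n - 1) / (N * (N - 1))] * n for _ in range(n)]

general = general_design_estimator(z, pi, pi_joint, N, theta, gamma2)
print(general, naive_srswor(z, theta, gamma2))
```

An auxiliary-variable counterpart is obtained by applying the same weights to $X_i^2$ and $X_i X_j$ directly, since $X$ is observed without scrambling.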

Conclusion
In this article, three estimators of the finite population variance based on scrambled responses and in the presence of an auxiliary variable are investigated through an extensive simulation study. We conclude that both the difference and the linear regression type estimators of the finite population variance, based on scrambled responses, lead to quite satisfactory results.
We acknowledge that a simulation study is a must when carrying out a real survey, both for making a choice of a scrambling variable and for determining a minimum sample size.

Appendix A
Proof of Theorem 2.1: Let $E_P$ and $V_P$ denote the expected value and variance, respectively, over all possible samples. Similarly, let $E_R$ and $V_R$ denote the expected value and variance, respectively, over the randomization device. Then the variance of $s_y^{*2}$ over the randomization device is given by: The expected value of $V_R(s_y^{*2})$ over the sampling design $P(s)$ is given by: Now, the theorem follows from the fact that:
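The structure of this proof is the usual two-stage variance decomposition, sketched here in generic form. The second equality assumes $E_R(s_y^{*2}) = s_y^2$, the sample variance of the unscrambled values, as implied by the unbiasedness of the naive estimator over the randomization device; the explicit expressions for each term are not reconstructed.

```latex
\begin{align*}
V(s_y^{*2}) &= E_P\!\left[V_R(s_y^{*2})\right] + V_P\!\left[E_R(s_y^{*2})\right] \\
            &= E_P\!\left[V_R(s_y^{*2})\right] + V_P(s_y^2)
\end{align*}
```

The first term is the extra variability introduced by scrambling, and the second is the variance of the ordinary sample variance under SRSWOR.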