Chaudhuri and Mukerjee ORRT for two sensitive characteristics and their overlap

In this paper, we extend the optional randomized response technique (ORRT) developed by Chaudhuri and Mukerjee [Optionally randomized response techniques. Bull. Calcutta Statist. Assoc. 1985;34:225–230; Randomized response: theory and techniques. New York: Marcel Dekker, Inc.; 1988] to the problem of estimating the proportions of two sensitive characteristics and their overlap. Lee, Sedory and Singh [Estimating at least seven measures of qualitative variables from a single sample using randomized response technique. Stat Prob Lett. 2013;83(1):399–409; Estimation of odds ratio, attributable risk, relative risk, correlation coefficient and other parameters using randomized response techniques. Behaviormetrika. 2021;48:371–392] have shown that their crossed model performs better than their simple model from an efficiency point of view. Here we investigate a further improvement of the crossed model along the lines of Chaudhuri and Mukerjee [Optionally randomized response techniques. Bull. Calcutta Statist. Assoc. 1985;34:225–230; Randomized response: theory and techniques. New York: Marcel Dekker, Inc.; 1988]. New unbiased estimators are proposed, their variance expressions are derived, and estimators of the variances are suggested. Lastly, we carry out a simulation study to investigate the behaviour of the proposed estimators relative to their competitors.


Introduction
Warner [5] was the first to address the problem of estimating the prevalence of a sensitive characteristic, by introducing the idea of a randomized response survey technique. He introduced a randomization device as a shield between the interviewer and the interviewee in face-to-face interview surveys. A randomization device could be a physical instrument such as a coin, a deck of cards or a spinner, or it could be a digital device such as a calculator or a computer. The privacy that participants gain by masking their responses through a randomization device helps to produce more accurate estimates of the parameters of interest to a survey statistician. Warner considered the problem of estimating the proportion of people in a population possessing one sensitive characteristic, say membership in a group A. Chaudhuri and Mukerjee [1,2] were the first to suggest that it may not be necessary to use a randomization device for all respondents selected in the sample. Some people selected in the sample may be willing to answer truthfully whether or not they are members of the sensitive group A. However, this may not hold if the question is truly sensitive, for example if someone were interested in estimating the proportion of political leaders involved in murder cases. Nevertheless, it may work if respondents were asked about their membership in a political party. Chaudhuri and Mukerjee [1,2] named their method an optional randomized response technique (ORRT). Mangat [6] and Mangat and Singh [7] introduced a new type of ORRT in which respondents are always protected and are free either to report directly on membership in a sensitive group A or to use the pioneering Warner [5] device.
Thus, there are two types of ORRT: one due to Chaudhuri and Mukerjee [1,2], which we will call CM-ORRT, and the other due to Mangat and Singh [7], which we will call MS-ORRT. Each has benefits as well as limitations relative to the other. The main limitation of CM-ORRT is that it cannot be used if the variable of interest is universally sensitive, as no one will choose to respond directly to the question of membership. The benefit of the CM-ORRT is that, in the case of a partially sensitive variable, it yields an estimator of the proportion that is more efficient than one obtained using randomized response techniques alone. Another benefit is that it helps to estimate the proportion of those who think the characteristic of interest is not particularly sensitive and so can be disclosed to an interviewer. Such characteristics are what we refer to as partially sensitive variables. In contrast, MS-ORRT can be used for collecting information on truly sensitive variables, but its limitation is that it remains unknown whether a respondent is responding directly or through a device; nevertheless, a point estimate of the proportion can still be obtained. Under certain assumptions MS-ORRT could also lead to an estimator that is more efficient than one based solely on the use of a randomization device. It may be worth pointing out that although the MS-ORRT cannot be made more efficient than the CM-ORRT, it provides a stronger shield of privacy protection for respondents than the CM-ORRT. Another limitation of MS-ORRT is that the interviewer and the interviewee cannot face each other at the time of interview. In MS-ORRT, the interviewer must not know which option is taken by the interviewee, so any use (or not) of the randomization device must be hidden. This can present some logistical difficulties in a face-to-face interview; however, it would not be an issue in mail surveys.
Researchers have paid attention to both types of ORRT with a view to improving them further. There are also many criticisms of ORRTs, and a few researchers have extended these techniques in several directions without much motivation. We will briefly discuss extensions of both types of ORRT along with minor criticisms. Rao and Rao [8] pointed out that stratification is known to have advantages. Researchers in the field of randomized response (RR) have routinely extended existing as well as new techniques of randomized response to stratified sampling, allocation of sample sizes, etc. Rao and Rao [8] mentioned that some of these extensions are of theoretical interest only and are not practical. To our knowledge, Kim and Warde [9] were the first to consider randomized response techniques within stratified random sampling. Likewise, complex survey sampling designs are well known to often have advantages in the field of survey sampling. Many researchers in the field of RR have extended randomized response sampling techniques to complex survey designs, though again, these are of more theoretical than practical interest. Chaudhuri and Saha [10] consolidate some of the scattered ideas of optional randomized response techniques used with unequal probability sampling schemes. Christofides [11] pointed out that the case of simple random sampling with replacement (SRSWR) is mathematically easier to present and understand, while Chaudhuri [12,13] has made the case for using complex designs in the field of randomized response sampling. It is worth pointing out that any gains are due to the use of complex designs, such as stratified sampling, rather than the use of a randomization device. In general, complex designs are found to be useful whenever there is high correlation between the selection probabilities and the variable of interest. In randomized response, the variable of interest is a sensitive variable and one would rarely know whether such high correlation exists. Adhikary [14] considered the problem of estimating the variance in randomized response sampling using much more difficult mathematics of complex designs than researchers had used for stratified sampling, without much justification. Bouza-Herrera [15] studied the behaviour of some scrambled randomized response models under ranked set sampling; however, it seems exceedingly difficult to rank people based on one's judgment when a sensitive variable is involved.
Gupta, Gupta and Singh [16] introduced the idea of estimating a second parameter when using the MS-ORRT, in particular, estimating the sensitivity level (W) of a question on the survey. Mukerjee [17] criticized the fact that the two estimators were not both unbiased. He failed to appreciate that two parameters were being estimated from a single sample and seems to have ignored the fact that biased estimators are routinely used; e.g. from a simple random sample from a normal distribution one typically estimates the mean (unbiased), the standard deviation (biased), and the coefficient of variation (biased). Also, while one should be grateful to Gupta et al. [16] for promoting the estimator of W, one should also recognize that knowing the sensitivity level would be more useful before the survey is run rather than after. Mukerjee [17] claimed that the approach of Huang [18] of taking two samples to estimate two parameters is correct, but one can argue that Huang's approach is a simple mathematical exercise along the lines of Greenberg et al. [19].
Along the lines of Rao and Rao [8], one may argue that the extension by Arnab [20] of the CM-ORRT to complex designs is also of theoretical interest, but may not be practical. Arnab and Rueda [21] and Arnab [22] extended the MS-ORRT to qualitative/quantitative variables using complex designs, which is again of questionable practicability. Shaw [23] used item counting techniques for estimating a population proportion, using an unequal probability sampling design which makes use of known auxiliary information. Along with considering extensions to complex designs, Arnab and Rueda [21] and Arnab [22] have opinions similar to those of Mukerjee [17] on the problem of estimating the sensitivity level (W) in the Gupta et al. [16] model. Chaudhuri [24] and Mukerjee [17] also pointed out that it is not easy to find the mean squared error of the estimator of W; however, both ignored the possibility of using bootstrapping or jackknifing, if needed. It is worth pointing out that several researchers, such as Gupta et al. [25,26] and Kalucha et al. [27], have extended results of ORRT to develop ratio and regression type estimators, but again these may be of more theoretical than practical interest per Rao and Rao [8]. To our knowledge, Singh [28] first considered the problem of estimating the ratio (or product) of two sensitive variables without making use of auxiliary information, which could be of further interest. Shah, Hussain and Cheema [29] considered an estimator from the class of CM-ORRTs; they end up with a conditional variance of their estimator, which is not useful for computing relative efficiency. Chhabra, Das and Mehta [30] introduced a multi-stage optional unrelated question RRT model for SRS rather than complex designs; Fox [31] also seems inclined to use SRS instead of complex designs. The Arnab [22] method for SRS can be seen as a special case of the Odumade, Arnab and Singh [32] model. A review of optional randomized response techniques can be found in Arnab and Rueda [21]. Thus, to avoid such controversy on the use of various designs, we will consider only SRS in the present investigation.
Now consider a population $\Omega$ of N units in which two sensitive characteristics, say A and B, are present. As an example, let A be the group of people who are smokers, and B the group of people who are drinkers. We present such a population with a Venn diagram in Figure 1. Let $\pi_a$ be the population proportion of those who belong only to the group A, that is, the group $A \cap B^c$. Let $\pi_b$ be the population proportion of those who belong only to the group B, that is, the group $A^c \cap B$. Let $\pi_{ab}$ be the population proportion of those who belong to both the groups A and B, that is, the group $A \cap B$. Let $W_1$ be the proportion of people in the population who do not think that membership in A but not in B (i.e. membership in $A \cap B^c$) is a sensitive issue, and so would respond directly to a question on membership in $A \cap B^c$. Note that they may or may not actually be members of $A \cap B^c$. Let $\Omega_1$ represent the corresponding subpopulation of $\Omega$.
Let $W_2$ be the proportion of people in the population who do think that membership in $A \cap B^c$ is sensitive, and so would only respond to indirect questioning (using a randomization device). They also may or may not be members of $A \cap B^c$. Let $\Omega_2$ represent the corresponding subpopulation of $\Omega$. Let $W_3$ be the proportion of people in the population who do not think that joint membership in A and B (i.e. membership in $A \cap B$) is a sensitive issue, and so would respond to a direct question of whether they are in $A \cap B$ or not (i.e. in $A^c \cap B^c$). We will assume they are in both ($A \cap B$) or neither ($A^c \cap B^c$). Let $\Omega_3$ represent the corresponding subpopulation of $\Omega$. Let $W_4$ be the proportion of people in the population who do think that joint membership in A and B (i.e. membership in $A \cap B$) is a sensitive issue, and so could only respond to indirect questioning (using a randomization device). Again, they are assumed to be members of both ($A \cap B$) or neither ($A^c \cap B^c$). Let $\Omega_4$ represent the corresponding subpopulation of $\Omega$. Let $W_5$ be the proportion of people in the population who do not think that membership in B but not in A (i.e. membership in $A^c \cap B$) is sensitive, and so would be willing to respond directly to the question of whether or not they are in $A^c \cap B$. Their actual status may or may not be in $A^c \cap B$. Let $\Omega_5$ represent the corresponding subpopulation of $\Omega$. Let $W_6$ be the proportion of people in the population who do think that membership in B but not in A (i.e. membership in $A^c \cap B$) is a sensitive issue, and so would only respond to an indirect question (using a randomization device) on such membership. Again, they may or may not actually be members of $A^c \cap B$. Let $\Omega_6$ represent the corresponding subpopulation of $\Omega$. For such a setup, note that $\Omega = \bigcup_{i=1}^{6} \Omega_i$ and $\sum_{i=1}^{6} W_i = 1$, where $\Omega_5$ and $\Omega_6$ are analogous to $\Omega_1$ and $\Omega_2$ of the population $\Omega$ when the sensitive characteristic B is considered in place of A.
Let $\pi^d_a$ be the proportion of people in the group $A \cap B^c$ in the sub-population $\Omega_1$. Let $\pi^r_a$ be the proportion of people in the group $A \cap B^c$ in the sub-population $\Omega_2$. Let $\pi^d_{ab}$ be the proportion of people in the group $A \cap B$ in the sub-population $\Omega_3$. Let $\pi^r_{ab}$ be the proportion of people in the group $A \cap B$ in the sub-population $\Omega_4$. Now the true proportion $\pi_A$ of people possessing the sensitive characteristic A in the entire population can be written as [equation missing in source]. Note that in (15) [text missing in source]. It is worth pointing out that the assumptions about the sub-populations $\Omega_3$ and $\Omega_4$ are made in order to reduce the complexity of the computations.
One might also proceed by considering the sixteen possibilities based on the 'Thinking' and 'Status' of a respondent, as shown in the 4 × 4 contingency table in Table 1.
From Table 1, the parameters of interest will be [equations missing in source]. Table 1 indicates the complexity of using the CM-ORRT when dealing with more than one sensitive characteristic.
In the next section, we propose estimators of $\pi_A$, $\pi_{AB}$ and $\pi_B$ by making use of the six options available to the respondents selected in the sample; one might also develop new estimators by the approach illustrated in Table 1.

Proposed optional randomized response technique estimators
We introduce some notation that will be useful, when incorporating the optional randomized response technique, in developing our new estimators of the three parameters.
Suppose we have selected a random sample $s$ of $n$ units by the SRSWR scheme. For $i = 1, 2, \ldots, 6$, let $s_i$ be the sub-sample of $n_i$ units from the sub-population $\Omega_i$.
Note that $s = \bigcup_{i=1}^{6} s_i$ and $\sum_{i=1}^{6} n_i = n$, where $s_5$ and $s_6$ are analogous to $s_1$ and $s_2$ of the sample $s$ when the sensitive characteristic B is considered in place of A.
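As an illustration of this sample split, the following minimal Python sketch draws the six sub-sample sizes for an SRSWR sample. It assumes, in line with the multinomial assumption the paper states later for the $n_i$, that every respondent falls in exactly one of the six sub-populations. The function name `split_sample` and the particular $W_i$ values are hypothetical choices made only for illustration; they are not taken from the paper's SAS macro.

```python
import random
from collections import Counter

# Hypothetical sub-population proportions W_1..W_6 (must sum to 1).
# W_2 = W_4 = W_6 = 0.05 mirrors one simulation setting in the text;
# the remaining three values are assumptions made for this sketch.
W_EXAMPLE = [0.30, 0.05, 0.20, 0.05, 0.35, 0.05]


def split_sample(n, weights, seed=0):
    """Draw n respondents with replacement and count how many fall
    in each of the six sub-populations (a multinomial draw)."""
    if len(weights) != 6:
        raise ValueError("exactly six sub-population proportions are expected")
    rng = random.Random(seed)
    labels = rng.choices(range(1, 7), weights=weights, k=n)
    counts = Counter(labels)
    return [counts[i] for i in range(1, 7)]


if __name__ == "__main__":
    n = 1000
    n_sub = split_sample(n, W_EXAMPLE, seed=1)
    w_hat = [n_i / n for n_i in n_sub]  # sample proportions of each option
    print("sub-sample sizes:", n_sub)
    print("all n_i >= 1:", all(n_i >= 1 for n_i in n_sub))
    print("estimated proportions:", w_hat)
```

With small $n$ or small $W_i$ some sub-samples can come out empty, which is why a condition such as $n_i \geq 1$ has to be checked before any sub-sample estimator is computed.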
Under the assumption that all sub-sample sizes are greater than or equal to one, that is, $n_i \geq 1$ for $i = 1, 2, \ldots, 6$, we have the following results.
Let $x_1$ be the number of 'yes' responses from the $n_1$ respondents in the sub-sample $s_1$ to the direct question, 'Are you a member of the group $A \cap B^c$?' Thus $x_1 \sim B(n_1, \pi^d_a)$, that is, $x_1$ follows a binomial distribution with parameters $n_1$ and $\pi^d_a$. Obviously, $\hat{\pi}^d_a = x_1/n_1$ is an unbiased estimator of $\pi^d_a$, with variance given by
$$V(\hat{\pi}^d_a) = \frac{\pi^d_a (1 - \pi^d_a)}{n_1}.$$
Let $x_2$ be the number of 'yes' responses from the $n_2$ respondents in the sub-sample $s_2$ obtained through a randomization device, say the Warner [5] model with the two questions 'Are you a member of the group $A \cap B^c$?', with device parameter $P_0$, and 'Are you a member of the group $(A \cap B^c)^c$?', with probability $(1 - P_0)$. Obviously, the probability of a 'yes' answer from the sub-sample $s_2$ will be
$$\theta_W = P_0 \pi^r_a + (1 - P_0)(1 - \pi^r_a).$$
In other words, $x_2 \sim B(n_2, \theta_W)$, that is, $x_2$ follows a binomial distribution with parameters $n_2$ and $\theta_W$. Obviously,
$$\hat{\pi}^r_a = \frac{x_2/n_2 - (1 - P_0)}{2P_0 - 1}, \quad P_0 \neq 0.5,$$
is an unbiased estimator of $\pi^r_a$ with variance given by
$$V(\hat{\pi}^r_a) = \frac{\pi^r_a (1 - \pi^r_a)}{n_2} + \frac{P_0 (1 - P_0)}{n_2 (2P_0 - 1)^2}.$$
Let $x_3$ be the number of 'yes' responses in the sub-sample $s_3$ of $n_3$ respondents through direct questioning on whether or not they are members of both groups (i.e. membership in $A \cap B$).
Thus $x_3 \sim B(n_3, \pi^d_{ab})$, that is, $x_3$ follows a binomial distribution with parameters $n_3$ and $\pi^d_{ab}$. Obviously, $\hat{\pi}^d_{ab} = x_3/n_3$ is an unbiased estimator of $\pi^d_{ab}$ with variance given by
$$V(\hat{\pi}^d_{ab}) = \frac{\pi^d_{ab} (1 - \pi^d_{ab})}{n_3}.$$
For the fourth sub-sample $s_4$ of $n_4$ respondents, we suggest using the crossed model due to Lee, Sedory and Singh [3,4], which we briefly review below for the reader.
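The estimators for the directly questioned sub-samples and for the Warner-type sub-sample can be sketched in Python as follows. This is a minimal illustration, assuming the standard Warner formulation in which a 'yes' answer has probability $\theta_W = P_0\pi + (1 - P_0)(1 - \pi)$ and all respondents answer truthfully. The function names are hypothetical, and the variances are conditional on the sub-sample size.

```python
def direct_estimate(yes_count, n):
    """Sample proportion of 'yes' answers under direct questioning."""
    return yes_count / n


def direct_variance(pi, n):
    """Binomial variance of the direct estimator for sub-sample size n."""
    return pi * (1 - pi) / n


def warner_yes_probability(pi, p0):
    """Probability of a 'yes' under Warner's device with parameter p0."""
    return p0 * pi + (1 - p0) * (1 - pi)


def warner_estimate(yes_count, n, p0):
    """Moment estimator of pi from the observed proportion of 'yes' answers."""
    if abs(2 * p0 - 1) < 1e-12:
        raise ValueError("p0 = 0.5 makes the proportion non-identifiable")
    theta_hat = yes_count / n
    return (theta_hat - (1 - p0)) / (2 * p0 - 1)


def warner_variance(pi, n, p0):
    """Variance of the Warner estimator for sub-sample size n."""
    theta = warner_yes_probability(pi, p0)
    return theta * (1 - theta) / (n * (2 * p0 - 1) ** 2)


if __name__ == "__main__":
    p0, pi, n = 0.3, 0.2, 100  # illustrative values only
    print("P(yes):", warner_yes_probability(pi, p0))
    print("Var (Warner):", warner_variance(pi, n, p0))
    print("Var (direct):", direct_variance(pi, n))
```

For the same sub-sample size the Warner variance exceeds the direct-question variance by $P_0(1 - P_0)/\{n(2P_0 - 1)^2\}$, which is the price paid for the privacy protection and the reason direct responses can improve efficiency.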
Each respondent in the sub-sample $s_4$ of $n_4$ respondents is a member of the sub-population $\Omega_4$ and is requested to use the crossed model. Each of these respondents is provided with two shuffled decks of cards, marked Deck-I and Deck-II. Each deck comprises two types of cards, bearing statements that the drawer possesses, or does not possess, a sensitive characteristic, in the proportions shown in Figure 2.
Then each respondent is requested to draw one card from each deck, and read the statements in order. The respondent first matches his/her status with the statement written on the card taken from the first deck, and then matches his/her status with the statement written on the card taken from the second deck. In neither case is the card message revealed to the interviewer. In the crossed model, the composition of the decks is as follows. Deck-I consists of cards each bearing one of two statements: 'I belong to the sensitive group A', with probability $P$, and 'I belong to the non-sensitive group $B^c$', with probability $(1 - P)$. Deck-II also consists of cards each bearing one of two statements: 'I belong to the sensitive group B', with probability $T$, and 'I belong to the non-sensitive group $A^c$', with probability $(1 - T)$. From the population $\Omega_4$, by following the notation of Lee, Sedory and Singh [3,4] for the crossed model, the probabilities of obtaining (yes, yes), (yes, no), (no, yes) and (no, no) responses are, respectively, given by [equations missing in source]. Let $\theta^{*r}_{11}$, $\theta^{*r}_{10}$, $\theta^{*r}_{01}$ and $\theta^{*r}_{00}$ be the observed proportions of (yes, yes), (yes, no), (no, yes) and (no, no) responses in the sub-sample $s_4$, based on Lee et al. [3,4]. An unbiased estimator of $\pi^r_{ab}$ is given by [equation missing in source]. Let $\hat{W}_i = n_i/n$ be an unbiased estimator of $W_i$, $i = 1, 2, \ldots, 6$. Note that $\sum_{i=1}^{6} \hat{W}_i = 1$. Assuming $(n_1, \ldots, n_6) \sim \text{Multinomial}(n; W_1, \ldots, W_6)$, we have the following lemma:
Lemma 2.1: The variance of $\hat{W}_i$ is given by
$$V(\hat{W}_i) = \frac{W_i (1 - W_i)}{n}$$
and the covariance between $\hat{W}_i$ and $\hat{W}_j$, $i \neq j$, is given by
$$\text{Cov}(\hat{W}_i, \hat{W}_j) = -\frac{W_i W_j}{n}.$$
Now we have the following theorem on an unbiased estimator $\hat{\pi}^k_A$ of $\pi_A$ [theorem statement missing in source]. Proof: See online documented Appendix A.
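The response probabilities of the crossed model reviewed above can be illustrated by enumerating the card pairs directly from the deck description. The Python sketch below is not the expression used by Lee, Sedory and Singh [3,4]; it is a hedged reconstruction that assumes truthful answers and independent draws from the two decks. The function name `crossed_model_probs` and the cell proportions in the example are hypothetical.

```python
from itertools import product


def crossed_model_probs(cells, P, T):
    """Return P(answer to Deck-I, answer to Deck-II) for the crossed model.

    cells maps a respondent status (in_A, in_B), given as booleans,
    to its proportion in the sub-population; proportions must sum to 1.
    """
    if abs(sum(cells.values()) - 1.0) > 1e-9:
        raise ValueError("cell proportions must sum to 1")
    # Each card is (selection probability, truthful answer given status).
    deck1 = [(P, lambda a, b: a),           # 'I belong to group A'
             (1 - P, lambda a, b: not b)]   # 'I belong to group B^c'
    deck2 = [(T, lambda a, b: b),           # 'I belong to group B'
             (1 - T, lambda a, b: not a)]   # 'I belong to group A^c'
    probs = {pair: 0.0 for pair in product((True, False), repeat=2)}
    for (a, b), share in cells.items():
        for (p1, say1), (p2, say2) in product(deck1, deck2):
            probs[(say1(a, b), say2(a, b))] += share * p1 * p2
    return probs


if __name__ == "__main__":
    # Members are in both groups or in neither, as assumed for this
    # sub-population; the 10% share in A-and-B is purely illustrative.
    cells = {(True, True): 0.1, (False, False): 0.9}
    for answers, prob in crossed_model_probs(cells, P=0.7, T=0.7).items():
        print(answers, round(prob, 4))
```

Passing all four status cells instead of two gives the response probabilities for a general population, which is one way to check any closed-form expression against the deck composition.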

Theorem 2.2: The variance of the estimator $\hat{\pi}^k_A$ is given by [equations missing in source, through Equation (41)]. Proof: See online documented Appendix A.

Theorem 2.3: An estimator of $V(\hat{\pi}^k_A)$ is given by [equations missing in source]. Proof: Obvious by the method of moments.
In the next theorem, we consider the problem of estimating $\pi_{ab}$, the population proportion of people possessing both sensitive characteristics A and B.
Theorem 2.5: The variance of the estimator $\hat{\pi}^k_{ab}$ is given by [equation missing in source]. Proof: See online documented Appendix A.
Theorem 2.6: An estimator of $V(\hat{\pi}^k_{ab})$ is given by [equation missing in source]. Proof: Obvious by the method of moments.
Next we consider the problem of estimating $\pi_B$, the population proportion of those possessing the sensitive characteristic B, which is analogous to the problem of estimating $\pi_A$.
Theorem 2.7: An unbiased estimator $\hat{\pi}^k_B$ of $\pi_B$ is given by [equation missing in source], where $\hat{W}_5$ and $\hat{W}_6$ are analogous to $\hat{W}_1$ and $\hat{W}_2$ when the characteristic B is considered instead of A.
Proof: It follows from the previous theorems.

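Theorem 2.8: The variance of the estimator $\hat{\pi}^k_B$ is given by [equation missing in source]. Proof: It follows from the previous theorem.
Theorem 2.9: An estimator of $V(\hat{\pi}^k_B)$ is given by [equation missing in source]. Proof: It follows from the method of moments.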
In the next section, we simulate situations in which the proposed optional randomized response method can perform better than the Lee, Sedory and Singh [3,4] estimators based on their crossed model.
[Definitions of the percent relative efficiencies REA, REB and REAB missing in source.]
In the SAS macro created for the simulation, we set the parameter values to $P = 0.7$, $T = 0.7$, $P_0 = 0.3$, $\pi_A = 0.05$, $\pi_B = 0.05$, and $\pi_{AB} = 0.02$. We then assume that $\pi^r_{a|\Omega_4} = 0.1\pi_A$, $\pi^r_{b|\Omega_4} = 0.1\pi_B$, and $\pi^r_{ab} = 0.1\pi_{ab}$. The values of REA, REB and REAB then depend on the choice of $W_2$, $W_4$ and $W_6$, and do not depend on the sample size. In the first row of Table 2, $W_2 = W_4 = W_6 = 0.05$, that is, 5% of the people in each case (15% overall) think that they should reply using a randomization device irrespective of their membership; then the value of REA is 557.7%, of REB is 557.7% and of REAB is 1110.6%. When $W_2 = W_4 = 0.05$ and $W_6 = 0.10$, the value of REA is 557.7%, of REB is 376.9% and of REAB is 1110.6%, and so on. In the 10th row of Table 2, $W_2 = 0.05$, $W_4 = 0.10$ and $W_6 = 0.15$, that is, 30% of the people prefer to use a randomization device; then REA = 475.4%, REB = 261.5% and REAB = 717.8%. The rest of Table 2 can be interpreted in the same way.
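The defining expressions for REA, REB and REAB are not available in the text above. A common convention, assumed here rather than confirmed by the source, is the percent relative efficiency: the variance of the competing (crossed model) estimator divided by the variance of the proposed estimator, multiplied by 100. A minimal Python sketch of that convention, with a hypothetical function name and made-up variance values, is:

```python
def percent_relative_efficiency(var_competitor, var_proposed):
    """Percent relative efficiency of a proposed estimator.

    Values above 100 mean the proposed estimator has the smaller
    variance. This ratio convention is an assumption of the sketch.
    """
    if var_proposed <= 0:
        raise ValueError("the variance of the proposed estimator must be positive")
    return 100.0 * var_competitor / var_proposed


if __name__ == "__main__":
    # Hypothetical variances, not taken from the paper's tables.
    v_crossed, v_optional = 0.0040, 0.0010
    print(percent_relative_efficiency(v_crossed, v_optional))
```

When both variances are proportional to $1/n$, the sample size cancels in the ratio, which is consistent with the remark that the reported efficiencies do not depend on the sample size.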
In Table 3 we provide summary results for the values of REA, REB and REAB. For $\pi_A = 0.05$, $\pi_B = 0.05$ and $\pi_{ab} = \pi_{AB} = 0.02$, the values of $W_2$, $W_4$ and $W_6$ are chosen so that their sum is less than 50%, which means that fewer than 50% of people use a randomized response device. Then the average value of REA is $\mu_{REA} = 311.220\%$ with a standard deviation of $\sigma_{REA} = 113.241\%$, the average value of REB is $\mu_{REB} = 320.533\%$ with a standard deviation of $\sigma_{REB} = 113.735\%$, and the average value of REAB is $\mu_{REAB} = 725.981\%$ with a standard deviation of $\sigma_{REAB} = 301.130\%$. The rest of the results in Table 3 can be interpreted likewise. A graphical presentation of the REA, REB and REAB values for the various values of $\pi_A$, $\pi_B$ and $\pi_{AB}$ considered is shown in nine panels in Figure 3.
One point is clear from Figure 3: as the values of PIA, PIB and PIAB approach zero, the RE bars become taller, indicating that the proposed optional randomized response technique could be beneficial when all three proportions are small, which speaks in favour of its practicality. Table 4 shows the behaviour of the average values of REA, REB and REAB, along with their standard deviations, for different combinations of $W_2$, $W_4$ and $W_6$. If $W_2 = 0.05$, $W_4 = 0.05$ and $W_6 = 0.05$, meaning that 15% of people think those questions could be sensitive irrespective of their own status, then the average value of REA is 327.495% with a standard deviation of 113.976%, the average value of REB is 351.156% with a standard deviation of 112.511%, and the average value of REAB is 746.409% with a standard deviation of 260.671%. If $W_2 = 0.05$, $W_4 = 0.05$ and $W_6 = 0.10$, meaning that 20% of people think those questions could be sensitive irrespective of their own status, then the average value of REA is still 327.495% with a standard deviation of 113.976%, the average value of REB reduces to 258.932% with a standard deviation of 68.484%, and the average value of REAB remains 746.409% with a standard deviation of 260.671%. Clearly, as the value of $W_6$ increases to 10%, the average value of REB decreases, with a smaller standard deviation, because more people think that responding to the question on B could be a sensitive issue than think so about the question on A or on A and B. The other results in Table 4 can be interpreted likewise. A pictorial presentation of the raw results is displayed in Figure 4.
From the panel in the first row and first column of Figure 4, a lower value of $W_2$ leads to a higher value of REA; from the second row and third column, a low value of $W_6$ gives a higher value of REB; and from the third row and second column, a low value of $W_4$ gives a higher value of REAB. Thus, the optional randomized response model for two (or more) sensitive characteristics is not always more efficient than the estimators based on the crossed model. We noted through simulation studies that if fewer than 50% of people think that the three types of questioning on A, AB and B are sensitive issues, then the proposed optional randomized response model can perform better than the crossed model for all three estimators. This result differs from the pioneering result of Chaudhuri and Mukerjee [1,2], whose estimator is always more efficient than Warner's model in the case of a single sensitive characteristic. Thus, we leave it as an open question for further investigation whether an optional randomized response method could be developed whose estimates are always more efficient than the three estimators obtained from the crossed model of Lee, Sedory and Singh [3,4]. More detailed work can be found in Pushadapu [33], and a part of this work was presented by Pushadapu and Singh [34].

Figure 2. Ordered pair of decks in crossed model.

Theorem 2.4: An unbiased estimator of $\pi_{ab}$ is given by $\hat{\pi}^k_{ab}$ = [equation missing in source]. Proof: See online documented Appendix A.

Figure 3. Values of REA, REB, and REAB for various values of PIA, PIB and PIAB.

Figure 4. Graphical visualization of effect of W2, W4 and W6 on the values of REA, REB and REAB.

Table 1. Complexity of the CM-ORRT with multiple sensitive questions.

Table 2. Complete set of results can be obtained by executing the SAS Macro.

Table 3. Summary of the values of REA, REB, and REAB.

Table 4. Average values and standard deviations of REA, REB and REAB for various choices of $W_2$, $W_4$ and $W_6$.