Nonparametric regression with nonignorable missing covariates and outcomes using bounded inverse weighting

We consider nonparametric regression where the covariate and the outcome variable are both subject to missingness. Previous work discussed settings where only one of the variables may be missing, but not both. Since missing at random is not an appropriate assumption in such a nonmonotone missing data context, we assume a missing not at random mechanism. We construct an inverse probability weighting local polynomial estimator based on a recently developed nonmonotone missing data model. It is well known that if the inverse probability weights are too large at some fully observed cases, the resulting estimator deteriorates. To overcome this issue, we introduce a constrained maximum likelihood estimation method and an estimating equations method that ensure the resulting weights are bounded. We prove asymptotic normality of the resulting regression estimator. Simulation results show good practical performance of our method, and a real data example is also presented.


Introduction
Missing data often arise in social and medical studies. If the missing data mechanism is not independent of the variables of interest, i.e. the data are not missing completely at random (MCAR), then simply deleting incomplete cases and conducting a complete case analysis usually leads to biased results; see Little and Rubin (2002) for the taxonomy of missing data mechanisms. Tremendous work has been done in the literature to develop consistent regression estimators when variables are subject to missingness, most of which is based on parametric models. A popular technique to deduce a consistent estimator from a complete case estimator is to weight the estimating equations by the inverse of the conditional probability of being a complete case, namely, inverse probability weighting (IPW; Robins, Rotnitzky, and Zhao 1994, 1995). To improve efficiency, many IPW methods are enhanced with augmented terms, and the resulting estimators may have the so-called doubly robust property (Scharfstein, Rotnitzky, and Robins 1999), i.e. for consistent estimation of the parameter of interest, it suffices that either the missing data model or the regression model is correctly specified; see Rotnitzky and Vansteelandt (2014) for a detailed review of doubly robust methods.
Nonparametric regression methods that deal with missing data have been discussed relatively less. For missing covariates, Wang, Wang, Gutierrez, and Carroll (1998) used IPW local linear regression in generalised linear models, and Liang, Wang, Robins, and Carroll (2004) considered a nonparametric estimator in a partially linear model. For missing outcomes, Wang, Rotnitzky, and Lin (2010) developed a doubly robust local linear estimator and Sun, Wang, and Han (2020) generalised it to multiple robustness; Chen, Fan, Li, and Zhou (2006) constructed several local quasi-likelihood estimators; estimating equations with nonparametrically imputed values were developed by Zhou, Wan, and Wang (2008); and Wang, Cao, Du, and Zhang (2019) considered the case where the covariate is functional. Efromovich generalised the orthogonal series estimator to the cases of missing covariates (Efromovich 2011) and missing outcomes (Efromovich 2011). Apparently, all of these articles considered that either the covariate or the outcome variable is subject to missingness, but not both, and they all assumed a missing at random (MAR) mechanism, which, in the case of one variable subject to missingness, simply means conditional independence between the missing indicator and the variable subject to missingness given all other fully observed variables.
In this paper, we develop an IPW local polynomial estimator that accommodates both the covariate and the outcome variable being subject to missingness. Such nonmonotone missingness is much more difficult to model than missingness of a single variable or monotone missingness (also known as drop-out in longitudinal studies). One reason is that MAR is hard to justify in the nonmonotone missingness context, as argued in the literature (Robins and Gill 1997; Little and Rubin 2002; Sun and Tchetgen 2018). Indeed, without fully observed auxiliary variables, the commonly used multinomial model fails to include MAR mechanisms that are not MCAR; see more details in Section 2.1. Thus, it is more plausible to assume that the missing data mechanism is missing not at random (MNAR). There has been a growing number of recent works on nonmonotone MNAR data, e.g. Sadinle and Reiter (2017), Tchetgen, Wang, and Sun (2018), Chen (2020), Chen and Heitjan (2022), Malinsky, Shpitser, and Tchetgen Tchetgen (2021) and Scharfstein et al. (2022). Certain identification conditions are needed to make valid inference under MNAR. Sadinle and Reiter (2017) proposed the itemwise conditional independence conditions; Chen (2020) proposed pattern graphs that generate a class of identification conditions, including the complete-case missing value (CCMV) restriction (Little 1993; Tchetgen et al. 2018) and the transformed observed data restriction (Linero 2017). In particular, Tchetgen et al. (2018) showed that CCMV is compatible with, and easy to parametrise within, the commonly used multinomial model. We thus adopt this model in our context. Also, following Wang et al. (2010) and Sun et al. (2020), we include fully observed auxiliary variables to better model the missing data mechanism.
It is well known that IPW-type estimators for general missing data problems do not perform well if the conditional probability of being a complete case is too close to zero (i.e. the weight is too large) at some fully observed cases; see Kang and Schafer (2007) for an example of estimating the population mean in the context of MAR outcomes with fully observed covariates. The problem has been investigated in that context by a number of papers (e.g. Robins, Sued, Lei-Gomez, and Rotnitzky 2007; Cao, Tsiatis, and Davidian 2009), but scarcely for general missing data patterns. In our context where covariates and outcomes are both missing, the problem of small conditional probabilities may be even worse, since the percentage of complete cases is generally smaller than in the context of a single missing variable. Therefore, we further develop two estimation strategies to ensure that the resulting inverse probability of being a complete case is sample bounded above for fully observed cases. The first is based on constrained maximum likelihood estimation (CMLE). We show that it is asymptotically equivalent to the unconstrained (standard) MLE with probability one, and thus asymptotically efficient. The second borrows the estimating equations (EE) technique of Vermeulen and Vansteelandt (2015). Their strategy is based on estimating a parametric regression model and a missing data model simultaneously. Here we show how to bypass any parametric regression model and estimate the missing data model in our context. A similar bounded property holds for various parametric estimators (e.g. Robins et al. 2007; Cao et al. 2009; Tan 2010; Vermeulen and Vansteelandt 2015) in the context of MAR outcomes with fully observed covariates.
The paper is structured as follows. We introduce our model and data in Section 2. In Section 3, we briefly review the standard MLE, and then introduce our CMLE and EE to estimate the unknown parameters in the missing data model. In Section 4, we construct an IPW local polynomial estimator and prove that it is asymptotically normal. A fully data-driven bandwidth selection is also provided. Simulations and a real data example are discussed in Sections 5 and 6, respectively. We conclude in Section 7. Technical proofs are given in the supplementary material.

Model and data
Let X be the continuous real-valued covariate and Y be the outcome variable. Both variables are subject to missingness. Let R_X and R_Y be the missing indicators: R_X = 1 (R_Y = 1) if X (Y) is observed and R_X = 0 (R_Y = 0) otherwise. Any missing data pattern is allowed, i.e. for an individual, it is possible that the covariate is observed and the outcome is missing, or vice versa, or both of them are missing. We refer to this as nonmonotone missing data, as in the literature.
It may happen in practice that there are other variables, e.g. demographic variables, related to the missing data mechanism, while they have little to do with the outcome variable and thus are not of primary interest. We call them auxiliary variables, denoted by U. We assume that U is always fully observed and we include it to better model the missing data mechanism. As auxiliary variables, U is not included among the covariates. That is, we are still interested in estimating m(x) = E(Y|X = x) instead of E(Y|X = x, U = u). For example, suppose we are interested in estimating systolic blood pressure given one's body mass index, where both variables are subject to missingness. In this case, if we believe some fully observed demographic variables (e.g. gender and age) are related to the missing data mechanism, then they may serve as auxiliary variables; see Section 6 for details. Inclusion of auxiliary variables has also been considered in the literature on nonparametric regression with missing Y, e.g. Wang et al. (2010) and Sun et al. (2020). The IPW idea (Robins et al. 1994, 1995) can be used to construct regression estimators in general missing data contexts. We intend to use this idea to construct a nonparametric regression estimator for m, but we use a parametric model for the missing data mechanism, as is often done in the literature on a single missing variable (e.g. Wang et al. 2010). The main reason for using a parametric missing data model is that nonparametric techniques are less effective for multivariate covariates. However, we will introduce new estimation procedures for the missing data model (Sections 3.2 and 3.3) to produce stable estimates and mitigate the effect of model misspecification.

The missing data model
Since the numbers of different missing data patterns naturally follow the multinomial distribution, we seek a conditional multinomial distribution model for the missing data mechanism. The most widely used model to describe such a distribution is the following multinomial logistic model: for (r_x, r_y) ≠ (1, 1),

π_{r_x,r_y}(x, y, u; α) = exp{1(r_x = 0, r_y = 0) g_1(x, y, u; α_1) + 1(r_x = 1, r_y = 0) g_2(x, y, u; α_2) + 1(r_x = 0, r_y = 1) g_3(x, y, u; α_3)} / D(x, y, u; α),   (1)

where 1(·) denotes the indicator function, D(x, y, u; α) = 1 + Σ_{k=1}^{3} exp{g_k(x, y, u; α_k)}, so that π_{1,1}(x, y, u; α) = 1/D(x, y, u; α), and the g_k's are known functions with unknown parameters α_k, for k = 1, 2 and 3. Commonly used choices for the g_k's include linear functions of powers and/or products of the arguments. More complicated forms may be specified if practitioners have specific domain knowledge. This model has been used routinely in economics, transportation and many other contexts, but it has received little attention in missing data problems (Tchetgen et al. 2018). A possible reason is that if we use (1) to model a MAR mechanism, the missing indicators would only depend on the auxiliary variables U. In other words, in the case of no auxiliary variables U, MAR under (1) necessarily reduces to MCAR (Robins and Gill 1997).
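To fix ideas, the pattern probabilities implied by a model of the form (1) can be computed directly. The sketch below is illustrative only: the linear g_k's, the parameter values, and the helper name `pattern_probs` are our own choices, not part of the paper's formal development.

```python
import numpy as np

def pattern_probs(x, y, u, alpha1, alpha2, alpha3):
    """Pattern probabilities under a multinomial logistic model of the form (1),
    with hypothetical linear g_k's in CCMV form: g1 depends on u only,
    g2 on (x, u), g3 on (y, u)."""
    g1 = alpha1[0] + alpha1[1] * u                   # pattern (0, 0): both missing
    g2 = alpha2[0] + alpha2[1] * x + alpha2[2] * u   # pattern (1, 0): only X observed
    g3 = alpha3[0] + alpha3[1] * y + alpha3[2] * u   # pattern (0, 1): only Y observed
    D = 1.0 + np.exp(g1) + np.exp(g2) + np.exp(g3)   # D(x, y, u; alpha)
    return {(1, 1): 1.0 / D,                         # complete case: pi_{1,1} = 1/D
            (0, 0): np.exp(g1) / D,
            (1, 0): np.exp(g2) / D,
            (0, 1): np.exp(g3) / D}
```

By construction the four probabilities sum to one, and the complete-case probability π_{1,1} is exactly the reciprocal of D, which is the quantity inverted in the IPW estimator later.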
Indeed, it is difficult to model and justify MAR mechanisms in the nonmonotone missing data context, as noted in the literature (Robins and Gill 1997; Little and Rubin 2002; Tchetgen et al. 2018). In the least restrictive setting without U, MAR would require the missingness of X to depend on Y and the missingness of Y to depend on X (Little and Rubin 2002, Ch. 1). Therefore, we assume throughout that the missing data mechanism is MNAR and allow the missing indicators to depend on all variables.
In the case of MNAR, certain identification conditions are needed to identify the joint distribution of (X, Y). Little (1993)'s complete-case missing value (CCMV) restriction is particularly suitable for nonmonotone missing data. When it is applied to (1), Tchetgen et al. (2018) showed that it is equivalent to requiring

g_1(x, y, u; α_1) = g_1(u; α_1),  g_2(x, y, u; α_2) = g_2(x, u; α_2),  g_3(x, y, u; α_3) = g_3(y, u; α_3).   (2)

That is, the g_k's depend only on the data observed under the corresponding missing data pattern, according to the correspondence k = 1 ↔ (r_x, r_y) = (0, 0), k = 2 ↔ (1, 0) and k = 3 ↔ (0, 1). More general identification conditions can be applied to the multinomial logistic model (1), as shown in Chen (2020), but they require the specification of a pattern graph, which may not be clear to practitioners. Throughout, we assume that the identification conditions (2) hold. Since no identification condition is testable in missing data problems, it is useful to conduct a sensitivity analysis, which we discuss in Section 7. Next, we discuss different strategies to estimate the unknown parameters in (1).

Maximum likelihood estimation
In this subsection, we briefly review the standard MLE of the multinomial logistic model proposed by Tchetgen et al. (2018) in the context of general nonmonotone missing data. Let π_{r_x,r_y}(x, y, u; α) = P(R_X = r_x, R_Y = r_y | X = x, Y = y, U = u; α). Since all the π_{r_x,r_y}'s depend on both X and Y, which may be missing in model (1), a direct likelihood of (R_X, R_Y | X, Y, U) is not available. Therefore, instead of estimating the π_{r_x,r_y}'s directly, we estimate the g_k's first. Note that for (r_x, r_y) ≠ (1, 1),

π_{r_x,r_y}(x, y, u; α) / π_{1,1}(x, y, u; α) = exp{g_{r_x,r_y}(x, y, u; α)},   (3)

where g_{r_x,r_y} denotes the g_k corresponding to the pattern (r_x, r_y), and, due to the CCMV restriction (2), the right hand side of (3) depends only on observed data. For example, when (r_x, r_y) = (1, 0), we have

π_{1,0}(x, y, u; α) / π_{1,1}(x, y, u; α) = exp{g_2(x, u; α_2)},

where the right hand side depends only on x and u, and for the missing data patterns (R_X, R_Y) ∈ {(1, 0), (1, 1)}, X and U are always observed.
Thus, to estimate the g_{r_x,r_y} associated with a missing data pattern (r_x, r_y) ≠ (1, 1), we restrict ourselves to the subsample

I_{r_x,r_y} = {i : (R_{X_i}, R_{Y_i}) ∈ {(r_x, r_y), (1, 1)}},   (4)

and define the following log-likelihood of α:

log L_r(α) = Σ_{i ∈ I_{r_x,r_y}} [1{(R_{X_i}, R_{Y_i}) = (r_x, r_y)} g_{r_x,r_y}(X_i, Y_i, U_i; α) − log(1 + exp{g_{r_x,r_y}(X_i, Y_i, U_i; α)})].   (5)

The MLE α̂_MLE is defined as argmax_α log L_r(α). We then apply the above MLE for all of the missing data patterns (r_x, r_y) ≠ (1, 1) to obtain the estimators g_{r_x,r_y}(x, y, u; α̂_MLE) of g_{r_x,r_y}(x, y, u; α). It follows that the estimator of π_{1,1} to be used for IPW estimation is defined as π_{1,1}(x, y, u; α̂_MLE) = 1/D(x, y, u; α̂_MLE). The above strategy was used in Tchetgen et al. (2018).
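Because each g_{r_x,r_y} enters a binary comparison of pattern (r_x, r_y) against the complete cases within I_{r_x,r_y}, the MLE step amounts to one logistic regression per pattern. A minimal sketch, assuming a linear g_r with intercept; the helper name `fit_pattern_mle` and the use of `scipy.optimize.minimize` are our own illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

def fit_pattern_mle(Z, d):
    """MLE of alpha in g_r(z; alpha) = alpha[0] + alpha[1:] @ z over I_r.

    Z : (m, p) array of the variables observed under pattern r (plus U),
        for the units in I_r (pattern-r cases together with complete cases).
    d : (m,) 0/1 array; d_i = 1 if unit i has pattern r, 0 if complete.
    This is an ordinary logistic regression of d on Z, matching the
    Bernoulli log-likelihood in (5)."""
    Z1 = np.column_stack([np.ones(len(Z)), Z])

    def negloglik(a):
        g = Z1 @ a
        # -sum_i [ d_i g_i - log(1 + exp(g_i)) ], stable via logaddexp
        return -np.sum(d * g - np.logaddexp(0.0, g))

    res = minimize(negloglik, x0=np.zeros(Z1.shape[1]), method="BFGS")
    return res.x
```

Repeating this over the three incomplete patterns and plugging the fitted g's into D yields π_{1,1}(·; α̂_MLE) = 1/D(·; α̂_MLE).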
However, although the true π_{1,1} is assumed to be bounded below by some δ > 0 in the IPW methodology, it is well known that if an estimator π̂_{1,1} = π_{1,1}(·; α̂) is too close to 0 at some fully observed cases (i.e. cases for which R_{X_i} = R_{Y_i} = 1), the inverse probability 1/π̂_{1,1} can be extremely large, which leads to unstable IPW estimating equations; see the example of estimating the population mean in the context of MAR Y with fully observed X in Kang and Schafer (2007). To overcome this issue in our context of X and Y both missing, we next introduce our CMLE and EE to ensure that the resulting estimators of π_{1,1} are bounded below for fully observed cases.

Constrained maximum likelihood estimation
In the context of MAR Y with fully observed X, the problem of an extremely small estimated P(R_Y = 1|X) (instead of π_{1,1} in our context) was first noted by Kang and Schafer (2007), and it has since been studied by a number of papers, e.g. Robins et al. (2007), Cao et al. (2009), Tan (2010) and Vermeulen and Vansteelandt (2015). The goal in those papers is to estimate μ = E(Y), where parametric models for P(R_Y = 1|X) and E(Y|X) are needed. Their estimators have many desirable properties, including the bounded below property of the estimated P(R_Y = 1|X). To the best of our knowledge, there is no work on nonparametric regression taking into account the issue of an extremely small estimated conditional probability of being a complete case. Yet, the problem may be even worse in our context of X and Y both missing, since the conditional probability of being a complete case is generally smaller. The aforementioned works in the context of MAR Y with fully observed X cannot be directly generalised to estimate π_{1,1} in our context, since they also require a parametric regression model, whereas our goal here is to propose a nonparametric regression estimator. However, a closer look at their various strategies shows that the key equation ensuring the bounded property is

n^{-1} Σ_{i=1}^{n} R_{Y_i} / P(R_{Y_i} = 1 | X_i; α) = 1,   (6)

where P(R_Y = 1|X; α) denotes a parametric model for P(R_Y = 1|X). This equation itself does not depend on any parametric model for E(Y|X). Therefore, our idea is to start from (6) to propose estimators of π_{1,1} that possess the bounded property in our context of X and Y both missing. Recall that for the missing data model, we use a parametric model, following the literature on nonparametric regression with missing data (e.g. Wang et al. 2010).
Intuitively, in the context of MAR Y with fully observed X, (6) requires that the sum of the weights 1/P(R_{Y_i} = 1|X_i; α) over the observed cases (R_{Y_i} = 1) equals the sample size n, so that each weight is bounded by the sample size. In our context of X and Y both MNAR, since we restrict to the subsample I_{r_x,r_y} to estimate α, the effective sample size is |I_{r_x,r_y}|, where |{·}| denotes the cardinality of the set {·}. Also, since the constraint is included in the MLE to ensure the bounded property rather than used as an equation to estimate the parameter, it is sufficient to use an inequality, which is more flexible than an equality, so that the resulting estimates reduce more often to the standard MLE. Finally, recall from (3) that, within the subsample I_{r_x,r_y}, the conditional probability of being a complete case is 1/[1 + exp{g_{r_x,r_y}(X_i, Y_i, U_i; α)}]. Combining these observations, we consider the following constraint for the subsample I_{r_x,r_y}:

|I_{r_x,r_y}|^{-1} Σ_{i ∈ I_{r_x,r_y}} 1{(R_{X_i}, R_{Y_i}) = (1, 1)} [1 + exp{g_{r_x,r_y}(X_i, Y_i, U_i; α)}] ≤ 1 + ξ,   (7)

where ξ > 0 is a user-specified tuning parameter controlling the region of the constraint. The estimator α̂_CMLE is then defined as argmax_α log L_r(α) at (5) subject to the constraint (7); and the CMLE of π_{1,1} is defined as π_{1,1}(x, y, u; α̂_CMLE) = 1/D(x, y, u; α̂_CMLE), after we apply the above CMLE to all of the missing data patterns (r_x, r_y) ≠ (1, 1). Next, we discuss the properties of α̂_CMLE. We denote by α_0 the true value of α to be explicit. We have the following proposition, whose proof is given in Appendix A.

Proposition 3.1: Under standard regularity conditions, with probability one, α̂_CMLE = α̂_MLE for all sufficiently large n.
We see from the above proposition that the constrained MLE is almost surely equivalent to the unconstrained MLE as n → ∞, which is known to be strongly consistent and asymptotically efficient if the model is correctly specified and standard regularity conditions are satisfied; see Appendix B for details.
In terms of finite-sample performance, the constraint (7) ensures that the resulting estimator of π_{1,1} is sample bounded below for fully observed cases. Specifically, we have the following result.

Proposition 3.2: For any fully observed case, i.e. an observation i such that (R_{X_i}, R_{Y_i}) = (1, 1), and for all possible values of the data, we have π_{1,1}(X_i, Y_i, U_i; α̂_CMLE) ≥ {3n(1 + ξ)}^{-1}.
The proof is given in Appendix A. We see from this proposition that, for any given sample and a sufficiently small ξ, π̂_{1,1} = π_{1,1}(·; α̂_CMLE) is not too close to zero for fully observed cases, and thus the final regression estimator of Section 4, which uses the inverse of π_{1,1}(·; α̂_CMLE), is more robust. In practice, we should use a sufficiently small ξ, say 0.01, to obtain stable estimates.
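The CMLE can be computed with any solver that supports smooth inequality constraints. Below is a sketch using SLSQP, where the constraint function encodes (7) in the nonnegative-is-feasible form expected by `scipy.optimize.minimize`; the linear g_r and the helper name are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def fit_pattern_cmle(Z, d, xi=0.01):
    """Constrained MLE: maximise the log-likelihood (5) subject to the
    sample-bound constraint (7), so the implied complete-case weights
    1 + exp(g_r) average to at most (1 + xi) over the subsample I_r.
    Z, d as in a plain pattern-wise logistic fit: d_i = 1 for pattern r,
    d_i = 0 for a complete case; xi is the tuning parameter in (7)."""
    Z1 = np.column_stack([np.ones(len(Z)), Z])
    m = len(Z1)

    def negloglik(a):
        g = Z1 @ a
        return -np.sum(d * g - np.logaddexp(0.0, g))

    def constraint(a):  # SLSQP requires the feasible region to satisfy >= 0
        g = Z1 @ a
        w = (1.0 - d) * (1.0 + np.exp(g))   # inverse-probability weights of complete cases
        return (1.0 + xi) - w.sum() / m     # constraint (7) rearranged

    res = minimize(negloglik, x0=np.zeros(Z1.shape[1]), method="SLSQP",
                   constraints=[{"type": "ineq", "fun": constraint}])
    return res.x
```

When the unconstrained MLE already satisfies (7), the constraint is inactive and the CMLE coincides with the MLE, which is the finite-sample analogue of Proposition 3.1.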

Estimating equations
Now we introduce another technique that possesses the same bounded property, by adapting the EE idea of Vermeulen and Vansteelandt (2015) to our context. Parametric models for E(Y|X) and π_{1,1} are required in Vermeulen and Vansteelandt (2015)'s method in the MAR Y with fully observed X setting, whereas here we remove the need for a parametric model for E(Y|X) by using a parametric model for π_{r_x,r_y}. The key idea is still the generalisation of (6) to the context of missing both X and Y. Specifically, we assume that for k = 1, 2 and 3, g_k has a constant intercept term, i.e. g_k can be written as g_k(x, y, u; α_k) = α_{k,0} + p_k(x, y, u; α_{k,1}), for some function p_k and α_k = (α_{k,0}, α_{k,1}^T)^T. The intercept term α_{k,0} plays the key role in ensuring the bounded property. The choices for the g_k's can be many, among which the simplest and most commonly used are linear functions with constant intercept terms.
Recalling that (6) is the key equation ensuring the bounded property, we try to keep it in our context. Note that partial derivatives are often used in methods of estimating equations, e.g. the score function in MLE, but they do not have to come from maximising some objective function. For the consistency of estimating equations, the key requirement is unbiasedness at the true values of the parameters (e.g. Newey and McFadden 1994). Based on these observations, for each missing data pattern (r_x, r_y) ≠ (1, 1), we consider the subsample I_{r_x,r_y} (defined at (4)) and define α̂_EE as the solution to the following estimating equations:

Σ_{i ∈ I_{r_x,r_y}} [1{(R_{X_i}, R_{Y_i}) = (1, 1)} (1 + exp{g_{r_x,r_y}(X_i, Y_i, U_i; α)}) − 1] ∂g_{r_x,r_y}(X_i, Y_i, U_i; α)/∂α = 0.   (8)

The EE estimator of π_{1,1} is then defined as π_{1,1}(x, y, u; α̂_EE) = 1/D(x, y, u; α̂_EE).
Note that since we assume that g_k has the constant intercept term α_{k,0}, we have ∂g_k(x, y, u; α_k)/∂α_{k,0} = 1, for k = 1, 2 and 3. The corresponding equation from (8) can be written as

Σ_{i ∈ I_{r_x,r_y}} 1{(R_{X_i}, R_{Y_i}) = (1, 1)} [1 + exp{g_{r_x,r_y}(X_i, Y_i, U_i; α)}] = |I_{r_x,r_y}|,

which is the analogue of (6) in our context and the version of (7) with ξ = 0 and '≤' replaced by '='. Thus, following the same argument proving Proposition 3.2, we can also prove the following proposition (see Appendix A for the proof).

Proposition 3.3: For any fully observed case, i.e. an observation i such that (R_{X_i}, R_{Y_i}) = (1, 1), and for all possible values of the data, we have π_{1,1}(X_i, Y_i, U_i; α̂_EE) ≥ (3n)^{-1}.

As to the asymptotic properties of α̂_EE, we can prove that α̂_EE is consistent and asymptotically normal using the theory of the generalised method of moments. However, α̂_EE is not asymptotically efficient; see Appendix B for details. It is possible to derive an asymptotically efficient EE estimator, but then the bounded property no longer holds. On the other hand, α̂_MLE is asymptotically efficient under standard regularity conditions, and so is α̂_CMLE because of Proposition 3.1. Therefore, from the asymptotic efficiency point of view, α̂_EE may be less preferred than α̂_MLE and α̂_CMLE. However, α̂_CMLE and α̂_EE provide more robust estimators than α̂_MLE because of Propositions 3.2 and 3.3. We will see more specific numerical comparisons in Section 5.
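Since (8) is a standard set of smooth estimating equations, a general-purpose root finder suffices in practice. A sketch for a linear g_r with intercept (the helper name and setup are our own; `scipy.optimize.root` stands in for any Newton-type solver):

```python
import numpy as np
from scipy.optimize import root

def fit_pattern_ee(Z, d):
    """Solve the estimating equations (8) for a linear g_r with intercept:
    sum_i [ 1{complete_i} (1 + exp(g_i)) - 1 ] * dg/dalpha = 0 over I_r,
    where dg/dalpha = (1, z_i) for a linear g_r.
    Z, d as before: d_i = 1 for pattern r, d_i = 0 for a complete case."""
    Z1 = np.column_stack([np.ones(len(Z)), Z])

    def ee(a):
        g = Z1 @ a
        resid = (1.0 - d) * (1.0 + np.exp(g)) - 1.0  # unbiased at the truth
        return Z1.T @ resid

    sol = root(ee, x0=np.zeros(Z1.shape[1]), method="hybr")
    return sol.x
```

The intercept component of `ee` is exactly the version of (7) with ξ = 0 and equality, so the fitted weights of complete cases average to one over I_r by construction.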

IPW local polynomial estimation
Now that we have presented the missing data model together with several estimation strategies, we can apply them to estimate m(x) = E(Y|X = x) nonparametrically. In the standard context without missing data, a popular nonparametric technique to estimate m is the local polynomial estimator of order ℓ (Fan and Gijbels 1996). Specifically, for a given point x, the estimator m̂(x) = β̂_0 = β̂_0(x) is the solution to the following estimating equations:

Σ_{i=1}^{n} K_h(X_i − x) G(X_i − x) {Y_i − G(X_i − x)^T β} = 0,

where K is the kernel function, h > 0 is the bandwidth, K_h(x) = h^{-1} K(x/h), and G(x) = (1, x, . . ., x^ℓ)^T. In our case, where both X and Y are subject to missingness, we define our estimator m̂(x) = β̂_0 as the solution to the following IPW estimating equations:

Σ_{i=1}^{n} [R_{X_i} R_{Y_i} / π_{1,1}(X_i, Y_i, U_i; α̂)] K_h(X_i − x) G(X_i − x) {Y_i − G(X_i − x)^T β} = 0,   (9)

where throughout α̂ denotes one of α̂_MLE, α̂_CMLE and α̂_EE given in Section 3. It has a unique solution of closed form

m̂(x) = e_1^T Ŝ^{-1} T̂,   (10)

where e_1^T = (1, 0, . . ., 0), Ŝ = (Ŝ_{k,k'})_{0≤k,k'≤ℓ} and T̂ = (T̂_0, . . ., T̂_ℓ)^T, with

Ŝ_{k,k'} = n^{-1} Σ_{i=1}^{n} [R_{X_i} R_{Y_i} / π_{1,1}(X_i, Y_i, U_i; α̂)] K_h(X_i − x) (X_i − x)^{k+k'},
T̂_k = n^{-1} Σ_{i=1}^{n} [R_{X_i} R_{Y_i} / π_{1,1}(X_i, Y_i, U_i; α̂)] K_h(X_i − x) (X_i − x)^k Y_i.

Since the variables are subject to missingness, only fully observed cases (R_{X_i} = R_{Y_i} = 1) contribute to the estimating equations (9), and the selection bias is adjusted by the inverse of the conditional probability of being a complete case. It is well known that the kernel function K is not crucial to the local polynomial estimator, and we will use the standard Gaussian density function as K in the numerical studies. On the other hand, the bandwidth h is much more important. Before we introduce a data-driven bandwidth selection in Section 4.2, we prove the following asymptotic normality result for m̂, which will be helpful for choosing the data-driven bandwidth.
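For ℓ = 1, the closed form (10) amounts to a weighted least squares fit in a local window. A minimal sketch with a Gaussian kernel, as used in our numerical studies; the function name is ours, and the weights w_i = R_{X_i} R_{Y_i} / π̂_{1,1,i} are assumed to be precomputed from one of the estimators of Section 3 (zero for incomplete cases):

```python
import numpy as np

def ipw_local_linear(x0, X, Y, w, h):
    """IPW local linear estimate of m(x0), i.e. (10) with order ell = 1.
    X, Y : data arrays; w : IPW weights, zero for incomplete cases;
    h : bandwidth; Gaussian kernel."""
    obs = w > 0                       # only complete cases contribute to (9)
    Xo, Yo, wo = X[obs], Y[obs], w[obs]
    u = (Xo - x0) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi) / h   # K_h(X_i - x0)
    G = np.column_stack([np.ones_like(Xo), Xo - x0])     # G(X_i - x0), ell = 1
    WK = wo * K
    S = G.T @ (WK[:, None] * G)       # S-hat (up to the 1/n factor, which cancels)
    T = G.T @ (WK * Yo)               # T-hat
    return np.linalg.solve(S, T)[0]   # e_1^T S^{-1} T
```

Note that the common factor n^{-1} in Ŝ and T̂ cancels in e_1^T Ŝ^{-1} T̂, so it is omitted; a local linear fit reproduces any exactly linear m regardless of the weights, which is a convenient sanity check.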
We introduce the following notation: μ_j = ∫ s^j K(s) ds, ν_j = ∫ s^j K²(s) ds; S, S̄ and S* are (ℓ+1) × (ℓ+1) matrices whose (k+1, k'+1) elements are μ_{k+k'}, ν_{k+k'} and μ_{k+k'+1}, respectively; and σ²(x) = E[{Y − m(x)}² / π_{1,1}(X, Y, U; α_0) | X = x].

Theorem 4.1: Under the regularity conditions stated in Appendix C, for an interior point x of the support of f_X with f_X(x) > 0, as n → ∞, h → 0 and nh → ∞,

{m̂(x) − m(x) − B(x)} / V(x)^{1/2} → N(0, 1) in distribution,

where B(x) = e_1^T S^{-1} (μ_{ℓ+1}, . . ., μ_{2ℓ+1})^T m^{(ℓ+1)}(x) h^{ℓ+1} / (ℓ+1)! {1 + o(1)} and V(x) = e_1^T S^{-1} S̄ S^{-1} e_1 σ²(x) / {nh f_X(x)} {1 + o(1)}.

The proof is given in Appendix C. It can be seen that B and V are of the same order as in the case without missing data; see e.g. Fan and Gijbels (1996) and Delaigle, Fan, and Carroll (2009). More specifically, B is the same as without missing data, and the effect of missingness is only on V, in which π_{1,1} appears in the denominator (through σ²). As shown in Appendix B, the √n-consistency of α̂ with α̂ = α̂_MLE, α̂_CMLE or α̂_EE can be proved using standard theory (Fahrmeir and Kaufmann 1985; Newey and McFadden 1994).
Since only fully observed cases contribute to the estimating equations (9), the information in partly observed cases is not used in the regression (though it is used in estimating π_{r_x,r_y}). This leads to a loss of efficiency. To address this issue in simpler contexts, e.g. where only the outcome variable is subject to missingness, a large number of papers proposed various kinds of augmented inverse probability weighting (AIPW) methods; see e.g. Rotnitzky and Vansteelandt (2014) for a review. Many AIPW methods possess the so-called doubly robust, or even multiply robust, property. However, in the nonmonotone missing data context, developing AIPW estimation is much more difficult. Tchetgen et al. (2018) proposed to combine the multinomial logistic model and pattern-mixture estimation (Little 1993) to develop AIPW estimation possessing the doubly robust property. However, this strategy may not be suitable for nonparametric regression in our context, as we argue next.
First and foremost, it is precisely because of the uncertainty about the joint distribution of the variables of interest (here X and Y) that we adopt a nonparametric method to model E(Y|X); but the AIPW method proposed by Tchetgen et al. (2018) requires a parametric model for the conditional distribution of (X, Y, U | R_X = R_Y = 1), which can be a difficult task with a high-dimensional U. If it is misspecified (which is often the case in practice), the gain in efficiency from using a parametric model for E(Y|X) is not guaranteed, which means the resulting estimator may be worse than our IPW estimator. Moreover, even if we assume a parametric model for (X, Y, U | R_X = R_Y = 1) and construct the AIPW estimating equations, they include complicated conditional expectations (note that, unlike parametric models, we have the kernel function K in the estimating equations) which cannot be computed analytically and pose computational challenges. Therefore, we do not develop an AIPW method here.
Our method leaves the distribution of primary interest, that of (X, Y), completely free, which is more flexible than general AIPW methods; and the bounded property of α̂_CMLE and α̂_EE stated in Propositions 3.2 and 3.3 stabilises the performance of the IPW estimators of m. We can expect our regression estimators using α̂_CMLE and α̂_EE to perform reasonably well as long as the number of fully observed cases is not too small.

Bandwidth selection
To select the bandwidth h in our estimator of m, we adapt the plug-in (PI) ideas of Ruppert, Sheather, and Wand (1995), Ruppert, Wand, Holst, and Hössjer (1997) and Fan and Yao (1998) to our context. Let 1_A be the indicator function of a set A and [a, b] be the interval on which we intend to estimate m(x) = E(Y|X = x). With a slight abuse of terminology, we refer to B(x) and V(x) in Theorem 4.1 as the bias and the variance of m̂(x), and define the weighted asymptotic mean integrated squared error as AMISE = ∫ {B²(x) + V(x)} w(x) dx, where w(x) = f_X(x) 1_{[a,b]}(x). We discuss how to choose the PI bandwidth ĥ_PI for an IPW local linear estimator of m, i.e. ℓ = 1, since it is the one most frequently used in practice.
When ℓ = 1, AMISE can be written as

AMISE = (μ_2² θ_2 / 4) h⁴ + ν_0 θ / (nh),

where θ_2 = ∫_a^b {m^{(2)}(x)}² f_X(x) dx and θ = ∫_a^b σ²(x) dx; recall from the notation above Theorem 4.1 that μ_2 = ∫ x² K(x) dx and ν_0 = ∫ K²(x) dx. We define the optimal bandwidth h_opt as the argmin of AMISE. This gives

h_opt = {ν_0 θ / (n μ_2² θ_2)}^{1/5}.   (11)

This optimal bandwidth h_opt cannot be used directly since it involves the unknown θ_2 and θ. Next, we discuss how to estimate θ_2 and θ.
We estimate θ_2 by

θ̂_2 = n^{-1} Σ_{i=1}^{n} [R_{X_i} R_{Y_i} / π_{1,1}(X_i, Y_i, U_i; α̂)] 1_{[a,b]}(X_i) {m̂^{(2)}(X_i)}²,

where m̂^{(2)} is an estimator of m^{(2)}. The local polynomial estimator can also be used to estimate the derivatives of m. Specifically, following the literature, we define m̂^{(2)}(x) = 2 e_3^T Ŝ^{-1} T̂, where e_3^T = (0, 0, 1, 0) and ℓ = 3, i.e. a local cubic estimator of m^{(2)} computed with a bandwidth h_2 instead of h. This introduces another pilot bandwidth h_2 to be chosen. We could adapt the standard idea for choosing h_2 in the literature to our context by adding the weights to the estimation procedure. However, due to the inverse of π̂_{1,1} (even when using α̂_CMLE or α̂_EE), h_2 may be poorly chosen. Therefore, we suggest choosing h_2 by leave-one-out cross validation (CV) instead. Consider the following cross validation loss:

L_CV(h) = Σ_{i=1}^{n} [R_{X_i} R_{Y_i} / π_{1,1}(X_i, Y_i, U_i; α̂)] {Y_i − m̂_{(−i)}(X_i)}²,

where m̂_{(−i)} denotes the version of m̂ computed without using the ith individual's data. We define the CV bandwidth ĥ_2 as the argmin of L_CV(h) with m̂ = m̂_cub, where m̂_cub is a local cubic estimator. The loss L_CV with m̂_cub approximates the mean squared error of m̂_cub, which is of order h⁸ + n^{-1}h^{-1}. Thus the CV bandwidth ĥ_2 is of order n^{-1/9}. From Ruppert et al. (1995), a bandwidth of order n^{-1/9} is not the mean integrated squared error optimal order for estimating m^{(2)}; therefore, using h_2 ≍ n^{-1/9} is still a consistent, although not optimal, choice. Now we discuss how to estimate θ = ∫_a^b σ²(x) dx. Note that under MNAR, σ²(x) necessarily depends on x. Let the pilot local linear estimator of m be m̂_lin(x) = e_1^T Ŝ^{-1} T̂ with ℓ = 1 and a bandwidth h_lin instead of h. In the case of no missing data, Fan and Yao (1998) proposed to estimate the conditional variance by a local linear regression of the squared residuals {Y_i − m̂_lin(X_i)}² on the X_i's. Adapting the idea of Fan and Yao (1998) to our context, we define the estimator of σ²(x) as σ̂²(x) = e_1^T Ŝ^{-1} T̂_σ using a bandwidth h_σ, where ℓ = 1 and T̂_σ denotes T̂ with Y_i replaced by {Y_i − m̂_lin(X_i)}² / π_{1,1}(X_i, Y_i, U_i; α̂); we then set θ̂ = ∫_a^b σ̂²(x) dx. We choose h_lin and h_σ by CV, as for ĥ_2, with the corresponding estimators in the loss L_CV. Plugging θ̂_2 and θ̂ into (11), we obtain our plug-in bandwidth ĥ_PI = {ν_0 θ̂ / (n μ_2² θ̂_2)}^{1/5}.
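The leave-one-out CV step for the pilot bandwidths can be sketched generically. The grid, the helper name `cv_bandwidth` and the plug-in `fit` callback (any pointwise estimator of m, e.g. a local cubic fit when choosing h_2) are our own illustrative choices:

```python
import numpy as np

def cv_bandwidth(X, Y, w, grid, fit):
    """Leave-one-out CV bandwidth over a grid, using the IPW-weighted squared
    prediction error on the complete cases (the loss L_CV above).
    fit(x0, Xs, Ys, ws, h) is any pointwise estimator of m(x0)."""
    obs = np.flatnonzero(w > 0)           # only complete cases enter the loss
    best_h, best_loss = None, np.inf
    for h in grid:
        loss = 0.0
        for i in obs:
            mask = np.ones(len(X), bool)
            mask[i] = False               # leave the i-th case out
            pred = fit(X[i], X[mask], Y[mask], w[mask], h)
            loss += w[i] * (Y[i] - pred) ** 2
        if loss < best_loss:
            best_h, best_loss = h, loss
    return best_h
```

In practice one would pass the IPW local cubic estimator as `fit` to obtain ĥ_2, and the analogous pilot estimators to obtain ĥ_lin and ĥ_σ.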

Simulation study
In this section, we study the numerical performance of our proposed estimator m̂ of order ℓ = 1. Let m̂_MLE, m̂_CMLE and m̂_EE denote m̂ given in (10) with α̂ equal to α̂_MLE, α̂_CMLE and α̂_EE, respectively. We use ξ = 0.01 in the constraint (7). To show the influence of missing data, we further consider the following two estimators: m̂_ora, the oracle local linear estimator of m using all of the data (X_i, Y_i), i = 1, . . ., n, with no missing data; and m̂_nai, the naive local linear estimator using only the fully observed cases (R_{X_i} = R_{Y_i} = 1) without any correction for missingness. The bandwidth selection for m̂_MLE, m̂_CMLE and m̂_EE is described in Section 4.2; m̂_nai's bandwidth is selected following Section 4.2 with π_{1,1} ≡ 1; the bandwidth selection for m̂_ora follows the procedure of Section 4.2 without missingness and with σ² replaced by the homoscedastic estimator proposed by Ruppert et al. (1995). The kernel function K used in all estimators is the standard Gaussian density.
Our nonparametric regression estimators are able to capture nonlinear relationships in nonuniform sampling settings. Therefore, we generate data from a few such models. The noise level is chosen to be reasonably challenging, following those used in the literature. Specifically, we generate (X, Y) from three models, (i), (ii) and (iii), where ε ∼ N(0, 0.5²) and φ_{μ,σ}(x) denotes the density function of N(μ, σ²). The missing data model for the three models is chosen as a standard logistic linear model like the one in Tchetgen et al. (2018), with U ∼ N(0, 1). About 43%, 55% and 54% of individuals have complete data, i.e. (R_X = 1, R_Y = 1), in models (i), (ii) and (iii), respectively. We choose to estimate the regression curve on [a, b] = [−1, 1], which covers approximately 98%, 95% and 88% of (X | R_X = 1, R_Y = 1) in models (i), (ii) and (iii), respectively. As to the missing data model, we consider both of the following cases: (1) π_{r_x,r_y} is correctly specified, i.e. assuming g_1(u; α) = α_{10} + α_{11}u, g_2(x, u; α) = α_{20} + α_{21}x + α_{22}u and g_3(y, u; α) = α_{30} + α_{31}y + α_{32}u; (2) π_{r_x,r_y} is misspecified, i.e.
assuming g_1(u; α) = α_{10} + α_{11}u², g_2(x, u; α) = α_{20} + α_{21}x² + α_{22}u² and g_3(y, u; α) = α_{30} + α_{31}y² + α_{32}u². Further, to show that α̂_CMLE and α̂_EE are resistant to highly variable missing data mechanisms (Propositions 3.2 and 3.3) when using the linear g_k's, inspired by Kang and Schafer (2007), we also generate data from: (iv) X and Y follow model (i) with a highly variable missing data mechanism, while we still assume that the g_k's are linear functions when estimating the parameters. This may happen in practice, since the true missing data mechanism is rarely known and possibly highly variable, while practitioners often use linear functions to fit the model by default. It is unlikely that any practitioner could correctly specify this highly variable missing data model. The average percentage of fully observed cases in model (iv) is around 37%, and we again estimate the regression curve on the same interval. Table 1 summarises the mean and standard deviation of the integrated squared errors (ISEs) for all estimators in all models. From the table, we can see that when π_{r_x,r_y} is correctly specified, the differences between m̂_CMLE, m̂_MLE and m̂_EE are small, and they all perform better than the naive estimator m̂_nai, which supports our methodology. On the other hand, when π_{r_x,r_y} is misspecified, m̂_CMLE and m̂_EE outperform m̂_MLE, which performs even worse than the naive estimator m̂_nai in some cases. That is, the bias from using a misspecified missing data model in the unconstrained MLE is often larger than the bias from ignoring the missing data. The results indicate that, with the help of the sample-bounded property (Propositions 3.2 and 3.3), m̂_CMLE and m̂_EE are more robust than m̂_MLE to missing data model misspecification (models (i), (ii) and (iii)), and m̂_CMLE and m̂_EE are more stable than m̂_MLE under a highly variable missing data mechanism fitted with linear g_k's (model (iv)). We depict some figures for further illustration. In Figures 1 and 2, we depict what we call quartile estimated curves, corresponding to the 50th, 100th and 150th smallest
ISEs computed from the 200 samples. Both figures show that m̂_nai tends to underestimate m where m is large. The three IPW-type estimators m̂_MLE, m̂_CMLE and m̂_EE perform similarly in model (i) with π_{r_x,r_y} correctly specified, as shown in Figure 1, but m̂_MLE is more variable than m̂_CMLE and m̂_EE in model (ii) with π_{r_x,r_y} misspecified, as shown in Figure 2. Figure 3 shows the point-wise mean curves of the four estimators computed from the 200 samples in models (i) and (iv). It can be seen clearly that the naive estimator m̂_nai always underestimates m; the three IPW-type estimators try to correct the bias, but m̂_MLE in particular overestimates m when the highly variable π_{r_x,r_y} is misspecified, because π_{1,1}^{-1}(X_i, Y_i; α̂_MLE) is sometimes extremely large for some fully observed cases.
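For concreteness, the pattern-drawing step of such a simulation can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes a multinomial logistic parameterisation with the complete-case pattern as baseline (the paper's exact display may differ), and the coefficient vectors, the mapping of g_k's to patterns, and the function name are hypothetical.

```python
import numpy as np

def draw_patterns(x, y, u, alpha, rng):
    """Draw missingness patterns from a multinomial logistic model with the
    complete-case pattern as baseline (an assumed parameterisation):
        P(pattern k) = exp(g_k) / (1 + sum_j exp(g_j)),  k = 1, 2, 3,
        P(complete)  = 1 / (1 + sum_j exp(g_j)).
    The g_k's are the linear predictors used in case (1) of the simulation;
    the labelling of the three incomplete patterns here is illustrative.
    """
    a1, a2, a3 = alpha                        # hypothetical coefficient vectors
    g1 = a1[0] + a1[1] * u                    # g_1(u; alpha)
    g2 = a2[0] + a2[1] * x + a2[2] * u        # g_2(x, u; alpha)
    g3 = a3[0] + a3[1] * y + a3[2] * u        # g_3(y, u; alpha)
    e = np.column_stack([np.exp(g1), np.exp(g2), np.exp(g3)])
    denom = 1.0 + e.sum(axis=1)
    # columns: complete case, then the three incomplete patterns
    probs = np.column_stack([1.0 / denom, e / denom[:, None]])
    cum = probs.cumsum(axis=1)
    # inverse-CDF draw of a categorical variable per observation
    return (rng.uniform(size=len(x))[:, None] < cum).argmax(axis=1)
```

With moderate coefficient values this yields complete-case proportions broadly comparable to the 43%–55% reported for models (i)–(iii).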

Real data illustration
To illustrate our method in practice, we apply m̂_MLE, m̂_CMLE, m̂_EE and m̂_nai to a dataset from the National Health and Nutrition Examination Survey (NHANES), a large programme of health studies in the United States. We focus on a subset of the examination data in 2013–2014, which is available at the NHANES website. We are interested in modelling the relationship between X = body mass index (BMI, in kg/m²) and Y = systolic blood pressure (first reading, in mm Hg). BMI is an indicator of obesity, while blood pressure is an indicator of heart health; it is known that they are positively correlated and may be causally associated. The sample size is n = 9813, of which 7104 (72.39%) individuals have complete data and 689 (7.02%) individuals have both X and Y missing; overall, 7.72% of X and 26.91% of Y are missing. Let U = (U_1, U_2) = (gender (1: male, 2: female), age (in years)) be the auxiliary variables, which are recorded for all individuals. We first fit a standard multinomial logistic model for the missing data mechanism, i.e. assuming g_1(u; α) = α_{10} + α_{11}u_1 + α_{12}u_2, g_2(x, u; α) = α_{20} + α_{21}x + α_{22}u_1 + α_{23}u_2 and g_3(y, u; α) = α_{30} + α_{31}y + α_{32}u_1 + α_{33}u_2, to produce α̂_MLE, α̂_CMLE and α̂_EE. As in Section 5, we use ξ = 0.01 in the constraint (7). The estimated coefficients of X, Y and U under the three estimation methods are summarised in Table 2. We then apply the four regression estimators of Section 5 to the data on [q̂_{1,α}, q̂_{2,α}], where q̂_{1,α} = 15.7 and q̂_{2,α} = 44.5 are the 2.5% and 97.5% empirical quantiles of X | R_X = 1, R_Y = 1, respectively. The estimated curves are depicted in Figure 4.
From Table 2 we see that the auxiliary variable U_2 = age plays a significant role in explaining the missing data mechanism for this dataset. Although α̂_{12} from EE is much smaller than the other two, α̂_{10} (not given in the table) from EE is also much smaller, so the final estimators remain similar. We see from Figure 4 that X = BMI and Y = systolic blood pressure are indeed nonlinearly and positively associated. The three IPW estimators m̂_MLE, m̂_CMLE and m̂_EE give very similar results in Figure 4, while m̂_nai is slightly larger than the other three when X is small. This indicates that the estimated missing data model is quite stable; because the percentage of missing data is quite low, the difference between the IPW estimators and the naive estimator is not large.

Discussion
In the presence of missingness in both the covariate and the outcome, we develop an IPW local polynomial estimator and prove its asymptotic normality under the MNAR assumption. The multinomial logistic model with the CCMV restriction is used to model the missing data mechanism. To deal with the possibly overly large weights used in IPW estimation, we propose CMLE and EE estimation procedures that ensure the weights are bounded above by the sample size. Simulation results show that CMLE and EE perform better than the standard MLE when the missing data model is misspecified or highly variable.
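To make the structure of the estimator concrete, a minimal Python sketch of an IPW local linear fit with bounded weights follows. This is an illustrative sketch, not the paper's procedure: here the bound is imposed by naive clipping of 1/π̂_{1,1} at n, whereas the CMLE and EE procedures build the bound into the estimation of α; the Gaussian kernel and all function and argument names are our own choices.

```python
import numpy as np

def ipw_local_linear(x0, X, Y, pi11, h):
    """IPW local linear estimate of m(x0) from the complete cases.

    X, Y  : covariate and outcome for the fully observed cases only
    pi11  : estimated complete-case probabilities for those cases
    h     : bandwidth of the Gaussian kernel
    Inverse weights are clipped at n so that no single case dominates
    (a crude stand-in for the sample-bounded property that the CMLE/EE
    procedures guarantee by construction).
    """
    n = len(X)
    w_ipw = np.minimum(1.0 / pi11, n)        # bounded inverse weights
    k = np.exp(-0.5 * ((X - x0) / h) ** 2)   # Gaussian kernel weights
    w = w_ipw * k
    # Weighted least squares of Y on (1, X - x0); intercept = m_hat(x0)
    Z = np.column_stack([np.ones(n), X - x0])
    beta = np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (w * Y))
    return beta[0]
```

With a constant π̂_{1,1} the weights cancel and the sketch reduces to an ordinary local linear fit, which is a convenient sanity check.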
The CCMV restriction is used to identify the missing data mechanism, but it is not always plausible. Following the idea of Chen (2020), we can conduct a sensitivity analysis to assess the effect when the missing data mechanism deviates from the CCMV restriction. Specifically, we perturb Equation (3) to π_{r_x,r_y}(x, y, u) = expit{g_{r_x,r_y}(x, y, u; α)} exp(δᵀr̄), where δ is a given vector controlling the amount of perturbation and r̄ is the missing part of (x, y) in the pattern (r_x, r_y). The estimation of α remains the same as in Section 3. Other extensions are possible. For example, it is straightforward to extend our methodology to a multivariate covariate, although the performance then suffers from the curse of dimensionality and a larger number of missing data patterns. Also, the function π_{1,1}, known as the propensity score, has wide applications in survey sampling and causal inference, so it is of interest to carry the techniques for producing a stable estimate of π_{1,1} over to those contexts, where they can be helpful in certain circumstances.
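The perturbation amounts to an exponential tilting of the pattern probability, with δ = 0 recovering the unperturbed model. A one-line sketch (function name and arguments are hypothetical):

```python
import numpy as np

def perturbed_prob(g_val, delta, r_bar):
    """Sensitivity-analysis tilt of a pattern probability:
    expit(g) * exp(delta^T r_bar), where r_bar collects the missing
    components of (x, y) for the pattern and delta controls the departure
    from the CCMV restriction; delta = 0 gives back expit(g). Note the
    tilted value can exceed 1 for large delta, so in a sensitivity
    analysis delta is kept small.
    """
    expit = 1.0 / (1.0 + np.exp(-g_val))
    return expit * np.exp(np.dot(delta, r_bar))
```

One would recompute the IPW estimator over a grid of δ values and inspect how the fitted curve moves.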
We estimate the regression curve on [a, b] = [−1, 1] and consider sample sizes n = 300 and n = 1000, generating 200 random samples in each setting. A vector of zeros is used as the initial point in computing α̂_MLE, α̂_CMLE and α̂_EE. The results are evaluated via the integrated squared error ISE = ∫_a^b {m̂(x) − m(x)}² dx, where m̂ denotes any estimator of m.
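In practice the ISE integral is approximated on a grid; a small sketch using the trapezoidal rule (the grid size is our choice, not taken from the paper):

```python
import numpy as np

def ise(m_hat, m_true, a=-1.0, b=1.0, ngrid=201):
    """Integrated squared error ISE = int_a^b {m_hat(x) - m(x)}^2 dx,
    approximated by the trapezoidal rule on an equally spaced grid."""
    grid = np.linspace(a, b, ngrid)
    sq = (m_hat(grid) - m_true(grid)) ** 2
    dx = grid[1] - grid[0]
    # trapezoidal rule on a uniform grid: full weight interior, half at ends
    return dx * (sq.sum() - 0.5 * (sq[0] + sq[-1]))
```

For instance, an estimator with a constant bias of 0.1 on [−1, 1] has ISE = 0.01 × 2 = 0.02.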

Table 1 .
Simulation results for all estimators in all models with π_{r_x,r_y} correctly specified (CS) or misspecified (MS).
Note: The numbers shown are 10 2 × mean (standard deviation) of the ISEs computed from 200 samples.

Table 2 .
The numbers shown are estimated coefficients using different estimation methods, with 95% confidence intervals obtained by bootstrap in brackets.