Nonparametric regression with responses missing at random and the scale depending on auxiliary covariates

We consider nonparametric regression with missing at random (MAR) responses, a univariate regression component of interest, and a scale function that depends on both the predictor and auxiliary covariates. The asymptotic theory shows that both heteroscedasticity and the MAR mechanism affect the sharp constant of the minimax mean integrated squared error (MISE) convergence. Our sharp minimax procedure is based on estimation of the unknown nuisance scale function, design density and availability likelihood. The estimator is adaptive to the missing mechanism and to the unknown smoothness of the estimated regression function. Simulation studies and real examples confirm the practical feasibility of the proposed method for this complex regression setting.


Introduction
Consider a pair of random variables (X, Y). The association between response Y and predictor X is often studied by a regression model where m(x) := E{Y | X = x} is the regression function and σ(x) is called the scale or volatility. In nonparametric regression estimation, almost no shape assumption is imposed on the function m(x). There is a vast nonparametric literature devoted to regression using various approaches such as kernel, local polynomial, spline, and wavelet; see Eubank (1999), Wahba (1990), Fan and Gijbels (1996), Efromovich (1999), Györfi, Kohler, Krzyżak, and Walk (2002), Härdle, Werwatz, Müller, and Sperlich (2004), Wasserman (2006) and Tsybakov (2009). Nonparametric research concerns asymptotic properties over some functional class. A popular choice is the Sobolev class (3), since Pinsker (1980) discovered the exact lower bound of the minimax risk and inspired the search for sharp adaptive procedures. An exciting result is that efficient nonparametric regression estimation is possible even without estimating the scale function. Efromovich and Pinsker (1996) proposed such an estimator that achieves optimal convergence under the mean integrated squared error (MISE) criterion, MISE(m̂; m) = E{∫_0^1 [m̂(x) − m(x)]² dx}, where m̂(x) is an estimator of m(x). This also implies that the nonparametric regression problem is not sensitive to heteroscedasticity. However, the situation changes dramatically when volatility is also driven by some process besides the predictor of interest, so that the scale involves a D-dimensional random vector Z := (Z_1, Z_2, . . . , Z_D) of auxiliary covariates. An estimator ignoring this complicated heteroscedasticity may remain rate-optimal but will no longer be sharp minimax due to an inflated convergence constant. Efromovich (2013) solved the sharp adaptive estimation problem under this complicated regression setting by introducing scale-function weighting in Fourier series estimates.
In this paper, we consider the above model (2) in a more general setting where responses cannot be observed directly. Missing data is a ubiquitous and challenging complication accompanying the development of statistical analysis. Conventional approaches include likelihood-based parametric inference and weighting-based semiparametric methods; see Tsiatis (2006), Enders (2010), and Molenberghs, Fitzmaurice, Kenward, Tsiatis, and Verbeke (2014). Under the missing at random (MAR) mechanism, the joint density of an observation can be written as f_{X,Z,Y,A}(x, z, y, a) = [w(x, z)]^a [1 − w(x, z)]^{1−a} f_{X,Z,Y}(x, z, y) for (x, z) ∈ [0, 1]^{D+1}, y ∈ R and a ∈ {0, 1}, where A is the availability indicator and w(x, z) is the availability likelihood. We can see that the conditional distribution of an observation in complete cases is biased by the MAR mechanism due to the factor multiplying the underlying distribution of the data (the biasing function), which involves the covariates. Fortunately, this missing mechanism does not affect regression estimation because the conditional distribution of the response is the same as the underlying one in the H-sample (the hidden, complete sample from which the M-sample of partially missing observations is generated). For the univariate setting (1), Efromovich (2011) proposed efficient complete-case nonparametric estimation that attains the corresponding lower bound of the minimax risk. It is therefore promising to establish a sharp estimation theory for our setting. The rest of the paper is organised as follows. In Section 2, an asymptotic lower bound for the minimax MISE risk convergence is established. Section 3 is devoted to asymptotically efficient estimation procedures along a hierarchy of adaptation. In Section 4, we test small-sample performance of the proposed estimator in several numerical simulations as well as real data examples.
Let us briefly discuss some terminology of the minimax approach used throughout the paper. The asymptotic analysis can be explained by a minimax game with four players (nature, oracle, dealer and statistician). The rules are defined by the model assumptions and the payoff criterion (MISE). The dealer sets the nuisance functions and the parameters of the class of regression functions and then shows them to nature, like cards dealt in a poker game. Nature picks a regression function from the dealt class to generate an H-sample and modifies it into an M-sample for all participants. The chosen regression function maximises nature's payoff (MISE), while the other players propose estimators based on the M-sample to minimise it. The actions of nature and the dealer are based on knowledge of the opponent's possible behaviour, which leads to an equilibrium of the minimax risk. Although the dealer's estimators are pseudo-estimators, owing to knowledge of the nuisance functions and the parameters of the considered class, this minimax risk is a reasonable benchmark for the statistician's estimators. The dealer's optimal strategy can be quantified in a game with the oracle, who knows the underlying estimand in addition to everything the dealer knows. The dealer proposes a lower bound guess for the oracle's estimator using as little privileged information as possible. If the oracle cannot achieve the target with the reduced information, the dealer raises the guess. For example, a zero guess is impossible except for knowing the underlying regression itself. When the oracle's information is reduced to the dealer's set, we obtain an efficient dealer's estimator. If it is difficult to adapt to, we can pick a (more informative) oracle's estimator with the same risk whose performance is easy for the statistician's estimator to mimic. Interested readers are referred to Berger (1985) and Lehmann and Casella (1998) for more details about statistical decision theory.

Lower bound for minimax risk
The minimax risk depends on the smoothness of the regression function, and the Sobolev class (ellipsoid) is usually considered in the nonparametric literature, where θ_j := ∫_0^1 m(x)ϕ_j(x) dx is the Fourier coefficient with respect to the jth cosine basis function on the unit interval [0, 1], ϕ_0(x) = 1, ϕ_j(x) = 2^{1/2} cos(πjx), j = 1, 2, 3, . . . The number Q controls the power or energy of member functions and the parameter α determines their smoothness. The case α ≥ 1 is considered in this paper, and α > 0.5 implies absolute summability of the θ_j's. It may be appropriate to study the estimation problem in the vicinity of a particular pivot function instead of seeking a global solution, which is the local minimax approach introduced by Golubev (1991). Moreover, a shrinking class minimax solution corresponds to a sequence of local classes that converges to the pivot as we get more observations. Here, we can introduce a more general family covering all the above-discussed classes. Given a lower frequency cutoff M_n and a tail bound ρ_n such that 0 ≤ M_n ≤ n^{1/(2α+1)}/ln²(n) and ρ_n > n^{−1/(2α+1)} ln(n), define the class F(m_0, ρ_n, M_n, α, Q), where θ_{0,j} := ∫_0^1 m_0(u)ϕ_j(u) du is the jth Fourier coefficient of the pivot regression m_0, which satisfies squared integrability and has finite uniform norm sup_{x∈[0,1]} |m_0(x)| < ∞. We use I(·) for the indicator function throughout the paper.
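To make the cosine-basis notation concrete, the following small numerical sketch computes Fourier coefficients θ_j = ∫_0^1 m(x)ϕ_j(x) dx for a hypothetical pivot function; the choice of pivot and grid size are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def phi(j, x):
    """Cosine basis on [0, 1]: phi_0 = 1, phi_j = sqrt(2) cos(pi j x)."""
    x = np.asarray(x, dtype=float)
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * x)

def fourier_coeffs(m, J, ngrid=10_000):
    """Approximate theta_j = int_0^1 m(x) phi_j(x) dx by the midpoint rule."""
    x = (np.arange(ngrid) + 0.5) / ngrid
    mx = m(x)
    return np.array([np.mean(mx * phi(j, x)) for j in range(J + 1)])

# Hypothetical pivot m0(x) = 1 + cos(pi x): exact theta_0 = 1, theta_1 = 1/sqrt(2)
theta = fourier_coeffs(lambda x: 1.0 + np.cos(np.pi * x), J=5)
```

Because the basis is orthonormal, only the first two coefficients are nonzero here, and the higher-frequency ones vanish up to quadrature error.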
Let us make some comments about this family of functional classes. Its member functions share the low-frequency part of the pivot m_0. The low-frequency cutoff M_n also controls the L_2 norm since the tail power is at most of order M_n^{−1}. Note that M_n is at most of a smaller order than the classic number of frequencies for rate-optimal estimation, n^{1/(2α+1)}. A shrinking class corresponds to ρ_n = o_n(1), where the underlying regression is uniformly close to the given pivot in a contracting L_∞([0, 1]) ball. Alternatively, local shrinking can be modelled by a perturbation approach, m(x) = m_0(x) + Σ_{j>0} θ_j ϕ_j(x). However, this is more restrictive than our formulation since it requires the same smoothness of the pivot m_0. Clearly, the classic Sobolev class E(α, Q) corresponds to the case F(0, ∞, 0, α, Q) without a pivot function and additional tail constraints. Now we specify assumptions for our regression model (2). To simplify the analysis of the dealer's lower bound for the minimax MISE risk, we require normality and independence, which can be relaxed in adaptation.
Assumption 2.1: The error term follows a standard normal distribution and is independent of the covariates (X, Z).
All the involved functions except our regression of interest are nuisance functions: the joint design density f_{X,Z}(x, z), the scale function σ(x, z) and the availability likelihood function w(x, z). Further, introduce a function I(x) := ∫_{[0,1]^D} f_{X,Z}(x, z)w(x, z)σ^{−2}(x, z) dz, which captures a key characteristic of our model setting, as we shall see shortly.

Assumption 2.2:
The joint design density f_{X,Z}(x, z) and availability likelihood w(x, z) are supported on [0, 1]^{D+1}. The functions f_{X,Z}(x, z), w(x, z) and σ(x, z) are bounded below from zero and bounded above. In addition, w(x, z) cannot exceed 1. The function I(x) is Riemann integrable on [0, 1].
Given the M-sample (X, Z, AY, A)^n, denote by m̂*(x) a representative dealer's estimator that exploits the privileged knowledge of the parameters (m_0, ρ_n, M_n, α, Q) of the dealt class F and the chosen nuisance functions (f_{X,Z}, σ, w). There are two conventional approaches to evaluating the lower bound for the minimax MISE in nonparametric curve estimation. One is via Le Cam's asymptotic equivalence with a Gaussian filtering model, but it is limited under the missing data setting; see Brown and Low (1996) and Efromovich and Pinsker (1996). The other calculates the minimax risk under uniform local asymptotic normality (ULAN), which is applicable in a variety of situations (Efromovich 2000). Fortunately, the second method works for our estimation problem.
The first part of the above result establishes a lower bound for the minimax risk of dealer's estimators, while the second part verifies its sharpness. Define J(α, Q, d) := [P(α, Q)]^{−(2α+1)/(2α)}/d; then we can rewrite (4) accordingly. The quantity J(α, Q, d) is the nonparametric counterpart of the Fisher information in the classic Cramér-Rao bound of parametric inference. The well-known exponent 2α/(2α + 1) indicates the slower convergence in nonparametric problems. The renowned constant P(α, Q) was discovered in the seminal work of Pinsker (1980), and it is determined by the implied Sobolev ellipsoid in the class F.
The form of the lower bound is general over a variety of nonparametric estimation problems, and the coefficient of difficulty characterises the model setting. Here, d := d(f_{X,Z}, σ, w) = ∫_0^1 [I(x)]^{−1} dx in (5) is a functional of the underlying nuisance functions. For the pure univariate setting (1) with MAR response, Efromovich (2011) established the sharp lower bound with the coefficient of difficulty d_1 := ∫_0^1 σ²(x)[f_X(x)w(x)]^{−1} dx. Sharp adaptive estimation exists even without using the scale due to this relatively simple setting. Let us explain why any estimator ignoring auxiliary covariates in the scale incurs a loss of efficiency under the studied model (2). Consider, for comparison, a simpler missing mechanism depending on X only, that is, a univariate w(x). Then the underlying coefficient of difficulty (5) becomes d = ∫_0^1 w^{−1}(x)[∫_{[0,1]^D} f_{X,Z}(x, z)σ^{−2}(x, z) dz]^{−1} dx, which, by the Cauchy-Schwarz inequality, is smaller than the coefficient of difficulty implied by a univariate estimation procedure, d_1, unless the scale is free of the auxiliary covariates. Such procedures cannot attain the lower bound (4): a loss of efficiency.
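The efficiency gap can be checked numerically. The sketch below compares a coefficient of difficulty of the form d = ∫_0^1 [I(x)]^{−1} dx (our reading of (5)) with the univariate d_1 under hypothetical choices σ²(x, z) = exp(2z), w(x) = 0.5 + 0.4x and a uniform design on [0, 1]²; all concrete forms are illustrative assumptions.

```python
import numpy as np

lam = 2.0
sigma2 = lambda z: np.exp(lam * z)        # hypothetical squared scale, driven by z only
w = lambda x: 0.5 + 0.4 * x               # hypothetical univariate availability likelihood

# Midpoint grids for the one-dimensional integrals
g = (np.arange(10_000) + 0.5) / 10_000

# I(x) = int f_{X,Z} w sigma^{-2} dz; the design is uniform and w is free of z here
inner = np.mean(1.0 / sigma2(g))          # int_0^1 sigma^{-2}(z) dz
d = np.mean(1.0 / (w(g) * inner))         # d = int_0^1 [I(x)]^{-1} dx

# A univariate procedure effectively faces sigma^2(x) = E[sigma^2(X, Z) | X = x]
sigma2_bar = np.mean(sigma2(g))           # int_0^1 sigma^2(z) dz
d1 = np.mean(sigma2_bar / w(g))           # d_1 = int_0^1 sigma^2(x)[f_X(x) w(x)]^{-1} dx
```

For these choices d ≈ 3.40 while d_1 ≈ 4.69, so the scale-weighted coefficient of difficulty is strictly smaller, as the Cauchy-Schwarz argument predicts.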

Sharp minimax estimation
Our aim is a data-driven procedure adaptive to the unknown smoothness of the regression function of interest, the missing mechanism and the underlying nuisance functions. In nonparametric estimation, we can relax the distributional assumption (6) on the error terms: only conditional independence is required, and the normality of Assumption 2.1 is replaced by a finite fourth moment condition, which is important in adaptation.
As noted in the Introduction, we establish the sharp lower bound (4) via a Pinsker-type linear minimax oracle, which is impractical as a basis for a data-driven estimator. We want to find an oracle that is more informative but easier to adapt to. A good starting point is a blockwise shrinkage procedure with smoothing coefficients using underlying functionals of the regression m(x), based on an idea dating back to Efromovich and Pinsker (1984). Let {B_k, k = 1, 2, . . .} be a set of ordered blocks partitioning the nonnegative integers such that max{j : j ∈ B_k} < min{j : j ∈ B_{k+1}}, and let L_k be the number of frequencies in block B_k. We propose a sharp minimax blockwise shrinkage oracle's estimator employing an equal smoothing weight for frequencies in the same block (see Supplementary Materials), where θ̂_j estimates the Fourier coefficient θ_j = ∫_0^1 m(x)ϕ_j(x) dx of the underlying regression and the smoothing coefficient μ_k := Θ_k/(Θ_k + dn^{−1}) involves the underlying coefficient of difficulty d (5) and the Sobolev functional Θ_k := L_k^{−1} Σ_{j∈B_k} θ_j², which gives the average power of the regression function m(x) over block B_k.
Then we can establish the adaptive procedure, the statistician's estimator, based on this oracle's estimator. Note that the cutoff K_n in (7) provides the required number of estimated Fourier coefficients for rate-optimal estimation, Σ_{k=1}^{K_n} L_k > n^{1/(2α+1)} ln(ln(n + 20)). Although it depends on the underlying smoothness parameter of the class F, we can directly set α = 1 to obtain a large enough cutoff covering all classes with α ≥ 1. If the statistician has some knowledge about smoothness, for example α ≥ α_0, a smaller adaptive cutoff can be solved from the condition for K_n. A general framework of the adaptive blockwise shrinkage procedure is achieved by the standard plug-in technique, which replaces underlying functionals and nuisance functions with statistics. Here, a hard thresholding technique is also employed to further reduce variation by removing noise. We will propose the corresponding component statistics of estimator (8) based on the available knowledge about the nuisance functions. Let us discuss the details of the above blockwise shrinkage procedure. For convenience, introduce the increasing sequences b_n := ⌊ln(n + 20)⌋, c_n := ⌊ln(b_n)⌋ and r = r_n := ⌊n/(19c_n)⌋, where ⌊x⌋ denotes the greatest integer that is less than or equal to x. We already used c_n in the condition on the cutoff K_n to take care of any constant factor. Note that r is the number of observations in a subsample. As will be seen shortly, 19 is the largest number of subsamples used to construct component statistics besides θ̂_j. One can also choose a more balanced partition based on the number of required subsamples, for example, set r := ⌊n/(2c_n)⌋ for the case of known nuisance functions in Section 3.1. From now on, it is assumed that n is large enough that r > 3. The choice of block scheme is rather flexible, and we employ an exponentially growing one introduced by Cai, Low, and Zhao (2009).
Set the block length L_k := 1 for low frequencies and L_k := ⌊(1 + b_n^{−1})^k⌋ for k > b_n, since low-frequency Fourier coefficients are usually significant relative to the noise for common well-behaved functions. Then our block scheme consists of b_n singleton blocks followed by growing blocks B_k. Note that the block lengths {L_k} form a slowly increasing geometric sequence with the largest one L_{K_n} = O(n^{1/3} ln(ln(n))/ln(n)), which is a little smaller than the required number of frequencies for rate-optimal estimation. It is also easy to show that the total number of blocks K_n is of order ln²(n).
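The block scheme above can be sketched directly; the stopping rule uses the α = 1 cutoff condition Σ_k L_k > n^{1/3} ln(ln(n + 20)), and the implementation details (loop bounds, `max(1, …)` guard) are our own illustrative choices.

```python
import math

def block_scheme(n):
    """Blocks of consecutive frequencies: b_n singletons for low frequencies, then
    geometrically growing lengths L_k = floor((1 + 1/b_n)^k) for k > b_n, following
    the exponential scheme of Cai, Low and Zhao (2009)."""
    b_n = math.floor(math.log(n + 20))
    target = n ** (1 / 3) * math.log(math.log(n + 20))   # alpha = 1 cutoff condition
    lengths = [1] * b_n
    k = b_n + 1
    while sum(lengths) <= target:
        lengths.append(max(1, math.floor((1 + 1 / b_n) ** k)))
        k += 1
    blocks, start = [], 0
    for L in lengths:
        blocks.append(list(range(start, start + L)))
        start += L
    return blocks

blocks = block_scheme(1000)
```

For n = 1000 this gives b_n = 6 singleton blocks followed by slowly growing ones, and the blocks partition an initial segment of the nonnegative integers.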

Known design, availability likelihood and scale function
Let us begin with the case where all nuisance functions are given. Then the coefficient of difficulty is also known because it is a functional of the nuisance functions. Although the suggested estimator does not exploit dealer's information such as the parameters of the class F, it can be regarded as a dealer's estimator.
As noted before, the mutually independent sequences of component statistics θ̂_j's and Θ̂_k's are based on distinct subsamples of (X, Z, AY, A)^n with r observations each. We estimate the Sobolev functional Θ_k by a U-statistic (9). All the remaining data are used in estimating the Fourier coefficients θ_j's, the backbone of a series estimator, via θ̂_j in (10) with the variation reduction term (11), where the set of irrelevant frequencies up to b_n, J_{−j} = {0, 1, . . . , b_n}\{j}, is enough for the variation reduction purpose because the Fourier coefficients decay really fast. Without this operation, the convergence constant of the mean squared error of θ̂_j would be inflated by an additional ∫_0^1 [m(x)]² dx (the L_2 norm of the underlying regression), a larger induced coefficient of difficulty than the underlying d (5).

Proposition 3.1: Consider regression model (2) with regression error satisfying (6). The design density, availability likelihood and the scale are given, and Assumption 2.2 holds. Then the blockwise shrinkage regression estimator (8) with estimated Fourier coefficients θ̂_j defined in (10), estimated Sobolev functionals Θ̂_k defined in (9) and the underlying coefficient of difficulty (5), d̂ = d, is adaptive to the studied functional class and sharp minimax.
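For intuition about the known-nuisance Fourier-coefficient statistic, here is a weighted sample-mean sketch. The weighting σ^{−2}/I is our unbiasedness-motivated reading of the construction, not the paper's exact formula (10) (in particular, it omits the variation reduction term), and every concrete data-generating choice below is a hypothetical assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
lam = 1.0

def phi(j, x):
    """Cosine basis on [0, 1]."""
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * x)

# Hypothetical data-generating process
X, Z = rng.uniform(size=n), rng.uniform(size=n)
m = lambda x: 1.0 + 0.5 * np.sqrt(2.0) * np.cos(np.pi * x)   # theta_0 = 1, theta_1 = 0.5
sigma2 = lambda x, z: np.exp(lam * z)                         # squared scale
w = lambda x, z: 0.5 + 0.4 * x                                # availability likelihood
Y = m(X) + np.sqrt(sigma2(X, Z)) * rng.standard_normal(n)
A = (rng.uniform(size=n) < w(X, Z)).astype(float)             # MAR availability indicator

# I(x) = int f w sigma^{-2} dz; the design is uniform and w is free of z here
zg = (np.arange(10_000) + 0.5) / 10_000
I_of = lambda x: (0.5 + 0.4 * x) * np.mean(np.exp(-lam * zg))

def theta_hat(j):
    """Sample-mean estimate of theta_j: the weighting sigma^{-2}/I makes
    E[A Y phi_j(X) sigma^{-2}(X,Z) / I(X)] = theta_j under MAR."""
    return np.mean(A * Y * phi(j, X) / (sigma2(X, Z) * I_of(X)))
```

With the seed fixed, the estimates land close to the true coefficients (1, 0.5, 0) despite roughly 30% of the responses being missing.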

Unknown nuisance functions
Now we consider a fully adaptive estimator; that is, the statistician does not know any of the nuisance functions: the scale σ(x, z), the joint design density f_{X,Z}(x, z) and the availability likelihood w(x, z). Since they are functions of the covariates, let us introduce some notation for a multivariate projection series estimator. Define a tensor-product cosine basis on [0, 1]^p, ψ_s(z) := Π_{k=1}^p ϕ_{s_k}(z_k) with s = (s_1, . . . , s_p), where η_s is the Fourier coefficient with respect to this tensor-product cosine basis.
Note that the joint design density and availability likelihood appear together in the component estimators of the previous section. We can therefore use an estimator for the product f_{X,Z}(x, z)w(x, z) = f_{X,Z,A}(x, z, 1). Alternatively, separate estimators for f_{X,Z}(x, z) and w(x, z) also work, but this only introduces unnecessary complexity in the asymptotic analysis. Some regularity conditions on the joint density f_{X,Z,A}(x, z, 1) (or the conditional density f_{X,Z|A}(x, z | 1)) are required to keep the overall performance of our blockwise shrinkage estimator from the estimation errors emerging in further adaptation. For example, introduce a (D + 1)-dimensional analytic class A with finite positive numbers Q and β_k, k = 0, 1, . . . , D. Here, the parameters (β_0, β_1, . . . , β_D, Q) define a region where a complex-valued function f(x, z) ∈ A can be expanded into a convergent power series; see Efromovich (1999). In the Supplementary Materials, good approximation properties of projection series estimators for this analytic class are established. For simplicity in the asymptotic analysis, introduce nine independent estimators of the joint density f_{X,Z,A}(x, z, 1), that is, for t = 1, . . . , 9, as in (13), where the cutoff J_a := b_n c_n is presented using the uniform norm notation ∥(i, s)∥_∞ := max(i, s_1, . . . , s_D). Note that J_a for an analytic-class density is only a little larger than the cutoff of the rough estimator m̂_{−j}(x) (11), thanks to the good convergence rate associated with A. Further, we employ adaptive truncation to avoid divide-by-zero problems, since these estimators appear in the denominators of many component statistics. We also impose a mild regularity assumption on the scale: the statistician is supposed to know the range c_* ≤ σ²(x, z) ≤ c^* for some positive constants c_* and c^*. According to regression model (2), the relation E{[Y − m(X)]² | X, Z} = σ²(X, Z) motivates the following truncated projection estimator for the squared scale function σ²(x, z).
Here σ̂_{is} estimates the underlying Fourier coefficient σ_{is} := ∫_{[0,1]^{D+1}} σ²(x, z)ϕ_i(x)ψ_s(z) dx dz, with a rough estimate m̂_1(x) (15) for the regression function. Together with the density estimator, we obtain a plug-in estimator for the coefficient of difficulty. Next we update each component of the estimated Fourier coefficient θ̂_j and Sobolev functional Θ̂_k with estimated nuisance functions; the variation reduction term (11) in θ̂_j changes accordingly. An independent copy σ̂²(x, z) of the above scale function estimator (14) is required for mutual independence between θ̂_j and Θ̂_k. Since there is no smoothness assumption on the scale, we can further improve θ̂_j with a smoother scale function estimator via Fejér approximation, which preserves the original range; see Bary (1964).
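The range-preserving property of the Fejér (Cesàro) approximation can be checked numerically. The coefficients below are arbitrary stand-ins for an estimated squared scale, not the paper's estimates; the point is only that the tapered sum stays inside the range of the original cosine polynomial because the Fejér kernel is nonnegative.

```python
import numpy as np

def phi(j, x):
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * x)

def fejer_mean(theta, x):
    """Fejer (Cesaro) average of the cosine partial sums S_0, ..., S_{N-1},
    i.e. coefficients tapered by (1 - j/N); the nonnegative Fejer kernel keeps
    the result inside the range of the function being smoothed (Bary 1964)."""
    N = len(theta)
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for j, t in enumerate(theta):
        out += (1.0 - j / N) * t * phi(j, x)
    return out

# A positive cosine polynomial standing in for a squared-scale estimate
theta = np.array([1.5, 0.5, 0.2, 0.1])
x = (np.arange(2_000) + 0.5) / 2_000
raw = sum(theta[j] * phi(j, x) for j in range(len(theta)))   # un-smoothed partial sum
smooth = fejer_mean(theta, x)
```

The tapering leaves the constant term untouched, so the average level is preserved while the extremes are pulled inward.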
The Fejér-smoothed estimator is also used in the estimator of I(x) = ∫_{[0,1]^D} f_{X,Z}(x, z)w(x, z)σ^{−2}(x, z) dz. Finally, we can propose our fully adaptive estimators for the Fourier coefficients θ̂_j and the Sobolev functionals Θ̂_k over blocks B_k. Let us comment on the above result (Proposition 3.2). If the design density or the availability likelihood is known, for example in some controlled experimental design with a certain data collection or sampling rule, our procedure with those underlying functions is also sharp minimax and may have better performance in small samples. Details are left to the Supplementary Materials. Note that component statistics such as the scale estimator only require logarithmic cutoffs, as in the case of known density and availability likelihood. The analytic class (12) is easy to adapt to since it implies infinite differentiability. A nice discussion of alternative weaker regularity conditions can be found in Efromovich (1999), Wasserman (2006) and Tsybakov (2009). Here, we consider a variant of the multivariate Sobolev class, which requires the smoothness in each coordinate to match the number of variables due to the curse of dimensionality. Introduce a class S(p, Q) of p-variate functions with isotropic smoothness (21). Specifically, we assume the joint density f_{X,Z,A}(x, z, 1) belongs to the (D + 1)-variate Sobolev class S(D + 1, Q). Note that our regression function of interest m(x) also belongs to this class with p = 1 due to the smoothness parameter α ≥ 1.
Since functions from the Sobolev class (21) are less smooth than those from the analytic class (12), a larger cutoff of the projection estimates is required to preserve the convergence rate of the plug-in components in our blockwise shrinkage regression estimator (8). Let us first introduce eight truncated projection density estimators with a new cutoff J_s = ⌊n^{1/(3(D+1))}⌋, that is, for t = 1, . . . , 8, as in (22). Then the coefficient of difficulty estimate is an analogue of (16) in the previous analytic case. Also, the subsamples of the corresponding component statistics, such as m̂_1(x) and σ̂_{1,is} in σ̂²_1(x, z) following (14), are renumbered because of the change in (22).
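A truncated tensor-product projection density estimator of this kind can be sketched as follows for D = 1; the sample size, the beta-distributed design and the small cutoff are illustrative assumptions rather than the paper's J_s formula.

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(j, t):
    return np.ones_like(t, dtype=float) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * t)

def projection_density_2d(X, Z, J):
    """Truncated tensor-product cosine projection density estimator on [0, 1]^2:
    eta_hat[i, s] = mean(phi_i(X) * phi_s(Z)), kept for all i, s <= J."""
    eta = np.array([[np.mean(phi(i, X) * phi(s, Z)) for s in range(J + 1)]
                    for i in range(J + 1)])

    def fhat(x, z):
        x, z = np.asarray(x, dtype=float), np.asarray(z, dtype=float)
        out = np.zeros(np.broadcast(x, z).shape)
        for i in range(J + 1):
            for s in range(J + 1):
                out = out + eta[i, s] * phi(i, x) * phi(s, z)
        return out

    return eta, fhat

n = 20_000
X = rng.beta(2.0, 2.0, size=n)   # a smooth non-uniform design, chosen for illustration
Z = rng.uniform(size=n)
eta, fhat = projection_density_2d(X, Z, J=3)

grid = (np.arange(200) + 0.5) / 200
xx, zz = np.meshgrid(grid, grid)
mass = fhat(xx, zz).mean()       # approximates the integral of fhat over [0, 1]^2
```

Because ϕ_0 ≡ 1, the leading coefficient η̂_{0,0} equals 1 exactly, so the truncated estimate always integrates to one over the unit square.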
Next we update the component statistics of the estimated Fourier coefficient θ̂_j. A polynomial-level cutoff is used in the variation reduction term with index set J_{s,−j} = {0, 1, . . . , ⌊n^{1/3}⌋}\{j} due to the slower convergence rate associated with S (21). The Fejér approximation of the inverse squared scale follows the same procedure as in (18) with an independent copy of the estimator σ̂^{−2}_1(x, z). Note that the approximation properties of the projection density estimator (22) slow down under the Sobolev class assumption, so the proof of the analytic case cannot go through for θ̂_j. Hence we want to control the bias of the density estimator to exploit the convergence rates of the other components of the θ̂_j's and Θ̂_k's. Let us define a new cutoff J*_s = ⌊n^{1/(2(D+2))}⌋, which is much larger than J_s except in the case of a single auxiliary covariate (D = 1). Then we introduce three more truncated projection series estimators, for t = 1, 2 and 3: the first is used in the estimator of I(x) = ∫_{[0,1]^D} f_{X,Z}(x, z)w(x, z)σ^{−2}(x, z) dz, while the other two enter the estimator of the Sobolev functionals Θ_k. Since the choice J_s is optimal, the larger cutoff J*_s incurs a worse convergence rate. However, as shown in the proof, the implied reduction in bias is the key to checking the sufficient conditions for a sharp minimax estimator. We also employ the variation reduction technique used in θ̂_j to further improve the U-statistic type estimator of Θ_k. Hence, two more estimates are introduced, where J_{s,−j} is the same index set as in (24) for m̂_{−j}(x).
Then we can propose new estimators for the Sobolev functionals Θ̂_k (26), U-statistics with normalising factor 2/(r(r − 1)), and for the Fourier coefficients θ̂_j (27). The following result modifies Proposition 3.2 for nuisance functions from the Sobolev class. We impose a mild smoothness assumption on the scale function in order to simplify the verification of the sufficient conditions with the rough scale estimator. A possible solution under weaker assumptions on the scale function can be found in Efromovich (2013).

Proposition 3.3: Consider regression model (2) with regression error satisfying (6). Assumption 2.2 holds, and both the joint density f_{X,Z,A}(x, z, 1) and the scale σ(x, z) belong to the Sobolev class S (21). In addition, there are two finite positive constants, c_* and c^*, such that c_* ≤ σ²(x, z) ≤ c^*. Then the blockwise shrinkage regression estimator (8) with θ̂_j defined in (27), Θ̂_k defined in (26) and d̂ defined in (23) is adaptive and sharp minimax.

Extension: a general additive model
Let us consider a natural extension of regression model (2) obtained by introducing a nuisance additive term g(z), that is, a general additive model (28), where g(z) integrates to zero on [0, 1]^D for identifiability. This may look like only a small deviation from our topic, given the omitted variable problem of a univariate procedure. The additive term does not affect the asymptotic properties of the regression estimation of interest in the homoscedastic setting (Hastie and Tibshirani 1990). Fortunately, for our heteroscedastic model (28), the dealer's lower bound of the minimax risk in Theorem 2.1 is still valid since both oracle and dealer can reduce this additive model estimation problem to the original one (2) by subtracting the known additive component from the response. In this section, we briefly discuss sharp adaptive estimation for the case of an analytic joint density following Proposition 3.2. Some regularity condition on the unknown function g(z) is required for adaptation: suppose it also belongs to the Sobolev class S (21) of D-variate functions. Since a simple projection estimator of g(z) uses one more density estimate, ten independent copies f̂^t_{X,Z,A}(x, z, 1) are required following (13). Now we can control the variation in the scale function estimator coming from both the regression m(x) of interest and the additive component g(z) by estimating the general regression function q(x, z) := m(x) + g(z) in place of the statistic m̂_1(x) (15) of the previous setting (2). Then we have a new estimate for the Fourier coefficients of the squared scale function. Plugging it into the procedures (14) and (16), we arrive at a new squared scale function estimate σ̂²_1(x, z) and a coefficient of difficulty estimate further adapted to the unknown additive component. Similarly, the variation reduction term in θ̂_j also includes a D-variate projection series estimate of the additive component g(z) besides m̂_{−j}(x), where m̂_{−j}(x) is the same estimator (17) based on the subsample of observations from 12r + 1 to 13r.
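A much-simplified sketch of recovering the additive component by projection is given below. It assumes a uniform independent design and fully observed responses (unlike the paper's MAR setting), so the Fourier coefficients of g in z can be read off directly from E[Yψ_s(Z)]; all concrete functions and the cutoff are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40_000

def phi(j, t):
    return np.ones_like(t, dtype=float) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * t)

# Hypothetical additive model Y = m(X) + g(Z) + 0.5 * eps, uniform independent design
m = lambda x: 1.0 + 0.5 * np.sqrt(2.0) * np.cos(np.pi * x)
g = lambda z: 0.3 * np.sqrt(2.0) * np.cos(np.pi * z)      # integrates to zero on [0, 1]
X, Z = rng.uniform(size=n), rng.uniform(size=n)
Y = m(X) + g(Z) + 0.5 * rng.standard_normal(n)

# For s >= 1 and this design, E[Y phi_s(Z)] = int_0^1 g(z) phi_s(z) dz,
# since E[phi_s(Z)] = 0 and X is independent of Z.
J_g = 8                                                    # illustrative cutoff
g_hat_coef = np.array([np.mean(Y * phi(s, Z)) for s in range(1, J_g + 1)])

def g_hat(z):
    z = np.asarray(z, dtype=float)
    return sum(c * phi(s, z) for s, c in enumerate(g_hat_coef, start=1))

g_at_quarter = g_hat(np.array([0.25]))[0]                  # true g(0.25) = 0.3
```

Starting the index at s = 1 enforces the zero-integral identifiability constraint automatically, because the constant term is excluded.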
Note that a polynomial-order cutoff is required for an additive component function from the Sobolev class S, that is, an index set J_g := {0, 1, . . . , J_g}^D \ {0}^D with J_g = ⌊n^{1/(3D)}⌋. For the Fourier coefficient estimator, we use a smooth Fejér approximation (18) in the scale weighting from a distinct copy of the truncated projection series estimate σ̂²(x, z), and set the final adaptive components accordingly. Our estimation procedure can still achieve the same dealer's sharp lower bound (4) in spite of an unknown additive regression on the auxiliary covariates. Let us make some comments about complete-case estimation. Most statistical packages use listwise deletion by default in the analysis of data with missing observations. Complete-case analysis is reliable in many situations, and its justification can be found in Little and Rubin (2019), Molenberghs et al. (2014) and Tsiatis (2006). Note that our blockwise shrinkage procedure provides a complete-case sharp minimax estimator with all component statistics calculated from the N = Σ_{l=1}^n A_l complete observations (X, Z, AY, A = 1)^{N(n)} of the M-sample of size n. We can further optimise our estimation procedure with some practical considerations. For example, set the regression estimator to zero when the number of complete cases is too small (N = O(b_n)). It is easy to show that the probability of having so few complete cases is negligible with respect to the convergence rate of the minimax bound.

Numerical studies
Asymptotic theory indicates that efficient data-driven estimation is possible for the basic model (2) and the general additive model (28) with responses missing at random. Numerical studies can shed light on its feasibility for small datasets.
Let us briefly describe the statistical experiments in the numerical analysis as well as the estimators involved. We consider model (2) with just one auxiliary covariate (D = 1), Y = m(X) + σ(X, Z)ε. Two candidates for the regression m(x) are selected from the recommended corner functions of Efromovich (2018). Here, we discuss the results for the bell-shaped 'Normal' function, while those for the more complicated 'Bimodal' one and other experimental details can be found in the Supplementary Materials. The error term ε is an i.i.d. standard normal sequence. The scale function, σ(x, z) = exp(λz/2), depends only on z in order to emphasise how an auxiliary covariate in the scale affects regression estimation, where the parameter λ ∈ {1, 2, 3} controls heteroscedasticity. Covariates X and Z are independent of each other with a uniform design density, f_{X,Z}(x, z) = I((x, z) ∈ [0, 1]²). Since we also want to compare the impact of auxiliary covariate-driven heteroscedasticity on estimation with univariate procedures, the MAR mechanism involves the predictor X only. The availability likelihoods w_1(x) = 0.5 + 0.4x, w_2(x) = 0.4 + 0.4x and w_3(x) = 0.3 + 0.4x are used to test robustness under different missing intensities. It is easy to show that they correspond to scenarios with 30%, 40% and 50% of observations missing on average under a uniformly distributed predictor. For the simulations with additive model (28), we choose three polynomial additive components, g_1(z) = z − 1/2, g_2(z) = z² − 1/3 and g_3(z) = z³ + z − 3/4, which integrate to 0 on the unit interval for identifiability.
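The experimental design above can be sketched as an M-sample generator. The bell-shaped regression used below is a hypothetical stand-in for the 'Normal' corner function of Efromovich (2018), whose exact formula is not reproduced here; the remaining choices follow the stated design with λ = 2 and w_1.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_msample(n, lam=2.0, w=lambda x: 0.5 + 0.4 * x, rng=rng):
    """One M-sample (X, Z, AY, A): uniform independent (X, Z) on [0, 1]^2,
    scale exp(lam * z / 2), and MAR responses with availability likelihood w(x)."""
    X, Z = rng.uniform(size=n), rng.uniform(size=n)
    m = np.exp(-((X - 0.5) ** 2) / (2 * 0.1 ** 2))   # hypothetical bell-shaped regression
    Y = m + np.exp(lam * Z / 2.0) * rng.standard_normal(n)
    A = (rng.uniform(size=n) < w(X)).astype(int)      # availability indicator
    return X, Z, A * Y, A

X, Z, AY, A = simulate_msample(100_000)
```

With w_1(x) = 0.5 + 0.4x and a uniform predictor, the expected availability is 0.7, matching the stated 30% average missing rate, and the responses are zeroed out exactly where A = 0.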
Here, the proposed fully adaptive estimator of Section 3.2 is our statistician's estimator (S-estimator). As noted earlier, the estimator knowing the underlying nuisance functions in Section 3.1 can be regarded as a dealer's estimator (D-estimator), and it is a good benchmark for measuring the performance of the S-estimator. Also, some modifications are recommended for small-sample performance. We also take two good univariate procedures for comparison, a series E-estimator and a kernel K-estimator. The E-estimator (Efromovich 2018) is a simple and reliable series estimator that performs well for both directly observed data and missing or modified data. The kernel approach is popular in nonparametric studies, and Cheng (1994) showed that a complete-case variant of the Nadaraya-Watson estimator provides satisfactory performance under MAR response. Asymptotic theory indicates that all of the above methods achieve the same optimal convergence rate of the MISE risk, but the univariate ones, ignoring auxiliary covariates in the scale function, suffer a loss of efficiency. In the scattergrams of Figure 1, it is not easy to envision the underlying regression without the guidance of the underlying 'Normal' corner function (solid line). Some points look like outliers in the left XY-scattergram, but we know this is just an illusion caused by heteroscedasticity from the auxiliary covariate. For example, the suspect outlier near the top right corner of the XY-scattergram corresponds to the highest point around Z = 0.6 in the ZY-scattergram, which is explained by the monotone scale function increasing in z. This additional volatility from the auxiliary covariate harms the good univariate procedures. However, we can see that the impact of responses missing at random on regression estimation may be two-edged. On the one hand, the missing mechanism reduces the available sample size for estimation. On the other hand, it may also remove the spurious outliers misleading pure univariate estimators.
Here, the performance of the K-estimator even improves in the M-sample due to a better fit in the right tail.
The two diagrams at the bottom of Figure 1 also give the corresponding integrated squared errors (ISE), calculated as ∫_0^1 [m̂(x) − m(x)]² dx for each estimate m̂(x). These values may conflict with our visual intuition of relative performance since the ISE is a global measure. All estimates succeed in catching the symmetric bell shape of the 'Normal' corner function in this particular simulation. The dot-dashed curve of the S-estimate closely follows the short-dashed curve of the D-estimate, which shows good adaptation of our data-driven estimator. Comparing the two scenarios, our estimator succeeds in handling the missing mechanism. The E-estimate and K-estimate also do a good job, and their worse tail behaviour reflects a loss of efficiency due to ignoring heteroscedasticity from the auxiliary covariate. Note that the behaviour of the K-estimate also reflects the different logic of kernel methods versus orthogonal series estimators. We want to remind readers of the randomness in the simulation: large variability in outcomes can be expected for small samples if the simulation is repeated multiple times. So we use intensive Monte Carlo studies to confirm our assertions based on large-sample theory. Let us introduce the setting of the Monte Carlo simulation study of the four estimates (D-estimate, S-estimate, E-estimate and K-estimate). In Table 1, we consider five underlying H-sample sizes, n ∈ {100, 200, 500, 1000, 2000}. Together with the nine combinations of candidate scale functions and availability likelihood functions, there are 15 scenarios for underlying H-samples and 45 scenarios for missing-data M-samples. Pairwise relative performance is measured by the average ratio of ISEs over 1000 simulations. The first cell in each block gives the relative performance of the S-estimate with respect to the D-estimate, while the remaining two numbers in the first row compare it with the E-estimator and K-estimator, so that we can check the suggested advantage of the S-estimator over pure univariate procedures.
The second row is devoted to the relative performance of the S-estimator against the D-estimate under different additive components.
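The ISE and the average ISE ratio that summarise pairwise relative performance can be sketched as follows. This is a minimal illustration, not the paper's actual procedure: the sampling routine `simulate` and the estimator wrappers `est_a`, `est_b` are hypothetical placeholders, and the integral over [0, 1] is approximated by the trapezoid rule on a grid.

```python
import numpy as np

def ise(m_hat, m, grid):
    """Integrated squared error on [0, 1], approximated by the trapezoid rule."""
    d = (m_hat(grid) - m(grid)) ** 2
    return float(np.sum((d[:-1] + d[1:]) * np.diff(grid) / 2.0))

def avg_ise_ratio(est_a, est_b, m, simulate, n, n_rep=1000, n_grid=201):
    """Average ratio of ISEs of two estimators over n_rep simulated samples.

    simulate(n) draws a sample; est_a(sample) and est_b(sample) return fitted
    curves (callables on the grid). All of these are illustrative placeholders.
    """
    grid = np.linspace(0.0, 1.0, n_grid)
    ratios = [
        ise(est_a(s), m, grid) / ise(est_b(s), m, grid)
        for s in (simulate(n) for _ in range(n_rep))
    ]
    return float(np.mean(ratios))
```

A ratio below one in a cell then indicates that the first estimator attains smaller integrated squared error on average over the simulation runs.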
Let us discuss the results for the 'Normal' regression function in Table 1. The relative performance of the S-estimate against the D-estimate, the first number in each block, converges as the sample size increases along each row, which supports the assertion that the statistician's estimator mimics the performance of the dealer's estimator. Since the three ratios in the second row, for the general additive model, exhibit a similar trend, we can say that the S-estimator succeeds in adapting to the unknown nuisance additive component. These ratios increase slightly under severe heteroscedasticity and strong missing, where the dealer's knowledge of the nuisance functions is more valuable. Next we check the relative efficiency against pure univariate procedures that ignore the heteroscedasticity due to the auxiliary covariate. Our S-estimate dominates both the E-estimate and the K-estimate, as shown by the second and third numbers in each block. This advantage grows with the sample size as well as with the heteroscedasticity, which confirms the power of the S-estimator suggested by the asymptotic theory. Taking into account the complicated structure of the S-estimate, it does a good job of adapting to the missing mechanism. The ratios of the E-estimate and the K-estimate also reveal the characteristics of the different nonparametric methods.

Now let us consider application of the proposed methodology to a real data example. Ozone is an important trace gas in the atmosphere. Stratospheric ozone helps to protect the earth's surface from harmful ultraviolet radiation. However, low-level ozone is a harmful pollutant involved in photochemical smog, so monitoring and controlling low-level ozone is an environmental topic of great interest in both the mass media and scientific research. We analyse a small dataset of ozone levels at 147 sites in the midwestern region from Harezlak, Ruppert, and Wand (2018), which is a subset of the cooperative research programme of the National Institute of Statistical Sciences (NISS) and the U.S.
Environmental Protection Agency (EPA); see Nychka, Piegorsch, and Cox (1998). The distribution in the top left diagram of Figure 2 shows a strong ozone-level gradient towards the north, which suggests the base model (2) with the latitude of a station as the predictor X and the corresponding longitude as the auxiliary covariate Z. The response Y is the 8-hour (9AM–4PM) average ozone concentration measured in parts per billion (ppb). The top right scattergram exhibits a predictor–response view of the data, and the varying range of points spreading out along the latitude suggests heteroscedasticity. For example, the variation is relatively large around X = 42, with extreme values near the bottom and the top. These potential outliers may be explained within our framework of volatility driven by auxiliary covariates. All the estimated curves show an increasing trend, in line with our intuition. Linear regression is a reasonable simplification of the data distribution, while a nonparametric estimator can extract additional useful details. The solid curve and the dot-dashed curve denote the S-estimator and the E-estimator for the directly observed data, respectively. We can see that both nonparametric curves agree in the left tail, and the difference enlarges in the high-latitude region where the data are more variable. The curves reflect the estimators' understanding of the heteroscedasticity and also provide useful alternative views of the distribution. Moreover, 95% pointwise and simultaneous confidence bands of the S-estimator are provided according to Efromovich (2018). Since both the linear regression and the E-estimator are inside the pointwise confidence band, they are acceptable for this small dataset of 147 observations. Next, we consider the impact of the missing mechanism on the estimators by modifying responses using a Bernoulli availability variable with availability likelihood w(x, y, z) = 0.6 + 0.4x for MAR responses and w(x, y, z) = 0.8 for MCAR responses, which gives 120 complete cases in the bottom diagrams.
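The Bernoulli modification of the responses can be sketched as below. This is an illustrative sketch under an assumption not stated explicitly in the text: the predictor x is taken to be rescaled to [0, 1], so that the MAR availability likelihood w(x) = 0.6 + 0.4x stays inside [0, 1]; the function name and the NaN convention for unavailable responses are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_m_sample(x, y, mechanism="MAR"):
    """Turn an H-sample into an M-sample via Bernoulli availability.

    Assumes x is rescaled to [0, 1] so the MAR availability likelihood
    w(x) = 0.6 + 0.4 * x is a valid probability.
    """
    if mechanism == "MAR":
        w = 0.6 + 0.4 * x               # availability depends on the predictor
    else:                               # MCAR: constant availability
        w = np.full_like(x, 0.8)
    a = rng.binomial(1, w)              # availability indicators
    y_obs = np.where(a == 1, y, np.nan) # unavailable responses marked as NaN
    return y_obs, a
```

Complete cases are the observations with availability indicator equal to one; under the MCAR mechanism about 80% of responses are retained on average, and under this MAR mechanism the same is true for a uniform predictor, since the average of w(x) over [0, 1] is 0.8.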
Here, we also draw a magenta dotted curve for the H-sample S-estimator as a reference. Both the M-sample S-estimator and E-estimator are shifted down in the left diagram for the MAR sample, but our S-estimator still closely follows its H-sample counterpart except at the tails, where only a few points exist even in the H-sample. It is interesting that their understanding of the left tail is almost the same, while the high-variation cluster around latitude 42 seems to determine the location of the right tail. In the right diagram, both M-sample estimators are very close to their locations in the H-sample, since MCAR is a neutral missing mechanism. Again, all estimators are inside the confidence bands of the S-estimator. Although the sample size is reduced by missing, we obtain a narrower confidence band, since the missing mechanism removes some data points with high variation. Although realisations are influenced by randomness, Figure 2 supports the good performance of the proposed procedure on a small dataset.

Conclusion
In this paper, we consider a nonparametric regression problem complicated by both heteroscedasticity and MAR responses. It is known that an efficient nonparametric estimator for univariate regression does not require knowledge of the scale function, either for directly observed data or under the MAR response setting. However, explicitly estimating the scale function is important for preserving efficiency when the heteroscedasticity involves auxiliary covariates. Although MAR responses introduce bias into the data, we showed that efficient estimation is still possible. We developed the asymptotic theory of sharp minimax estimation under this heteroscedastic regression setting with missing data and proposed a data-driven efficient procedure. A general additive model is also considered as an extension, where sharp minimax estimation is established under some mild assumptions on the nuisance functions. Monte Carlo simulation studies and real data examples shed light on the feasibility of the asymptotic theory for small datasets. It may also be of interest in future work to consider the more challenging problem of MAR predictors under this heteroscedastic regression model with auxiliary covariates, where efficient estimation theory is not well developed even in the pure univariate setting.