Improved kth power expectile regression with nonignorable dropouts

The kth power expectile regression (ER) can balance robustness and estimation efficiency, interpolating between ordinary quantile regression and ER. Motivated by longitudinal ACTG 193A data with nonignorable dropouts, we propose a two-stage estimation procedure and statistical inference methods based on the kth power ER and empirical likelihood to accommodate both within-subject correlations and nonignorable dropouts. In the first stage, we construct bias-corrected generalized estimating equations by combining the kth power ER and inverse probability weighting approaches; the generalized method of moments is then utilized to estimate the parameters in the nonignorable dropout propensity based on sufficient instrumental estimating equations. In the second stage, to incorporate the within-subject correlations under an informative working correlation structure, we borrow the idea of the quadratic inference function to obtain improved empirical likelihood procedures. The asymptotic properties of the corresponding estimators and their confidence regions are derived. The finite-sample performance of the proposed estimators is studied through simulations, and an application to the ACTG 193A data is also presented.


Introduction
In population health, biological research, economics and the social sciences, data are often collected from every sampled subject at many time points; such data are referred to as longitudinal data. Our study is motivated by the following longitudinal data from the AIDS Clinical Trial Group 193A (ACTG 193A), a study of HIV-AIDS patients with advanced immune suppression. After the treatment was applied, CD4 counts were scheduled to be collected from each patient every 8 weeks. However, because of adverse events, low-grade toxic reactions, the desire to seek other therapies, death and other reasons, the dropout rates at the first four follow-up times are 31.5%, 42.4%, 55.4% and 65.3%, respectively. In addition, previous experience from doctors and from Cho et al. [3] indicates that a steep decline in CD4 cell counts signals disease progression, and patients with low CD4 cell counts are more likely to drop out from the scheduled study visits than patients with normal CD4 cell counts. Therefore, nonresponse of the CD4 cell counts is likely related to the CD4 values themselves and is nonignorable [31]. On the other hand, it can be checked that the distribution of CD4 cell counts is skewed. Hence, mean regression of the response may not appropriately assess the longitudinal change in the CD4 cell counts.
Compared to conditional mean regression, in studies of, for example, children's growth in weight or high medical costs, we often employ quantile regression (QR; [11]) and expectile regression (ER; [1,18]), which have many advantages over traditional mean regression methods, especially in the presence of heavy-tailed errors. Moreover, both QR and ER can capture a complete picture of the relationship between the response and the predictors. On the other hand, although the least absolute deviation criterion makes the QR method robust, its check function is nondifferentiable at some points, which makes computation difficult. Unlike the QR method, the ER replaces the L1 loss by a quadratic L2 loss, which leads to asymmetric least squares; the estimates are then easy to compute but more sensitive to the tails of distributions. Overall, the computation of ER is much easier than that of QR, but QR is more robust than ER. In order to simplify computation and balance robustness and efficiency, the kth power ER was proposed by Jiang et al. [9]. Specifically, for a fixed τ ∈ (0, 1), the τth kth power expectile of a random variable Y is defined as the minimizer μ_τ,k(Y) of the expected loss E{Q_τ^k(Y − θ)} over θ ∈ R, where Q_τ^k(u) = |τ − I(u < 0)| |u|^k is called the asymmetric kth power loss function. The loss functions of QR and ER are the two special cases of the kth power loss with k = 1 and k = 2, respectively. For 1 < k < 2, the loss function of the kth power ER falls between the loss functions of QR and ER, so the method can be viewed as a trade-off between the QR and ER methods. For the longitudinal ACTG 193A data, a major issue is how to take the within-subject correlation structure into account to improve estimation efficiency. The most commonly used approach is generalized estimating equations (GEEs; Liang and Zeger [13]) based on a working correlation matrix. Recently, Huang et al. [8], Bai et al.
[2], Leng et al. [12], Zhang and Leng [32], Fu and Wang [6], Zhang et al. [33], Lv et al. [15] and Xu et al. [30] proposed many methods to approximate or incorporate the within-cluster correlation structure. Among these methods, the quadratic inference function (QIF; Qu et al. [21]) has recently received considerable attention due to its flexibility and simplicity [3,4,27]. Moreover, many subjects drop out of the study before its end, so the CD4 cell counts are incomplete. The majority of existing methods naturally accommodate only ignorable dropout, i.e. the missing at random (MAR; [14]) assumption. Unfortunately, developing valid methodologies for statistical analysis with nonignorable dropout is challenging [28]. In this case, the population parameters are, in general, not identifiable [5] if no assumption is imposed, and estimators based on the assumption of ignorable dropout may have large biases. Thus, methods very different from those for an ignorable propensity have to be applied; see, for example, Molenberghs [17], Kim and Yu [10], Wang et al. [29], Shao and Wang [25], among others. To address the identifiability issue, Tang et al. [26], Wang et al. [29], Shao and Wang [25] and Wang et al. [28] found that two key assumptions are needed. The first is that either p(r_i | x_i, y_i) or f(y_i | x_i) has a parametric form [23,28]. The second, imposed by Tang et al. [26], Wang et al. [29], Shao and Wang [25], Fang and Shao [5] and Miao et al. [16], is that x_i can be decomposed into two parts x_i = (u_i, z_i) such that z_i is unrelated to the dropout propensity conditional on (u_i, y_i), i.e. Pr(r_i | x_i, y_i) = Pr(r_i | u_i, y_i). This assumption shows that z_i introduces additional conditional independence and affects the propensity only through its association with the outcome y_i.
More importantly, z_i can be used to create more instrument-based estimating equations, which helps to identify and estimate the unknown parameters in the propensity [28,29].
In this paper, we propose a two-stage procedure to construct a class of improved kth power ER estimators that incorporate the within-subject correlations under an informative working correlation structure and account for nonignorable dropouts, by using the bias-corrected empirical likelihood (BEL) in conjunction with the QIF method. In the first stage, we impose a parametric model on the dropout propensity and use a nonresponse instrument to deal with the identifiability issue. After the dropout propensity is consistently estimated, we construct the bias-corrected GEEs based on inverse propensity weighting (IPW; [24]). In the second stage, we propose a class of kth power ER estimators based on the QIF method, which incorporate the within-subject correlations under informative working correlation structures.
The resulting EL ratio is asymptotically distributed as a weighted sum of chi-square random variables, which can be used to construct the corresponding confidence regions. Our simulation results show that the proposed estimators are more efficient than estimators obtained under a working independence correlation structure, and that the kth power ER outperforms the ordinary ER in some data settings.
The rest of this paper is organized as follows. In Section 2, we first present the parametric dropout propensity and instrument approach, and then construct the proposed kth power ER estimators. The statistical properties are investigated in Section 3. Simulation studies are given in Section 4. Section 5 analyzes the AIDS Clinical Trial Group 193A data for illustration. Some discussions can be found in Section 6. All technical details are provided in the Appendix.

Methodology
Let y_i = (y_i1, y_i2, ..., y_im_i)^T be the m_i-dimensional response vector of the ith subject, which may be subject to dropout, and let r_i = (r_i1, ..., r_im_i)^T be the vector of response indicators, where r_ij = 1 if y_ij is observed and r_ij = 0 if y_ij, ..., y_im_i are not observed, i = 1, ..., n and j = 1, ..., m_i. Here, m_i is called the size of the ith subject. We first investigate balanced data with a common subject size m_i = m; the unbalanced situation will be discussed in Section 6. Consider the marginal model

y_ij = x_ij^T β + ε_ij, (1)

where y_ij is the jth measurement on the ith subject, x_ij is a p × 1 covariate vector, β is a p-vector of unknown regression coefficients and ε_ij is the random error, for i = 1, ..., n and j = 1, ..., m. For a fixed τ ∈ (0, 1), we define the kth power ER model for problem (1) by requiring that

the τth kth power expectile of ε_ij given x_ij equals zero, (2)

which makes sure the error is centered on the kth power expectile. In order to take the nonignorable dropouts into account, define π_ij = Pr(r_ij = 1 | x_i, y_i). Based on the independence working model assumption and the popular IPW approach, an estimator of β can be obtained by minimizing the bias-corrected objective function

Σ_i Σ_j (r_ij / π_ij) Q_τ^k(y_ij − x_ij^T β), (3)

owing to the fact that E{r_ij / π_ij | x_i, y_i} = 1. Motivated by Liang and Zeger [13], the optimal bias-corrected GEEs can be written as

Σ_i X_i^T A_i^{-1/2} R^{-1} A_i^{-1/2} W_i S(y_i − x_i β) = 0, (4)

where X_i = (x_i1, ..., x_im)^T, W_i = diag(r_i1/π_i1, ..., r_im/π_im), S(·) denotes the componentwise derivative of the kth power loss, and R and A_i = A_i(φ_0) are the true correlation matrix and the marginal variance matrix of S(y_i − x_i β), respectively, with φ_0 a true dispersion parameter. Since R^{-1} is unknown in practice, we borrow the matrix expansion idea of QIF [21] to approximate R^{-1}. Specifically, Qu et al. [21] assumed that R^{-1} can be approximated by a linear combination of several basis matrices, i.e. R^{-1} ≈ b_1 B_1 + · · · + b_q B_q,
where B_1, ..., B_q are (m × m)-dimensional symmetric basis matrices depending on the particular choice of R^{-1} and b_1, ..., b_q are unknown coefficients. More details can be found in Qu et al. [21] and Cho and Qu [4].
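To make the estimation target concrete, the following minimal Python sketch (ours, not the authors' code; helper names are hypothetical) evaluates the asymmetric kth power loss and minimizes the IPW-weighted objective under working independence, with observations stacked over all subject-time pairs and the propensities treated as known:

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

def k_power_loss(u, tau, k):
    """Asymmetric kth power loss |tau - I(u < 0)| * |u|^k (k = 1: QR, k = 2: ER)."""
    w = np.where(u < 0, 1.0 - tau, tau)
    return w * np.abs(u) ** k

def k_power_expectile(y, tau, k):
    """Sample tau-th kth power expectile: argmin over theta of the average loss."""
    obj = lambda th: np.mean(k_power_loss(y - th, tau, k))
    return minimize_scalar(obj, bounds=(float(np.min(y)), float(np.max(y))),
                           method="bounded").x

def fit_ipw_k_power(X, y, r, pi, tau=0.5, k=1.5):
    """Minimize sum_ij (r_ij/pi_ij) * Q_tau^k(y_ij - x_ij' beta) over beta.

    Rows of X, y, r, pi are stacked (i, j) pairs; unobserved y entries may hold
    any finite placeholder, since their weight r_ij/pi_ij is zero."""
    def obj(beta):
        u = y - X @ beta
        return np.sum((r / pi) * k_power_loss(u, tau, k))
    return minimize(obj, np.zeros(X.shape[1]), method="BFGS").x
```

For τ = 0.5 and k = 2 the objective reduces to (a multiple of) least squares, and for k = 1 the univariate minimizer is the sample median, illustrating the two special cases noted above.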
To obtain the estimator of β, we first need consistent estimators of π_ij and A_i, denoted as π̂_ij and Â_i, respectively. Using the identifiability approach proposed in Wang et al. [28], we assume that x_i can be decomposed as u_i and z_i, i.e. x_i = (u_i, z_i).
Denote ū_ij = (u_i1, ..., u_ij)^T, z̄_ij = (z_i1, ..., z_ij)^T and ȳ_ij = (y_i1, ..., y_ij)^T as the histories of u_ij, z_ij and y_ij up to and including cycle j, respectively. We further assume that the dropout propensity has the parametric form

Pr(r_ij = 1 | r_i,j−1 = 1, x_i, y_i) = Ψ(α_j + γ_j^T O_ij), (5)

where O_ij = (ū_ij^T, ȳ_ij^T)^T, α_j is an unknown parameter, γ_j is a column vector of unknown parameters, Ψ(·) is a known monotone function taking values in [0, 1], and r_i0 is always defined to be 1. For any j = 1, ..., m, to estimate the unknown parameter vector θ_j = (α_j, γ_j^T)^T, we define instrumental estimating equations based on the instrument history z̄_ij, and then the two-step generalized method of moments (GMM; Hansen [7]) can be used to obtain the corresponding estimator θ̂_j. Under model (2), we have π_ij equal to the product of the one-step propensities in (5) up to and including cycle j. It can be verified that π_ij is a monotone function of the parameters and continuous almost surely. Hence, by the continuous mapping theorem, π̂_ij is a consistent estimator of π_ij, since θ̂_1, ..., θ̂_j are consistent estimators of θ_1, ..., θ_j, respectively [28,29]. Subsequently, a consistent estimator Â_i = A_i(φ̂) can be obtained by computing the sample marginal variance based on the IPW approach with an initial estimate β* based on (3).
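For intuition, here is a cross-sectional Python sketch (a single occasion, a logistic choice of Ψ, and hypothetical function names) of the instrument idea: the moment functions below use y_i only when r_i = 1, because the weight r_i/π_i vanishes otherwise, and in the exactly identified case the GMM step reduces to solving the sample moments:

```python
import numpy as np
from scipy.optimize import minimize

def pi_model(theta, u, y):
    """Illustrative logistic propensity Pr(r=1 | u, y) = Psi(alpha + gu*u + gy*y)."""
    alpha, gu, gy = theta
    return 1.0 / (1.0 + np.exp(-(alpha + gu * u + gy * y)))

def moments(theta, r, u, z, y):
    """Instrumental estimating functions: sample mean of (r/pi - 1) * (1, u, z).

    The instrument z enters the moments but not the propensity; when r_i = 0
    the weight r_i/pi_i is 0, so the unobserved y_i is never actually used."""
    y_filled = np.where(r == 1, y, 0.0)  # finite placeholder where y is missing
    w = np.where(r == 1, 1.0 / pi_model(theta, u, y_filled), 0.0) - 1.0
    h = np.column_stack([np.ones_like(u), u, z])
    return (w[:, None] * h).mean(axis=0)

def gmm_fit(r, u, z, y):
    """Exactly identified case: minimize the squared norm of the sample moments."""
    obj = lambda th: float(np.sum(moments(th, r, u, z, y) ** 2))
    return minimize(obj, np.zeros(3), method="Nelder-Mead",
                    options={"xatol": 1e-9, "fatol": 1e-12, "maxiter": 5000}).x
```

With more instrument functions than parameters, a second step would weight the moments by the inverse of their estimated covariance, as in the two-step GMM of Hansen [7].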
Thus, Equation (4) can be approximated by replacing R^{-1} with each of the basis matrices B_1, ..., B_q in turn; stacking the resulting q weighted estimating functions yields an extended score vector ĝ_i(β) for subject i, whose elements span the linear combination in (4).
With more estimating equations than parameters, we propose to apply the following empirical likelihood for the inference of β under some regularity conditions. Let p_i represent the probability weight allocated to ĝ_i(β), i = 1, ..., n. The empirical log-likelihood ratio function for β is defined as

R̂(β) = −2 max{ Σ_i log(n p_i) : p_i ≥ 0, Σ_i p_i = 1, Σ_i p_i ĝ_i(β) = 0 },

and the maximum EL estimator β̂ can be obtained as β̂ = argmin_β R̂(β). Next, we will study the asymptotic properties of R̂(β_0). Compared to the standard empirical log-likelihood ratio without missing data, the main difference is that the ĝ_i(β_0), i = 1, ..., n, are not independent and identically distributed. Hence, the asymptotic distribution of R̂(β_0) may not be a standard chi-square. In fact, we will show that R̂(β_0) is asymptotically a weighted sum of chi-squares.
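A minimal sketch of the EL computation (ours; it assumes the rows ĝ_i are stacked in a matrix G, and the names are hypothetical) solves the equivalent dual problem max_λ Σ_i log(1 + λ'g_i), using Owen's pseudo-logarithm for numerical stability:

```python
import numpy as np
from scipy.optimize import minimize

def _log_star(x, eps):
    """Owen's pseudo-logarithm: log(x) for x >= eps, a matching quadratic below."""
    return np.where(x >= eps,
                    np.log(np.maximum(x, eps)),
                    np.log(eps) - 1.5 + 2.0 * x / eps - x ** 2 / (2.0 * eps ** 2))

def el_ratio(G):
    """Empirical log-likelihood ratio for testing E g_i = 0.

    G is an (n, d) array whose rows are g_i(beta); the ratio equals
    2 * max_lambda sum_i log(1 + lambda' g_i), and is 0 when the
    sample mean of the rows is exactly zero."""
    n, d = G.shape
    eps = 1.0 / n
    neg = lambda lam: -np.sum(_log_star(1.0 + G @ lam, eps))
    lam = minimize(neg, np.zeros(d), method="BFGS").x
    return 2.0 * np.sum(_log_star(1.0 + G @ lam, eps))
```

Profiling this ratio over β gives the maximum EL estimator; the confidence-region construction in Section 4 only requires evaluating the ratio at candidate β values.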

Theorem 3.2:
Under the assumptions in Theorem 3.1, as n → ∞, we have R̂(β_0) → Σ_{i=1}^{p×q} ρ_i w_i in distribution, where w_i, i = 1, ..., p × q, are independent and follow the standard χ² distribution with one degree of freedom, and the weights ρ_i are the eigenvalues of B^{-1}C.
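Because the limit is a weighted sum of independent χ²_1 variables, its quantiles have no closed form in general; a simple Monte Carlo sketch (ours, with a hypothetical name) approximates the critical value given the eigenvalues ρ_i:

```python
import numpy as np

def weighted_chisq_quantile(rho, prob, n_mc=200000, seed=0):
    """Monte Carlo quantile of sum_i rho_i * w_i with w_i iid chi-square(1)."""
    rng = np.random.default_rng(seed)
    rho = np.asarray(rho, dtype=float)
    draws = rng.chisquare(1, size=(n_mc, rho.size)) @ rho
    return float(np.quantile(draws, prob))
```

When all weights equal 1, the limit is an ordinary χ² with p × q degrees of freedom, which provides a quick sanity check of the simulator.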

Corollary 3.1: Under the assumptions in Theorem 3.1, let r(β) = (p × q)/tr(B^{-1}C) be an adjustment factor. Along the lines of Rao and Scott [22], it is straightforward to show that r(β_0)R̂(β_0) is asymptotically distributed, to first order, as a chi-square with p × q degrees of freedom. In general, r(β) is a function of β, and the quantity r(β_0) is a measure of the information loss due to missing data. An exception occurs when there are no missing data, in which case r(β_0) = 1.
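A plug-in sketch of this adjustment (ours; it assumes estimates of the (p×q)-dimensional matrices B and C are available, and the names are hypothetical):

```python
import numpy as np
from scipy.stats import chi2

def rao_scott_factor(B, C):
    """Adjustment r = d / tr(B^{-1} C), where d = p*q is the number of
    estimating equations (B and C are d x d matrices)."""
    d = B.shape[0]
    return d / np.trace(np.linalg.solve(B, C))

def adjusted_el_test(Rhat, B, C, alpha=0.05):
    """Reject H0 when r * Rhat exceeds the chi-square(d) critical value."""
    d = B.shape[0]
    return rao_scott_factor(B, C) * Rhat > chi2.ppf(1.0 - alpha, d)
```

When B = C (no missing data), the factor is exactly 1 and the test reduces to the usual chi-square calibration.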

Simulations
We consider the model

y_ij = β_0 + β_1 x_ij1 + β_2 x_ij2 + ε_ij, (8)

where x_ij1 ∼ N(1, 1) and x_ij2 ∼ N(0, 1) are independent. The random errors ε_i = (ε_i1, ..., ε_i4)^T are generated under four cases with correlation parameter ρ = 0.7, so that the components of ε_i are strongly correlated. Under model (8), when the errors are homoscedastic, the τth kth power expectile of y_ij given x_ij equals β_0 + μ_τ,k(ε) + β_1 x_ij1 + β_2 x_ij2, where μ_τ,k(ε) is the τth kth power expectile of the error; the heteroscedastic cases are handled analogously. We set the true value (β_0, β_1, β_2) = (0, 1, 2). To apply the proposed method, we use a working propensity model of the form (5). The missing indicators r_i = (r_i1, r_i2, r_i3, r_i4)^T are generated from three choices, M1, M2 and M3. It can be seen that u_ij = x_ij1 and the instrument variable is z_ij = x_ij2. In addition, M1 is an ignorable dropout case, and the unconditional dropout percentages at the four time points are about 22.4%, 43.0%, 60.3% and 74.9%; M2 and M3 are two different nonignorable dropout cases, with percentages of about 27.8%, 48.7%, 63.1%, 75.7% and 24.7%, 45.4%, 61.2%, 74.8%, respectively. The working propensity model is correct under M1 and M2 but misspecified under M3, so that we can assess the robustness of the proposed estimators. We compare the following estimators based on ĝ_i(β): (a) the proposed MNAR estimator based on ĝ_i(β) in (7); (b) the MAR estimator, in which the ignorable dropout propensity is imposed by a parametric linear logistic regression and the parameters are obtained by the two-step GMM estimator; (c) the complete-case (CC) estimator based on ĝ_i(β) with Ŵ_i = diag{r_i1, ..., r_im}; (d) the full-sample (FULL) estimator based on ĝ_i(β) with Ŵ_i = I_m when there are no missing data, which is used as a benchmark.
We also explore three common working correlation choices in MNAR: independence (IND), compound symmetry (CS) and first-order autoregressive (AR(1)). For example, if the working correlation structure is CS, then R^{-1} can be represented as b_1 B_1 + b_2 B_2, where B_1 is an identity matrix and B_2 is a symmetric matrix with 0 on the diagonal and 1 elsewhere. If R^{-1} corresponds to AR(1), additional basis matrices are needed, including B_3, a symmetric matrix with 1 in elements (1, 1) and (m, m) and 0 elsewhere. Among the three common working correlation choices, the AR(1) structure best approximates the true correlation structure.
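A small Python check (ours) of the CS case: the inverse of a compound-symmetry correlation matrix lies exactly in the span of B_1 and B_2, so a least-squares fit of R^{-1} onto the two basis matrices recovers it with zero residual:

```python
import numpy as np

def cs_basis(m):
    """QIF basis for compound symmetry: identity, and the 0-diagonal/1-off matrix."""
    B1 = np.eye(m)
    B2 = np.ones((m, m)) - np.eye(m)
    return B1, B2

def cs_inverse_coefficients(m, rho):
    """Solve R^{-1} = b1*B1 + b2*B2 for a CS correlation R by least squares."""
    R = (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))
    B1, B2 = cs_basis(m)
    A = np.column_stack([B1.ravel(), B2.ravel()])
    coef, _, _, _ = np.linalg.lstsq(A, np.linalg.inv(R).ravel(), rcond=None)
    return coef
```

For AR(1) the representation is only approximate in general, which is why the QIF treats the coefficients b_l as nuisance quantities rather than estimating R directly.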

Absolute bias and mean square error
To evaluate the estimation efficiency of these methods, we compute the simulated absolute bias (AB) and mean square error (MSE) over the replications, where β̂^(s) denotes the sth estimate. Simulation results at τ = 0.5 are presented in Figures 1-4 and other results can be found in the Supplemental Material. In each figure, the four rows correspond to Case 1, Case 2, Case 3 and Case 4, respectively; the three columns correspond to M1, M2 and M3, respectively. A few conclusions can be drawn from these simulation results.
(i) For each k, the proposed QIF estimators based on the kth power ER have small ABs, even when the working dropout propensity model is wrong. (ii) The proposed estimators do not need to estimate their covariance matrices, so they have small variances and smaller MSEs. Among the three common correlation structures, the AR(1) estimator always has the smallest MSEs and the IND estimator has the largest MSEs. This indicates that we can obtain a more efficient estimator by assuming an informative working correlation structure, even one that may not be correct, rather than ignoring the unknown correlation structure. In addition, the MSEs decrease as the sample size n becomes large. (iii) Among the three values of τ, the biases and MSEs at τ = 0.25 are slightly larger than those at τ = 0.5 and 0.75, but the proposed kth power ER estimators still perform better than the CC and MAR estimators. (iv) Across different values of k, the MSEs of the kth power ER estimators vary only slightly, but the best k should be selected by minimizing the MSE for each error distribution. For example, under M2, k = 2 (the ER) works best for the normal error, k = 1.8 works best for the heteroscedastic t error, and k = 1.9 works best for the heteroscedastic normal error, while for the t error we find k = 1.
We also compare the proposed estimator β̂ with the estimator β̃ of Tang et al. [27], where p̃_i(β) is first obtained based on the proposed ĝ_i(β) in (7). The simulated MSE ratios MSE(β̂)/MSE(β̃) are reported in Tables S1-S2 based on 1000 replications. The MSEs of β̂ and β̃ are quite close; however, the computation of the estimator β̃ proposed by Tang et al. [27] is more complex due to its two-step procedure.

Coverage probabilities
Under M2 with τ = 0.5, we further examine the confidence regions of the kth power ER in terms of the coverage probability (CP) at the best-selected k. To construct the confidence region of β, we obtain estimators of C and B by the plug-in method. Based on Corollary 3.1, the 100(1 − α)% confidence region is CI(α) = {β : r(β)R̂(β) ≤ χ²_{1−α}(p × q)}, where r(β) = (p × q)/tr(B̂^{-1}Ĉ). The EL confidence region based on estimator (b) is obtained similarly by CI(α) with the ignorable dropout propensity, and the EL confidence regions based on estimators (c)-(d) are obtained by CI(α) = {β : R̂(β) < χ²_{1−α}(p)} with ĝ_i(β) in (7) replaced by the corresponding estimating equations under the CC and FULL methods, respectively. According to Qin and Lawless [20], only the full-sample method can produce correct EL confidence regions. The simulated CPs are presented in Table 1 and a few conclusions can be drawn. The coverage probabilities based on the proposed estimators are close to the nominal level 0.99 and are quite comparable to those of the FULL method assuming no missing data, whereas the MAR and CC methods exhibit undercoverage.

Application to HIV-CD4 data
In this section, the proposed method is applied to the ACTG 193A data. The study collected CD4 counts every 8 weeks from 316 patients taking a daily regimen containing 600 mg of zidovudine plus 2.25 mg of zalcitabine. The data set can be obtained from http://www.hsph.harvard.edu/fitzmaur/ala/cd4.txt. We consider the first four follow-up times, 8, 16, 24 and 32 weeks, as four time points j = 1, 2, 3, 4, and use the CD4 counts in the four time intervals (4,12], (12,20], (20,28] and (28,36] as the study variable y_ij. In some time intervals, the ith patient may have more than one measurement, so we choose the last record as y_ij at time point j. In addition, we have two continuous covariates, follow-up time and age. It is reasonable to treat the follow-up time as u_ij, which may affect the dropout, and to use the baseline covariate age as the instrument z_ij. The working propensity model (5) with Ψ(·) = [1 + exp(·)]^{-1} is used. Moreover, we choose the value of k at which the kth power ER estimators have the smallest SEs; these estimates and their SEs are reported in Table 2.
From Table 2, we find that all the estimates of β_1 are negative. This is reasonable because patients' CD4 counts keep decreasing as time goes on. At the same time, all the estimates of β_2 are positive. The reason is that patients' ages are closely related to survival times, and younger patients are more likely to drop out of the study. In addition, the estimates are close across the three working correlation structures, while the CC and MAR estimates show large differences when τ = 0.75.

Discussion
The above methods are presented for balanced data, that is, m_i = m. For unbalanced data, the proposed methods in Section 2 can be implemented directly after applying the transformation matrix of Zhou and Qu (2012) to each subject. In addition, we can show that the asymptotic results of Theorems 3.1 and 3.2 still hold for unequal subject sizes by arguments similar to those in the proofs of the theorems. To apply our proposed methods, finding an instrument is important. In practice, an instrument may be obtained by one of the following three routes. (1) When y_ij is obtained under some treatments, baseline covariates measured prior to treatment, such as age group, gender, race and education level, are related to the study variable but unrelated to the propensity conditional on the study variable and the other covariates, and thus can be considered as instruments. (2) A fully observed proxy or a mismeasured version of the missing response may be treated as an instrument. (3) A sensitivity analysis may be conducted by studying some or all different decompositions x_i = (u_i, z_i) [25].
Turning to the optimal k selection, we can calculate V̂ in Theorem 3.1 for each value of k by the plug-in method and then find a favorable value, say k_0, such that the trace of V̂ for the corresponding k_0th power expectile regression is smallest. Alternatively, we may adopt a simple bootstrap approach by resampling the paired observations (x_i, y_i, δ_i), i = 1, ..., n, with replacement to approximate the asymptotic variance: for b = 1, ..., B, we recompute the estimator on the bth bootstrapped sample and use the empirical variance of the B estimates. Based on our simulation results, a rough guideline is as follows: for Gaussian-like error distributions, choose 1.5 ≤ k ≤ 1.9; for heavy-tailed errors, choose 1.1 ≤ k ≤ 1.5; if nothing is known about the errors, k = 1.5 is appealing as a compromise between the robustness of k = 1 and the efficiency of k = 2.
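The bootstrap route for selecting k can be sketched as follows (ours; the estimator callback `fit` is a hypothetical stand-in for any of the proposed estimators, and subjects are resampled with replacement as paired observations):

```python
import numpy as np

def select_k_bootstrap(fit, X, y, r, k_grid, B=200, seed=0):
    """Pick the k whose bootstrapped estimator has the smallest total variance.

    `fit(X, y, r, k)` is any estimator returning a coefficient vector; the
    total bootstrap variance is the trace of the bootstrap covariance."""
    rng = np.random.default_rng(seed)
    n = len(y)
    best_k, best_var = None, np.inf
    for k in k_grid:
        betas = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)  # resample subjects with replacement
            betas.append(fit(X[idx], y[idx], r[idx], k))
        tot_var = np.asarray(betas).var(axis=0).sum()
        if tot_var < best_var:
            best_k, best_var = k, tot_var
    return best_k
```

The same loop applies when V̂ is computed by the plug-in method instead: simply replace the bootstrap variance with the trace of V̂ at each k.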

Proof of Theorem 3.2:
By the proof of Lemma A.1, and applying arguments similar to those used in the proof of (2.14) in Owen [19], we can show that ||λ|| = O_p(n^{-1/2}).