Robust statistical inference for longitudinal data with nonignorable dropouts

In this paper, we propose robust statistical inference and variable selection methods for generalized linear models that accommodate outliers, nonignorable dropouts and within-subject correlations. The purpose of our study is threefold. First, we construct robust and bias-corrected generalized estimating equations (GEEs) by combining Mallows-type weights, Huber's score function and inverse probability weighting to guard against the influence of outliers and account for nonignorable dropouts. The generalized method of moments is then utilized to estimate the parameters in the nonignorable dropout propensity based on sufficient instrumental estimating equations. Second, to incorporate the within-subject correlations under an informative working correlation structure, we borrow the ideas of the quadratic inference function and the hybrid-GEE to obtain improved empirical likelihood procedures. The asymptotic properties of the proposed estimators and their confidence regions are derived. Third, robust variable selection and the associated algorithm are investigated. We evaluate the performance of the proposed estimators through simulations and illustrate our method in an application to HIV-CD4 data.


Introduction
Longitudinal data arise frequently in medicine, population health, biological research, economics, the social sciences and so on. The observations are collected repeatedly from every sampled subject at many time points, and are thus intrinsically correlated. Our study is motivated by the following longitudinal data from the AIDS Clinical Trial Group 193A (ACTG 193A), a study of HIV-AIDS patients with advanced immune suppression. The data set can be accessed at http://www.hsph.harvard.edu/fitzmaur/ala/cd4.txt. For this HIV clinical trial, the CD4 cell count, which decreases as HIV progresses, is of prime interest. After the treatments were applied, the CD4 cell count was scheduled to be collected from each patient every 8 weeks. However, because of adverse events, low-grade toxic reactions, the desire to seek other therapies, death and other reasons, the dropout rates at the first four follow-up times are 31.5%, 42.4%, 55.4% and 65.3%, respectively. Previous clinical experience indicates that a steep decline in the CD4 cell count signals disease progression, and patients with low CD4 cell counts are more likely to drop out from the scheduled study visits than patients with normal CD4. Therefore, nonresponse of the CD4 cell count is likely related to the count itself and is nonignorable [1]. Our purpose is to examine whether the CD4 counts of young patients are more likely to decrease.
For longitudinal data, a major issue is how to take the within-subject correlation structure into account to improve estimation efficiency. A naive and simple way is to use a working independence model [2,3], but it ignores the correlation structure and may lose efficiency when strong correlations exist. In order to incorporate the within-subject correlation, Liang and Zeger [4] proposed generalized estimating equations (GEEs) through a working correlation matrix. However, two challenges related to GEEs have not been fully solved. First, it is difficult to describe and specify the underlying within-subject covariance matrix, which is unknown in practice and subject to a positive-definiteness constraint. Second, the GEE method is closely related to weighted least squares, which is not robust when the observed data contain outliers, so the resulting estimators may be biased and their confidence regions may be greatly lengthened in the direction of the outliers [5]. To address the first issue, Huang et al. [6], Bai et al. [7] and Fu and Wang [8] approximated the covariance matrices with basis functions or plausible dependence structures. Leung et al. [9] proposed a hybrid-GEE method that combines multiple GEEs based on different working correlation matrices. Leng et al. [10], Zhang and Leng [11], Zhang et al. [12] and Lv et al. [13] applied the Cholesky decomposition to obtain the within-subject covariance matrix. Li and Pan [14] and Leng and Zhang [15] constructed estimating functions via the quadratic inference function (QIF). Xu et al. [16] proposed a combined multiple-likelihood estimating procedure based on three dynamic covariance models. Among these methods, the QIF and hybrid-GEE approaches have recently received considerable attention due to their flexibility and simplicity [17,18].
For the second challenge, traditional robust M-estimation for longitudinal data has been discussed, using Mallows-type weights to downweight the effect of leverage points and applying Huber's score function to the Pearson residuals to dampen the effect of outliers; see He et al. [19], Wang et al. [20], Qin and Zhu [21], Qin et al. [22], Fan et al. [23], Zheng et al. [24] and so on.
Moreover, in the ACTG 193A study many patients dropped out prior to the end of the study. It is well known that under nonignorable missing responses, complete-case analysis cannot be trusted. The majority of existing methods naturally accommodate only ignorable dropout, i.e., the missing at random (MAR; [25]) assumption. For the ACTG 193A data, however, the dropout is nonignorable, or missing not at random (MNAR), so that developing valid methodologies for statistical analysis with nonignorable dropout is challenging [26]. In this case, the population parameters are, in general, not identifiable [27] if no assumption is imposed, and estimators based on the assumption of ignorable dropout may have large biases. Thus, methods very different from those for an ignorable propensity have to be applied; see, for example, Molenberghs and Kenward [28], Kim and Yu [29], Wang et al. [30], Shao and Wang [31], Bindele and Zhao [32] and so on. Furthermore, when the dimension of covariates is high, generalized linear models (GLMs) often include many irrelevant covariates. In practice, it is useful to find which covariates are relevant for prediction, both for better interpretation of the model and for better efficiency of the estimators [33].
The existence of nonignorable dropouts and outliers and their impacts on the inference motivate us to find a robust and efficient estimation approach. To the best of our knowledge, this problem has not previously been investigated. Our contributions of this paper are in three aspects.
(1) We impose a parametric model on the dropout propensity and then use a nonresponse instrument, a useful covariate vector that can be excluded from the propensity given the response and other covariates, to deal with the identifiability issue [26,30]. The main technique is to create sufficient estimating equations to estimate the parametric propensity based on a given instrument. Specifically, we apply the generalized method of moments (GMM; [34]) to estimate the propensity. (2) Once the propensity is estimated, we propose improved robust and bias-corrected GEEs in the presence of outliers and nonignorable dropouts through the following three steps. First, to guard against the influence of outliers, Huber's score function and Mallows-type weights are applied to build the robust GEEs. Second, the bias-corrected GEEs are constructed by inverse propensity weighting (IPW; [35]) to account for nonignorable dropouts. Third, in conjunction with the quadratic inference function (QIF; [36]) and hybrid-GEE [9] methods, we construct two classes of improved estimators that can incorporate the within-subject correlations under an informative working correlation structure. Specifically, the proposed QIF procedure is based on the matrix expansion idea, which neither assumes exact knowledge of the within-subject covariance matrix nor estimates its parameters. Alternatively, the hybrid-GEE method combines multiple GEEs based on different working correlation models to improve the estimation efficiency. The resulting two robust EL ratios are asymptotically two different weighted sums of chi-square random variables, which can be used to construct two corresponding confidence regions. (3) For robust variable selection, we propose the penalized robust EL approach by combining the profile robust EL and the smoothly clipped absolute deviation (SCAD; [37]) methods.
We show that the proposed variable selection method can efficiently select significant variables and estimate parameters simultaneously. Furthermore, the resulting estimators based on the QIF and hybrid-GEE methods are consistent and have the oracle property. In addition, an algorithm based on the local quadratic approximation (LQA) is proposed.
The rest of this paper is organized as follows. We propose the improved robust estimators and variable selection approaches based on the QIF and hybrid-GEE methods in Section 2. The asymptotic theories are investigated in Section 3. Simulation studies are given in Section 4, and Section 5 analyzes the ACTG 193A data for illustration. Some discussions can be found in Section 6. All technical details and additional simulation results are provided in the Supplementary Material.

Robust GEEs
To illustrate our proposed methods, we first review the robust GEEs without missing data. Let y_i = (y_i1, y_i2, ..., y_im_i)^T be the m_i-dimensional response vector of the ith subject, which may drop out, and let x_i = (x_i1, ..., x_im_i)^T be an always observed (m_i × p)-dimensional matrix of covariates associated with y_i, i = 1, ..., n and j = 1, ..., m_i. Here, m_i is called the size of the ith subject. We assume that y_ij follows the generalized linear model

g(μ_ij) = x_ij^T β,  Var(y_ij) = φ v(μ_ij),

where β is a p-dimensional parameter vector, g(·) is a known link function, μ_ij = E(y_ij), φ is a dispersion parameter, v(·) is a known variance function and a^T denotes the transpose of a. Without loss of generality, we consider balanced data with common subject size m_i = m; the unbalanced situation is discussed in Section 6. Motivated by Lv et al. [38] and Qin et al. [39], denote μ_i = (μ_i1, ..., μ_im)^T, let A_i(φ) = φ diag{v(μ_i1), ..., v(μ_im)}, and define the core robust estimating equations

Σ_{i=1}^n D_i^T V_i^{-1} W_i {ψ(A_i^{-1/2}(φ)(y_i − μ_i)) − C_i} = 0,

where D_i = ∂μ_i/∂β, V_i is the working covariance matrix of y_i, W_i = diag{w_i1, ..., w_im} collects the Mallows-type weights, ψ(·) is a bounded function applied componentwise to downweight the influence of outliers, and the correction term C_i = E{ψ(A_i^{-1/2}(φ)(y_i − μ_i))} ensures Fisher consistency of the estimator. Specifically, ψ(·) is Huber's score function ψ(x) = min{c, max{−c, x}}, where the tuning constant c balances the efficiency and robustness of the estimator. The weight w_ij is based on the Mahalanobis distance [40], e.g.,

w_ij = min{1, b_0 / ((x_ij − m_x)^T S_x^{-1} (x_ij − m_x))},

where b_0 is the 0.95 quantile of the chi-square distribution with degrees of freedom equal to the dimension of x_ij, and m_x and S_x are robust estimates of the location and scale of x_ij, such as the minimum volume ellipsoid (MVE) estimates of Rousseeuw and van Zomeren [41]; see also Lv et al. [38] and Qin et al. [39]. By using Huber's score function and the weight matrix, the influence of outliers is reduced.
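As a minimal sketch of these two robustification devices (the function names and the default cutoff c = 1.345 are our choices, not the paper's notation; m_x and S_x stand for robust location/scatter estimates such as the MVE):

```python
import numpy as np

# Sketch of Huber's score function and the Mallows-type covariate weight.

def huber_psi(x, c=1.345):
    """Huber's score function psi(x) = min{c, max{-c, x}}, elementwise."""
    return np.clip(x, -c, c)

def mallows_weight(x_ij, m_x, S_x_inv, b0):
    """Mallows-type weight based on the Mahalanobis distance of x_ij.

    m_x and S_x_inv are a robust location estimate and the inverse of a
    robust scatter estimate (e.g. from the MVE); b0 is the 0.95 chi-square
    quantile with df = dim(x_ij).  Ordinary points get weight 1; leverage
    points with squared distance above b0 are downweighted."""
    d2 = float((x_ij - m_x) @ S_x_inv @ (x_ij - m_x))
    return min(1.0, b0 / d2) if d2 > 0 else 1.0
```

The bounded ψ caps large standardized residuals at ±c, while the weight shrinks toward 0 as the covariate vector moves away from the robust center.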

Robust and bias-corrected GEEs
Motivated by the ACTG 193A data, let r_i = (r_i1, r_i2, ..., r_im_i)^T be the vector of response indicators, where r_ij = 1 if y_ij is observed and r_ij = 0 if y_ij, ..., y_im_i are not observed (monotone dropout); our interest is to estimate the unknown β based on (y_i, x_i, r_i) for i = 1, ..., n. To take the nonignorable dropouts into account, define π_ij = Pr(r_ij = 1 | x_i, y_i). We then propose the robust and bias-corrected GEEs

Σ_{i=1}^n D_i^T V_i^{-1} S_i W_i {ψ(A_i^{-1/2}(φ)(y_i − μ_i)) − C_i} = 0,   (1)

where D_i = ∂μ_i/∂β, V_i is the covariance matrix of (y_i − μ_i) and S_i = diag{r_ij/π_ij, j = 1, 2, ..., m}. Moreover, it can be verified that the inverse covariance matrix V_i^{-1} can be decomposed as A_i^{-1/2}(φ) R̄^{-1} A_i^{-1/2}(φ), with R̄ being the (m × m)-dimensional true correlation matrix. Unfortunately, R̄ is unknown in practice and one has to use a working correlation matrix R. Common working correlation structures include the independence structure, compound symmetry (CS) and first-order autoregression (AR(1)); the above equations can then be written with R in place of R̄. To obtain the estimator of β, we first need consistent estimators of π_ij and φ, denoted π̂_ij and φ̂, respectively. Using the identifiability and estimation approaches proposed in Wang et al. [26], we assume that x_ij can be decomposed into two parts, u_ij and z_ij, where u_ij is continuously distributed and z_ij can be continuous, discrete, or mixed. Denote ȳ_ij, ū_ij and z̄_ij as the histories of y_ij, u_ij and z_ij up to and including occasion j, respectively. Following Diggle and Kenward [42], we impose a parametric model on the propensity:

Pr(r_ij = 1 | r_i(j−1) = 1, x_i, y_i) = Λ(θ_j^T (1, ȳ_ij^T, ū_ij^T)^T),   (2)

where θ_j is a vector of unknown parameters, Λ(·) is a known monotone function and r_i0 is always defined to be 1. Given ȳ_ij and ū_ij, the instruments z̄_ij can be excluded from the nonresponse propensity; they are used to create sufficient estimating equations for estimating the propensity [26,30].
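Under a monotone dropout mechanism, the unconditional observation probability π_ij is the product of the occasion-wise conditional probabilities. A minimal sketch, assuming a logistic Λ and an illustrative feature construction (the function names and feature layout are ours, not the paper's):

```python
import numpy as np

# Sketch of a monotone-dropout propensity: at occasion j the conditional
# probability of staying is Lambda(theta_j' f_ij) for a feature vector f_ij
# built from the histories; the unconditional observation probability is
# pi_ij = prod_{k<=j} Lambda(theta_k' f_ik).

def inv_logit(t):
    return 1.0 / (1.0 + np.exp(-t))

def observation_probs(features, thetas):
    """features: list over j of (n, p_j) arrays f_ij; thetas: matching
    coefficient vectors.  Returns the (n, m) array of pi_ij."""
    lam = np.column_stack([inv_logit(F @ th) for F, th in zip(features, thetas)])
    return np.cumprod(lam, axis=1)
```

With all coefficients zero, each conditional probability is 0.5, so π_i1 = 0.5 and π_i2 = 0.25, illustrating how the observation probability decays multiplicatively over occasions.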
Under model (2), the unconditional observation probability is π_ij = Π_{k=1}^{j} Pr(r_ik = 1 | r_i(k−1) = 1, x_i, y_i). Similar to He et al. [19] and Wang et al. [20], a consistent estimator φ̂ of φ can be obtained from the residuals based on an initial estimate β* of β computed by the MM method, a robust and efficient estimation method in the presence of outliers [43,44].
Once we have π̂_ij and φ̂, the unknown A_i(φ) and S_i can be replaced by A_i(φ̂) and Ŝ_i = diag{r_ij/π̂_ij, j = 1, ..., m}, respectively. Then, we define the robust and unbiased estimating equations for β as

Σ_{i=1}^n D_i^T A_i^{-1/2}(φ̂) R^{-1} A_i^{-1/2}(φ̂) Ŝ_i W_i {ψ(A_i^{-1/2}(φ̂)(y_i − μ_i)) − Ĉ_i} = 0.   (4)

Robust EL inference based on QIF and hybrid-GEE
To improve efficiency, we borrow the matrix expansion idea of Qu et al. [36] and propose the quadratic inference function (QIF) by assuming that the inverse of the working correlation matrix, R^{-1}, can be approximated by a linear combination of several basis matrices,

R^{-1} ≈ Σ_{k=1}^{l} b_k B_k,   (5)

where B_k ∈ R^{m×m}, k = 1, 2, ..., l, are known symmetric basis matrices depending on the particular choice of R^{-1}, and b_1, ..., b_l are unknown coefficients. For example, if the working correlation structure is CS, then R^{-1} = b_1 B_1 + b_2 B_2, with B_1 an identity matrix and B_2 a symmetric matrix with 0 on the diagonal and 1 elsewhere. If R corresponds to AR(1), then R^{-1} = b_1 B_1 + b_2 B_2 + b_3 B_3, with B_1 an identity matrix, B_2 a symmetric matrix with 1 on the sub- and super-diagonal entries and 0 elsewhere, and B_3 a matrix with 1 in elements (1, 1) and (m, m) and 0 elsewhere. More details can be found in Qu et al. [36] and Cho and Qu [17]. Therefore, substituting (5) into (4), the resulting equations (6) can be expressed as a linear combination of the elements of

ĝ_i(β) = (ĝ_i1(β)^T, ..., ĝ_il(β)^T)^T,  ĝ_ik(β) = D_i^T A_i^{-1/2}(φ̂) B_k A_i^{-1/2}(φ̂) Ŝ_i W_i {ψ(A_i^{-1/2}(φ̂)(y_i − μ_i)) − Ĉ_i},   (7)

for i = 1, ..., n and k = 1, ..., l.
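The CS and AR(1) basis expansions above can be verified numerically; a small sketch (the function names are ours):

```python
import numpy as np

# Basis matrices for the QIF expansion of R^{-1} and a least-squares check
# that a given inverse correlation matrix lies in their linear span.

def cs_basis(m):
    B1 = np.eye(m)
    B2 = np.ones((m, m)) - np.eye(m)  # 0 on diagonal, 1 elsewhere
    return B1, B2

def ar1_basis(m):
    B1 = np.eye(m)
    B2 = np.eye(m, k=1) + np.eye(m, k=-1)  # 1 on sub/super-diagonals
    B3 = np.zeros((m, m))
    B3[0, 0] = B3[m - 1, m - 1] = 1.0      # 1 at entries (1,1) and (m,m)
    return B1, B2, B3

def in_span(Rinv, basis):
    """Check that Rinv is an exact linear combination of the basis matrices."""
    A = np.column_stack([B.ravel() for B in basis])
    coef, *_ = np.linalg.lstsq(A, Rinv.ravel(), rcond=None)
    return bool(np.allclose(A @ coef, Rinv.ravel()))
```

For AR(1) the inverse is tridiagonal with corrected corners, so three basis matrices represent it exactly; for CS two suffice.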
Notice that the functions ĝ_i(β) do not involve the parameters b_1, ..., b_l, so that these parameters need not be estimated, and the ĝ_i(β) form an overdetermined system of pl estimating functions. Thus, the EL method is used for inference on β under some regularity conditions. Let p_i be the probability weight allocated to ĝ_i(β); the empirical log-likelihood ratio function is defined as

R̂_Q(β) = −2 max{ Σ_{i=1}^n log(np_i) : p_i ≥ 0, Σ_{i=1}^n p_i = 1, Σ_{i=1}^n p_i ĝ_i(β) = 0 },

and the maximum EL estimator based on ĝ_i(β), denoted β̂_Q, is obtained by β̂_Q = arg min_β R̂_Q(β). Alternatively, Liang and Zeger [4] considered R = R(α) with an unknown nuisance parameter α, and Leung et al. [9] proposed a hybrid method that combines multiple GEEs based on different and linearly independent choices of R(α), say R_k(α), k = 1, 2, ..., q, to improve efficiency. This leads to the stacked estimating functions

ĥ_i(β) = (ĥ_i1(β)^T, ..., ĥ_iq(β)^T)^T,  ĥ_ik(β) = D_i^T A_i^{-1/2}(φ̂) R_k^{-1}(α) A_i^{-1/2}(φ̂) Ŝ_i W_i {ψ(A_i^{-1/2}(φ̂)(y_i − μ_i)) − Ĉ_i}.   (8)

In practice, a few popular choices of R(α) may be used, e.g., CS and AR(1), with different values of α specified as in Leung et al. [9] to model the degree of intra-subject correlation. Similarly, the corresponding empirical log-likelihood ratio function R̂_H(β) is defined with ĝ_i(β) replaced by ĥ_i(β), and the maximum EL estimator based on ĥ_i(β), denoted β̂_H, is obtained by β̂_H = arg min_β R̂_H(β).
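The profiled empirical log-likelihood ratio can be computed by solving a low-dimensional convex dual problem for a Lagrange multiplier. A minimal sketch in the spirit of Owen's algorithm (our own implementation; G stacks the per-subject estimating functions as rows):

```python
import numpy as np

# Profiled EL ratio: the optimal weights are p_i = 1 / (n(1 + lam'g_i)),
# where lam solves sum_i g_i / (1 + lam'g_i) = 0, and the ratio equals
# 2 * sum_i log(1 + lam'g_i).  A damped Newton iteration finds lam.

def el_log_ratio(G, n_iter=100, tol=1e-10):
    n, k = G.shape
    lam = np.zeros(k)
    for _ in range(n_iter):
        w = 1.0 + G @ lam
        grad = (G / w[:, None]).sum(axis=0)
        if np.linalg.norm(grad) < tol:
            break
        hess = -(G / w[:, None] ** 2).T @ G     # negative definite
        step = np.linalg.solve(hess, grad)
        t = 1.0
        # damp so that all 1 + lam'g_i stay strictly positive
        while np.any(1.0 + G @ (lam - t * step) <= 1e-12):
            t *= 0.5
        lam = lam - t * step
    return 2.0 * np.sum(np.log1p(G @ lam))
```

For i.i.d. complete data this recovers Owen's ratio; in the present setting the same computation applies to the non-i.i.d. functions, whose limit is a weighted sum of chi-squares rather than a standard chi-square.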

Variable selection
When the dimension of covariates is high, we aim to identify the zero coefficients consistently and estimate the nonzero coefficients efficiently and robustly. Therefore, we use the penalized empirical likelihood (PEL) for variable selection:

β̂ = arg min_β { R̂_η(β) + n Σ_{j=1}^p p_ν(|β_j|) },   (9)

where η can be g or h (i.e., R̂_η is based on ĝ_i or ĥ_i), and p_ν(·) is a penalty function with tuning parameter ν which can shrink small coefficients to zero. In this paper, we use the SCAD penalty, although other penalties could also be entertained. The first derivative of the SCAD penalty is

p'_ν(t) = ν { I(t ≤ ν) + ((aν − t)_+ / ((a − 1)ν)) I(t > ν) },  t ≥ 0,

where a > 2 and ν > 0 are tuning parameters; as suggested by Fan and Li [37], a = 3.7 is recommended. To choose the optimal tuning parameter ν, three information criteria, the BIC of Schwarz [45], the BICC of Wang et al. [46] and the EBIC of Chen and Chen [47], are considered, in which β_ν is the estimate of β based on the QIF or hybrid-GEE method with tuning parameter ν, and df_ν is the number of nonzero coefficients in β_ν.
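The SCAD derivative above, and the local quadratic approximation (LQA) weight used later in the algorithm, can be sketched as follows (function names are ours):

```python
import numpy as np

# SCAD penalty first derivative (Fan and Li), for t = |beta_j| >= 0:
#   p'_nu(t) = nu * { I(t <= nu) + max(a*nu - t, 0)/((a-1)*nu) * I(t > nu) },
# with a = 3.7 as recommended; and the LQA weight p'_nu(|b|)/|b|.

def scad_deriv(t, nu, a=3.7):
    t = np.asarray(t, dtype=float)
    return nu * np.where(t <= nu, 1.0,
                         np.maximum(a * nu - t, 0.0) / ((a - 1.0) * nu))

def lqa_weight(beta, nu, a=3.7, eps=1e-8):
    """LQA replaces p_nu(|b|) locally by a quadratic whose curvature is
    p'_nu(|b0|)/|b0|; eps guards the division for near-zero coefficients."""
    b = np.abs(np.asarray(beta, dtype=float))
    return scad_deriv(b, nu, a) / np.maximum(b, eps)
```

The derivative is constant (ν) for small coefficients, decays linearly on (ν, aν], and vanishes beyond aν, which is what yields nearly unbiased estimates of large coefficients.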

Asymptotic theories
Throughout this section, β_0 and θ_0 denote the true values of β and θ, respectively.

Remark 3.1:
If π_ij is known, we have H_g = H_h = 0, and the asymptotic covariance matrices of β̂_Q and β̂_H simplify accordingly.

Remark 3.2:
By the plug-in method [48], we can obtain estimators Γ̂_g and Σ̂_g, and it can be proved that they are consistent for Γ_g and Σ_g, the matrices appearing in the asymptotic covariance of β̂_Q in Theorem 3.1. Once we have Γ̂_g and Σ̂_g, the variance matrix Γ_g Σ_g Γ_g^T can be estimated by Γ̂_g Σ̂_g Γ̂_g^T. Similarly, we can obtain the variance matrix estimator for β̂_H. Hence, Theorem 3.1 can be used to construct a normal-approximation-based confidence region.
As in Theorems 2 and 4 of Tang et al. [48] and many others, for simplicity we only need to study the asymptotic properties of R̂_Q(β_0) and R̂_H(β_0). Compared to the standard empirical log-likelihood ratio without missing data, the main difference is that the summands ĝ_i(β_0) and ĥ_i(β_0), i = 1, ..., n, are not independent and identically distributed. Hence, the asymptotic distributions of R̂_Q(β_0) and R̂_H(β_0) may not be standard chi-square. Indeed, we will show that R̂_Q(β_0) and R̂_H(β_0) are asymptotically two different weighted sums of chi-square random variables.

Remark 3.3:
In general, when the parameters are over-identified, one can define the empirical log-likelihood ratio functions as in [49]. For nonignorable missing data, however, Wilks's theorem based on the EL no longer holds [48]. Lemmas 1 and 2 in the Supplementary Material reveal why the limiting distributions of R̂_Q(β_0) and R̂_H(β_0) are not standard chi-square. Likewise, it can be shown that W_Q(β_0) and W_H(β_0) do not exhibit the Wilks phenomenon.

Remark 3.4:
When there are no missing data, we have Σ_g = Γ_g and Σ_h = Γ_h, so that both Σ_g^{-1}Γ_g and Σ_h^{-1}Γ_h equal the identity matrix and Wilks's theorem holds. This coincides with the result of Li and Pan [14]. Moreover, Theorem 3.2 can be used to test the hypothesis H_0: β = β_0 and to construct the confidence region for β_0.
Along the lines of Rao and Scott [50], we have the following corollary.

Remark 3.5:
According to Corollary 3.3, a 100(1 − α)% confidence region for β_0 based on R̂_Q can be given by {β : R̂_Q(β) ≤ ĉ_{1−α}}, where ĉ_{1−α} is the (1 − α) quantile of the estimated weighted chi-square limiting distribution. Similarly, we can obtain the confidence region based on R̂_H.
Let A = {j : β_0j ≠ 0} be the set of nonzero components of the true parameter vector β_0, with cardinality d = |A|. Without loss of generality, the parameter vector can be partitioned into two parts, β = (β_(1)^T, β_(2)^T)^T, where β_(1) collects the d nonzero components and β_(2) the remaining zero components. Through Equation (9), we can perform variable selection and also produce robust estimators of the nonzero components.
Denote by β̂_Q^P = (β̂_{Q(1)}^T, β̂_{Q(2)}^T)^T and β̂_H^P = (β̂_{H(1)}^T, β̂_{H(2)}^T)^T the resulting penalized robust estimators based on the QIF and hybrid-GEE methods in (9), respectively. The limiting distributions of their nonzero parts are characterized by the block sub-matrices corresponding to the set A: the (11) blocks, of dimensions dl × dl and dq × dq, and the (12) blocks, of dimensions dl × d and dq × d, of the matrices appearing in Theorem 3.1.

Simulation studies
We conduct simulation studies to examine the finite-sample performance of the following estimators of β. (a) The proposed robust QIF estimators based on ĝ_i(β) in (7) and the robust hybrid-GEE estimators based on ĥ_i(β) in (8), with the nonignorable dropout propensity π_ij(θ_j) in Ŝ_i and the GMM estimator θ̂_j obtained by (11); computational details are given in the Supplementary Material. The corresponding non-robust estimators, obtained from (7) and (8) with T_i(μ_i(β)) = y_i − μ_i, are also computed and denoted QIF_AR(1), QIF_CS, Hybrid_0.4 and Hybrid_0.7, respectively. (b) The robust MAR estimator (denoted R-MAR) based on (5) with the ignorable dropout propensity π_ij(Υ_j) = π_ij(x̄_ij, Υ_j) in Ŝ_i and the true correlation structure. Here, the ignorable dropout propensity Pr(r_ij = 1 | r_i(j−1) = 1, x_ij) is modelled by a parametric linear logistic regression, and the GMM estimator Υ̂_j is obtained similarly by (11). The non-robust MAR estimator (denoted MAR) is also obtained. (c) The robust full-sample estimator (denoted R-FULL) based on (5) with the true correlation structure and S_i = I when there are no missing data, which is used as a gold standard. The non-robust full-sample estimator (denoted FULL) is also obtained.
We generate the errors from multivariate normal and multivariate t distributions with AR(1) and CS correlation structures.
If ξ_j^0 is the true value of ξ_j, it can be verified that E{s_j(y_i, x_i, r_i, ξ_j^0)} = 0. The efficient two-step GMM [34] estimator ξ̂_j in (11) is then obtained by minimizing the corresponding GMM quadratic form. To see the robustness of the proposed estimators, we consider the following three cases: (Case0) no contamination of the observed y_ij and x_ij; (Case1) randomly choosing 10% of the observed y_ij and replacing them by y_ij + 10; (Case2) randomly choosing 2% of the observed x_ij and replacing them by x_ij + 2, and 10% of the observed y_ij by y_ij + 10.
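The contamination schemes can be reproduced as follows (a sketch; the rates follow the text, while the helper name and the use of a seeded generator are ours):

```python
import numpy as np

# Case1 shifts 10% of observed responses by +10; Case2 additionally shifts
# 2% of observed covariate entries by +2; Case0 leaves the data untouched.

def contaminate(y, x, case, rng):
    y, x = y.copy(), x.copy()
    if case in ("Case1", "Case2"):
        idx = rng.choice(y.size, size=int(0.10 * y.size), replace=False)
        y.ravel()[idx] += 10.0
    if case == "Case2":
        idx = rng.choice(x.size, size=int(0.02 * x.size), replace=False)
        x.ravel()[idx] += 2.0
    return y, x
```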
To save space, Tables 1 and 2 only report the simulation results under multivariate t errors with the AR(1) and CS correlation structures; the rest of the results are provided in Tables S1-S4 of the Supplementary Material. A few conclusions can be drawn from the results. (1) When there are no outliers (Case0), the proposed robust QIF and hybrid-GEE estimates are almost the same as the non-robust QIF and hybrid-GEE estimates in terms of biases, SDs, RMSEs and CPs. These results indicate that the proposed robust estimators perform no worse than the non-robust estimators after introducing the weight matrix and Huber's score function, which control the trade-off between robustness and efficiency. In contrast, the MAR estimates based on ignorable dropout have large biases and RMSEs under the nonignorable missing models, which makes their CPs much lower. (2) In the presence of outliers (Case1 and Case2), the non-robust estimates are greatly affected in terms of biases, SDs, RMSEs and CPs, while the proposed estimates based on the robust QIF and hybrid-GEE methods still perform well, indicating that the influence of outliers in responses and/or covariates is reduced. The biases and RMSEs of the MAR estimates become even larger due to the combination of nonignorable missingness and outliers. Between the two contaminated cases, the SDs and RMSEs of all estimates increase as the contamination level increases, but the proposed robust QIF and hybrid-GEE estimates have much smaller magnitudes of increase. (3) Between the two robust QIF estimators, the estimates R-QIF_AR(1) have smaller or comparable RMSEs relative to R-QIF_CS, which indicates that the AR(1) structure best approximates the true correlation structure. Between the two robust hybrid-GEE estimators, the estimates R-Hybrid_0.4 have smaller RMSEs.
These findings are consistent with our theoretical result that the choice of correlation matrix does not affect consistency, but does affect efficiency. (4) When n = 500, the SDs and RMSEs become smaller and the four proposed estimators have similar performance.

Misspecified propensity
In the second simulation, we investigate the performance of the proposed estimators when the propensity is misspecified. Specifically, we consider the same settings as in the first simulation with normal and multivariate t errors, except that the true dropout propensity is generated from a model different from the working model (2). Here, the unconditional dropout percentages at the four time points are about 18%, 35%, 52% and 66% for j = 1, 2, 3, 4, and the same method is used to estimate ξ_j. In this case, the working propensity model is misspecified, which allows us to examine the robustness of the proposed estimators. Simulation results under multivariate t errors with the AR(1) and CS correlation structures are presented in Tables 3 and 4. The results under normal errors with the AR(1) and CS correlation structures are given in Tables S5 and S6 of the Supplementary Material. We reach similar conclusions as in the first simulation: the proposed estimators remain robust and efficient even when the working dropout propensity model is wrong.

Variable selection
In the third simulation, we assess the finite-sample performance of variable selection based on the proposed estimators with the SCAD penalty in terms of model complexity (sparsity) and estimation accuracy.

Table 3. Biases, standard deviations (in parentheses), relative mean squared errors (RMSEs) and coverage probabilities (CPs) in the second simulation under multivariate t errors with the AR(1) correlation structure.
The iteration stops when the solutions converge to a satisfactory precision. For a prespecified value ζ, we set β̂_j = 0 if |β̂_j| < ζ, and we apply the algorithm in Owen [5] to compute λ̂. Table 5 only reports the results for n = 200, 500 and p = 10, 20 with the AR(1) error, based on three information criteria, BIC, BICC and EBIC, for selecting the tuning parameter ν. The rest of the results, under the CS error, are presented in Table S7 of the Supplementary Material. We report the mean squared error (MSE), defined by MSE(β̂) = (β̂ − β)^T(β̂ − β). Columns 'C' and 'IC' are measures of model complexity, with 'C' representing the average number of nonzero coefficients correctly estimated to be nonzero, and 'IC' representing the average number of zero coefficients incorrectly estimated to be nonzero. The simulated results of the oracle model (i.e., the model using the true predictors) are also reported.
From Tables 5 and S7, it can be seen that: (1) the proposed variable selection methods select all three true predictors, and the average numbers of zero coefficients incorrectly estimated to be nonzero are close to zero in most cases; (2) the simulated MSEs of the proposed methods based on the BIC, BICC and EBIC are close to those of the oracle EL, especially for larger sample sizes; (3) in terms of MSEs and ICs, the BIC and BICC have similar performance and the EBIC performs best in most cases. These findings imply that the model selection results based on the proposed approaches are satisfactory and the selected models are very close to the true model; (4) based on these results, in practice we recommend using the EBIC for selecting ν.

Application to HIV-CD4 data
For illustration, we apply the proposed methods to the AIDS Clinical Trial Group 193A data described in Section 1. In this study, the CD4 counts were collected from 316 patients who took the daily regimen containing 600 mg of zidovudine plus 2.25 mg of zalcitabine. We consider the first four follow-up times, 8, 16, 24 and 32 weeks, as four time points j = 1, 2, 3, 4, and use the CD4 counts in the four time intervals (4,12], (12,20], (20,28] and (28,36] as the study variable y_ij for j = 1, 2, 3, 4, because the realised follow-up times might differ slightly from the scheduled ones. A few patients had more than one measurement in a time interval, in which case we use the last record in that interval as y_ij at time point j. Some patients returned to the study after they dropped out; for simplicity, the measurements taken after dropout are not used in the analysis. There are two continuous covariates: age (x_ij1) and follow-up time (x_ij2). We use the working propensity model (10) with Λ(·) = [1 + exp(·)]^{−1}. Because of adverse events, low-grade toxic reactions, declining CD4 cell counts before the study end, the desire to seek other therapies, death and other reasons, patients were prone to drop out. Hence, the follow-up times are treated as covariates u_ij that may affect the dropout, while the ages are always observed and thus are used as instruments z_ij.
The point estimates and their normal-approximation-based confidence intervals, obtained by the bootstrap with replication size 300, are reported in Table 6. It can be seen that: (1) all proposed estimates of β_1 are significantly negative, which is reasonable since the CD4 counts of these patients keep decreasing as time goes on and the trends become worse for those with lower CD4 counts; in contrast, all the confidence intervals of the non-robust estimates include 0, which would indicate that the time effect on the CD4 counts is not significant; (2) the proposed estimates of β_2 are significantly positive, which indicates that patients infected by HIV at younger ages are more likely to have lower CD4 counts; on the contrary, all the confidence intervals of the non-robust estimates include 0 and suggest that the age effect on the CD4 counts is not significant; (3) the estimates based on the MNAR and complete-case (CC) assumptions differ in most cases, so the ignorable dropout assumption is questionable; (4) compared with the non-robust estimates, our proposed estimates have smaller standard errors and shorter confidence intervals. A sensitivity analysis is implemented by choosing the follow-up times as the instrument z_ij and the ages as the covariates u_ij, respectively. The estimates and standard errors are reported in Table S8 of the Supplementary Material. It can be seen that our proposed estimates using z_ij = age have much smaller standard errors, which also indicates that age should be used as the instrument variable.

Table 6. Estimates, standard errors (SE) and confidence intervals (CI) for the HIV-CD4 data based on the proposed methods, under the MNAR and CC assumptions, when the instrument variable is age.

Discussion
In this paper, we first make use of the IPW method and an instrument variable to deal with nonignorable dropout. To achieve robustness against outliers, Huber's score function and Mallows-type weights are then applied to build the robust and bias-corrected GEEs. Based on the QIF and hybrid-GEE methods, we construct two classes of improved estimators that incorporate the within-subject correlations under an informative working correlation structure. The existing two-step GMM and EL approaches are used to obtain the proposed estimators. In addition, we propose robust variable selection through the penalized robust EL, combining the profile robust EL and the SCAD methods. The simulation results show that our proposed estimators perform well even when the working dropout propensity model is misspecified, and that the proposed variable selection methods can efficiently select significant variables and estimate parameters simultaneously.
The above methods are presented for balanced data, that is, m_i = m. In practice, longitudinal data may not be measured with the same subject size and could be unbalanced due to experimental constraints. To adapt the proposed methods to unbalanced data, we apply a transformation matrix to each subject. Similar to Zhou and Qu [51], we create the largest subject with size m, which contains all possible measurement time points, and assume that fully observed subjects contain m observations. We define the m × m_i transformation matrix M_i for the ith subject by removing from the m × m identity matrix the columns that correspond to the ith subject's unmeasured time points. Through the transformation, ĝ_i(β) is replaced by its counterpart computed from the M_i-transformed quantities.
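The construction of M_i can be sketched as follows (assuming 0-based indices for the observed occasions; the helper name is ours):

```python
import numpy as np

# The (m x m_i) transformation matrix M_i keeps from the m x m identity only
# the columns at the subject's observed time points, so that M_i @ v places
# the m_i observed values at their scheduled slots (zeros elsewhere).

def transformation_matrix(m, observed_times):
    """observed_times: sorted 0-based indices of the measured occasions."""
    return np.eye(m)[:, list(observed_times)]
```

For example, a subject measured at occasions 1 and 3 of a 4-occasion schedule gets a 4 × 2 matrix that maps its two observations into the full-length layout.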
Therefore, for unbalanced data with dropout, parameter estimation and variable selection can also be implemented using the proposed methods in Section 2. We can show that the asymptotic results of Theorems 3.1, 3.2 and 3.4 still hold for unequal subject sizes, using arguments similar to the proofs of those theorems. When y_ij is obtained under some treatments, baseline covariates measured prior to the treatments, such as age group, gender, race and education level, are related to the study variable but unrelated to the propensity once the study variable and other covariates are conditioned on, and thus can be considered as instruments. A fully observed proxy or a mismeasured version of the missing response may also be treated as an instrument. A sensitivity analysis can be conducted by studying some or all different decompositions x_ij = (u_ij, z_ij).
Several further problems need to be investigated. First, the dimension of the covariates in the regression models is assumed to be fixed; it would be interesting to extend our approach to a diverging number of parameters [52]. Moreover, the efficiency can be further improved by incorporating population-level information through the empirical likelihood approach. Second, in our simulations the constant c is fixed. Actually, if the errors are normally distributed and there is no contamination, the best choice of c is ∞; if the errors follow a heavy-tailed distribution, c should be chosen to be a small positive value. The choice of the tuning constant c in Huber's score function may therefore impact the estimation efficiency, so we may adopt a Huber score function with a data-dependent tuning constant as in Wang et al. [53]. Once φ̂ is obtained, the data-dependent value of c can be estimated by maximizing

τ̂(c) = {Σ_{i=1}^n Σ_{j=1}^m I(|ê_ij| ≤ c)}² / [nm Σ_{i=1}^n Σ_{j=1}^m {ψ²(ê_ij) I(|ê_ij| ≤ c) + c² I(|ê_ij| > c)}],

where the standardized residuals ê_ij are computed from an initial estimate β*. Note that τ̂(c) is not a continuous function of c; in practice, the optimal c can be determined by evaluating τ̂(c) over the range c ∈ [0, 3] and taking the value of c that makes τ̂(c) largest. Third, we may consider models with heterogeneous effects and apply concave penalty functions to subgroup analysis [54]. Fourth, the efficiency of the proposed GMM estimator ξ̂_j depends on the choice of s_j(y_i, x_i, r_i, ξ_j), which may not be optimal since we only use the first-order moments of the data, i.e., (1, ū_ij^T, z̄_ij^T, ȳ_i(j−1)^T)^T. Other moments or characteristics of the data may provide more information and hence result in more efficient GMM estimators.
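This grid search can be sketched numerically as follows (our own implementation; `e` plays the role of the standardized residuals ê_ij flattened over i and j, and N = nm; note that for |e| ≤ c one has ψ(e) = e, so the denominator sum reduces to Σ min(|e|, c)²):

```python
import numpy as np

# Data-dependent choice of Huber's constant c: maximize
# tau(c) = (sum I(|e| <= c))^2 /
#          (N * sum{ psi_c(e)^2 I(|e| <= c) + c^2 I(|e| > c) })
# over a grid of c values in [0, 3].

def tau(c, e):
    e = np.asarray(e, dtype=float).ravel()
    inside = np.abs(e) <= c
    denom = e.size * np.sum(np.where(inside, e ** 2, c ** 2))
    return (inside.sum() ** 2) / denom if denom > 0 else 0.0

def select_c(e, grid=None):
    grid = np.linspace(0.05, 3.0, 60) if grid is None else grid
    return float(grid[int(np.argmax([tau(c, e) for c in grid]))])
```

With light-tailed residuals the maximizer sits at or near the upper end of the grid (mimicking c → ∞), while a gross outlier pulls the selected c down toward the bulk of the residuals.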