Robust estimation of mean and covariance for longitudinal data with dropouts

In this paper, we study estimation of linear models in the framework of longitudinal data with dropouts. Under the assumptions that random errors follow an elliptical distribution and all the subjects share the same within-subject covariance matrix which does not depend on covariates, we develop a robust method for simultaneous estimation of mean and covariance. The proposed method is robust against outliers, and does not require to model the covariance and missing data process. Theoretical properties of the proposed estimator are established and simulation studies show its good performance. In the end, the proposed method is applied to a real data analysis for illustration.


Introduction
Longitudinal data commonly arise in many research areas such as public health, medicine and social science, which is characterized by the dependence among the observations within the same subject. Different from independent data, within-subject covariance of the response plays an important role in the analysis of longitudinal data. For example, it is well known that ignoring the covariance may lead to loss of estimation efficiency for mean models [20,22,27]. In practice, the covariance itself is of great interest when the study aims to explore the association among the repeated responses. Therefore, simultaneous estimation of the mean and covariance has attracted substantial attention recently. For longitudinal studies with complete data, an incomplete list on this topic includes Pourahmadi [10,11], Ye and Pan [23], Leng et al. [6], Mao et al. [8], Zheng et al. [26] and Yao and Li [22].
Compared with considerable literatures on simultaneous estimation of mean and covariance for longitudinal studies with complete data, the research for incomplete data is limited. Missing data are almost unavoidable due to various reasons during the period of longitudinal study which is designed to collect data for each subject at previously specified time points. Some subjects may miss appointments or drop out before the end of follow-up. Excluding the missing data may produce biased estimates and thus result in invalid statistical inference. Hence, the classical statistical methods developed in the case of complete data may not work and some approaches for dealing with the missing data are proposed such as the inverse probability weighting (IPW) methods and imputation methods. There are large amounts of literatures on estimation of the mean for longitudinal data with missing values. For example, Robins et al. [17], Preisser et al. [12], Little and Rubin [7], Yi and He [24] and Chen and Zhou [2]. However, the research on estimation of both mean and covariance for longitudinal data with missing data is relatively limited. Recently, Garcia et al. [3] studied the estimation of mean and covariance for longitudinal data with response missing at random (MAR) based on simultaneously modeling the mean and covariance under the normality assumption of the random errors.
Generalized estimating equation (GEE) methods or likelihood methods are generally adopted for the estimation of the joint mean and covariance models. However, it is well known that both GEE and likelihood methods are sensitive to the outliers in the data. Recently, Zheng et al. [26] proposed a robust GEE method for the estimation of the joint mean-covariance model in the case of complete data. Qin et al. [14] developed a robust IPW-GEE method for the estimation of the mean for longitudinal data with dropouts. To the best of our knowledge, the work on robust estimation of both mean and covariance has not been discussed.
In this paper, we aim at the development of robust estimation of both mean and covariance for longitudinal data with dropouts. Under assumptions that random errors follow an elliptical distribution and all the subjects share the same within-subject covariance matrix which does not depend on covariates, we propose a robust approach to achieve this goal. The advantages of the proposed method can be summarized as follows. First, the proposed estimator is robust against outliers in the data. Second, unlike the common approach to model the covariance such as Garcia et al. [3], our method does not need to specify the covariance structure or model the parameters of the covariance. Thus, the issue of model misspecification can be avoided. Third, specification of the model for the missing data process is not required. In brief, the proposed estimator for the mean and covariance is robust against outliers and does not require to model the covariance and missing data process.
The paper is organized as follows. We will describe the model and develop the proposed method in Sections 2 and 3, respectively. Theoretical properties of the proposed estimator are presented in Section 4 and simulation studies are conducted in Section 5. In the end, the proposed method is applied to analyze a real data set.

Models
Suppose that a longitudinal study involves n subjects with m observations over time for each subject. A linear regression model is considered to model the relationship between the response Y ij and the covariates X ij as where β 0 is a p-dimensional vector of regression coefficients associated with X ij and i = ( i1 , . . . , im ) T independently follow continuous elliptical distribution with mean 0 m×1 and covariance matrix 0 . The whole class of continuous elliptical distributions is a very rich family, which can be used in many statistical procedures instead of the Gaussian distributions [4]. At time point j, the complete data for the ith subject are {Y ij , X ij }, where Y ij is a response variable and may be missing, and X ij is a p-dimensional vector of covariates. Denote Y i = (Y i1 , . . . , Y im ) T and X i = (X i1 , . . . , X im ) T . In this paper, we focus on robust estimation of both β 0 and 0 in the case of MAR.
Let R ij = 1 if Y ij is observed; R ij = 0, otherwise. We here consider the case of response MAR. The MAR mechanism is commonly assumed in the literatures on missing data since this mechanism is usually in accordance with most practical issues [9]. The response MAR means that the distribution of the missing data indicators R i = (R i1 , . . . , R im ) T conditional on Y i and X i only depends on the observed data. DenoteR ij = {R i1 , . . . , R i,j−1 } as the history of missing data indicators at time point j, and let Y o i denote the observed components of Y i . Then, the MAR mechanism considered here can be clearly expressed as P( Furthermore, we here consider the monotone missing pattern, that is, R ij = 0 indicates R ik = 0 for all k > j. We assume R i1 = 1 for each subject.

Proposed method
In this section, we develop a robust method for simultaneous estimation of the mean and covariance.
According to Cholesky decomposition, there exists a lower triangle matrix C with 1s on the main diagonal such that cov where B is a diagonal matrix. Denote e i = (e i1 , . . . , e im ) T = C i and then we have i1 = e i1 and . Based on the relationship between e i and i , similar to Yao and Li [22], we can rewrite model (1) as where var(e ij ) = b 2 j , j = 1, . . . , m.
Note that the errors ij s are unknown, but they can be predicted byˆ ij = Y ij − X T ijβ f , whereβ f is obtained by solving the following robust estimating equation based on the first observation of each subject which are independent.
where W i1 = w(x i1 ) with w(·) being a function of covariates which is used to downweight the impact of leverage points, ψ is a bounded score function which is used to reduce the effects of outliers in the response and is chosen to be the Huber function ψ(x) = min{c, max{−c, x}}.
The tuning parameter c is typically selected to balance estimation efficiency and robustness. We refer to Wang et al. [21] for details on this selection. Here similar to He et al. [5], we choose c = 1.5. Moreover, following Sinha [19], the weighting function w(x i1 ) is chosen to be min which is taken to be 1 in the latter simulation studies and real data analysis, where b 0 is chosen as the 95th percentile of Chi-square distribution with degrees of freedom equal to the dimension of x i1 , and m x and S x are some robust estimates of location and scale of x i1 , such as minimum volume ellipsoid estimates of Rousseeuw and van Zomeren [18]. We make use of the independence among the first complete observation to obtain a robust and consistent estimateβ f of regression parameters β 0 , thus ij can be properly estimated and the consistency of the following proposed robust estimator can be achieved.

Remark 2
Since estimating Equation (4) only uses the first observation of each subject, values of each covariate are required to be different at the first observation to ensure that the regression parameters β in Equation (4) are estimable.
Then, replacing ij in Equation (3) withˆ ij , we have Let and Equation (5), we can write the model in the matrix format: Then making use of all the observed data and utilizing the common idea for construction of robust estimating equation similar to Equation (4), the estimator of θ 0 ,θ = (β T ,ς T ) T can be obtained by solving the robust estimating equation: where . . , w im } are weighting matrix used to downweight the impact of outliers in the combined covariatesD i with the same weighting function as used in estimating Equation (4). In estimating Equation (7), the function ψ is applied to each component of the vector V Remark 3 Under the missing mechanism of MAR and the assumption that i follow elliptical distribution, it can be shown that Under some regularity conditions assumed in Section 4, the solution of estimating Equation (7) can be shown to be consistent and asymptotic normality.
With the estimateς, the variances of random variables e ij can be robustly estimated using the median absolute deviation (MAD) based on the predicted value of e ij . Then, we can obtain the estimate of the covariance matrix which is denoted byˆ 0 .
We summarize the estimation procedure in the following.
Step 1. Based on the first observation of each subject, solving the robust estimating equation (4) to obtain a robust estimate of β 0 ,β f . And then obtain the predicted valueˆ Step 2. Based on model (6), applying the robust estimating equation (7) to obtain the proposed robust estimator of θ 0 = (β T 0 , ς T 0 ) T ,θ = (β T ,ς T ) T . Furthermore, we can obtain the predicted valuesê ij = Y ij −D T ijθ and obtain the robust estimates of the variance b 2 j through the median absolute deviation.
Step 3. With the estimateς andd 2 j and according to the Cholesky decomposition, the estimate of covariance matrixˆ 0 is obtained.
Through the above procedure, the proposed robust estimator of both mean and covariance can be achieved.
From the above estimation procedure, we summarize some features of the proposed method as follows. First, robust techniques such as the robust estimating equations are used to reduce the impact of outliers in the data. Second, based on the rewritten regression model (3), the proposed method can achieve consistent estimates under MAR without specification of the model for the missing data process. Third, through the Cholesky decomposition, the common regression model (1) can be rewritten as (3) which incorporating the covariance parameters as a part of regression parameters. Then, directly applying the robust estimating equation for the rewritten model (3) without further modeling the covariance parameters, simultaneous estimation of both mean and covariance is realized.

Asymptotic properties
In this section, we will show the large-sample properties of the proposed estimator under some regularity conditions. Denote To establish the asymptotic normality of the proposed robust estimatorθ, some regularity conditions are assumed.
converges uniformly to a non-stochastic function E(β) which has a unique zero at β 0 ∈ ϒ. (C.2) There is a neighbor G(θ 0 ) of θ on which with probability one, U n (θ ) is continuously differentiable. (C.3) With probability one, (1/n)P n → P and (1/n)Q n → Q for some positive definite matri- The parametric space is compact, for any θ in . With probability one, (1/n)U n (θ ) converges uniformly to a non-stochastic function U(θ ) which has a unique zero at θ 0 ∈ . (C.5) With probability one, (1/n)S n → S and (1/n)K n → K for some positive definite Under Conditions (C.1)-(C.3), it can be shown thatβ f has asymptotic normality which is presented in Lemma 1 in the supplementary online material. Further combining Conditions (C.4) and (C.5), the asymptotic normality of the proposed estimatorθ can be achieved. Moreover, Condition (C.2) that U n (θ ) is continuously differentiable is also assumed in Sinha [19]. As indicated in Sinha [19], the derivative of the Huber function ψ(·) is not continuous at the points ±c, and as a remedy, it is possible to smooth the function ψ(·) to get a continuous derivative. However, the use of such smoothed ψ(·) would not bring dramatic change for the parameter estimation. A brief proof is shown in the supplementary online material. Note that to derive the asymptotic covariance matrix , the variation induced by the estimatorβ f has to be considered, which is reflected in the matrix K n in Condition (C.5).
Following Qin and Zhu [13], K and S can be consistently estimated by 1/n timeŝ Thus, statistical inference can be made on θ 0 .

Simulation studies
In this section, simulation studies are conducted to investigate the performance of the proposed estimator. We compare the proposed method with robust inverse probability weighting GEE (IPW-GEE) method [14] and robust complete-case GEE (CC-GEE) method [5] as well as their non-robust versions. The robust methods reduce to their corresponding non-robust versions when ψ(x) = x and w(x) = 1. Furthermore, IPW quadratic inference function (QIF) [15] estimator which incorporates the IPW into the scores for dealing with missing data and the estimatorβ f which is based on the first observations of the subjects are also calculated for comparison. Under each simulation setting, 500 replications are conducted.
The regression model considered in the simulation study is where X 1,ij is generated from standard normal distribution and X 2,ij follows binary distribution with success probability 0.6, β = (β 0 , β 1 , β 2 ) T is taken to be (1, 1, 1) T , i = ( i1 , . . . , im ) T follows a multivariate normal distribution with mean zero and covariance matrix which is taken to be exchangeable (EX), one-order autoregressive (AR1) and unstructured (UN) structures, respectively. And in the case of EX and AR1, the correlation parameter is taken to be 0.6. In the UN case, the covariance matrix is taken to be The number of subjects n is taken to be 100 and the number of the observations within each subject m is taken to be 6.
. A common logistic model is used to model the dropout process where γ = (γ 0 , γ 1 ) T is taken to be (1, 1) T and (0.1, 1) T , which will result in about 25% and 40% missing rates, respectively. We calculate the bias, standard error (SE) and mean square error (MSE) for the estimators of parameters in the mean. To investigate the performance of the proposed method in estimation of the covariance, we calculate the entropy loss E ( ,ˆ ) = trace( −1ˆ ) − log | −1ˆ | − m and quadratic loss Q ( ,ˆ ) = trace( −1ˆ − I) 2 , which can reflect the accuracy in estimation of the covariance matrix where is the true covariance matrix andˆ is its estimate. Whenˆ = , the loss is zero. Smaller loss indicates more estimation accuracy. In our simulation studies, the exchangeable structure is taken to be the working correlation structure for the IPW-GEE and CC-GEE methods.
To study the robustness of the proposed estimator, we consider two ways to generate outliers.
C1. Six completely observed points are randomly chosen. Their covariates X ij are replaced by X ij − 2 and their responses Y ij are replaced by Y ij + 2. C2. Twelve completely observed points are randomly chosen. The way to create outliers is the same as that in C1. Tables 1 and 2 summarize the simulation results. The simulation results of the bias and MSE under AR1 correlation structure are similar to those under the EX structure, and thus are omitted for brevity. It can be observed from Figures 1 to 4 that in the case of no outliers, the proposed robust method is able to provide consistent estimates for the mean and perform almost equally well with its non-robust version. Furthermore, it generally shows better performance than other estimators in comparison with smaller MSEs. The IPW-GEE and IPW-QIF methods can provide consistent estimates but with larger SEs induced by the IPW. And it is not surprising that the CC-GEE method produces inconsistent estimates due to missing data. From Table 1, the proposed robust estimator for the covariance seems to provide less accurate estimation. This is a premium which one needs to pay for using the robust method when there is, in fact, no outliers in the data. However, in the case of outliers, it can be found that the gain from the use of robust method is huge. The non-robust estimators are heavily influenced by the outliers and all the robust estimators provide smaller biases than those by the non-robust estimators. Specifically, the proposed robust estimator for the mean outperforms the other estimators with smallest MSE, and the proposed robust estimator for the covariance show its robustness with much more estimation accuracy than its non-robust version. In summary, the proposed robust estimator shows its strength in dealing with both robustness and missingness as well as its advantage over the IPW-GEE, IPW-QIF and CC-GEE estimators.       Table 2 presents the average estimated SEs based on 500 replications. It can be found that the empirical SEs are close to the estimated ones which indicates that the large-sample estimate of the variance for the proposed estimator is quite acceptable.

Application to a real data
For illustration, the proposed method is applied to a real data set. The data set is a subsample of the database from the National longitudinal Survey conducted by the US Department of Labor [1]. The sample consists of 716 women who were all interviewed for five times at scheduled years and have complete observation values.
It is of interests to explore the factors which significantly affect the wage of the women. The response variable is log transformation of wage. The explanatory variables include Education (EDU), total labor force experience (EXP), tenure in current job (TENU) and dummy variables SOUTH (1, if the woman is from the South; 0, otherwise), BLACK (1, if the woman is black; 0, otherwise) and UNION (1, if the woman is the union member; 0, otherwise).
We fit the data through a linear regression model and first analyze the complete data set using the robust GEE method [5]. The results are summarized in Table 3. Based on the complete data, all the covariate effects estimated by the robust GEE method are significant at level 0.05. Furthermore, EDU, UNION, EXP and TENU show positive effects on the wage, while SOUTH and BLACK present negative effects. But the estimated TENU effect by the non-robust method is insignificant, which indicates the influence induced by some outliers in the data. To further explore possible outliers in the data set, we calculate the weights wherer ij = (Y ij − X ijβ )/φ) 1/2 withβ is the robust GEE estimator by He et al. [5] andφ is a consistent estimator of the variance of random errors, w(·) and ψ are defined in Section 3.
Small weights indicate that the corresponding observations are possible outliers. We investigate 10 observations with smallest weights and find that these observations seems to seriously deviate from the previous findings on the effects of the covariates. For example, observation 1604 has extremely low EDU value 6 which is below 0.5% quantile of the EDU values in the sample. Its EXP and TENU values are close to their medians in the sample, and its corresponding subject is black and from the South. According to the previously analysis, the observation with such covariate values is expected to have low wage. But observation 1604 has extremely high wage value 4.0 which is beyond 99.5% quantile of the wage values in the sample.
To illustrate the application of the proposed method for missing data, we artificially produce dropouts for the subjects based on the following dropout model:   where γ = (γ 0 , γ 1 , γ 2 ) T is taken to be (1, −0.8, 0.8) T , which will result in about 20% missing rate. We analyze the data with dropouts using the proposed method, inverse probability weighting GEE (IPW-GEE), complete-case GEE (CC-GEE) and IPW-QIF methods, respectively. Table 3 also presents the results of the estimates for the regression coefficient based on the incomplete data. The proposed method provides consistent significant results with the robust GEE method based on the complete data. Moreover, there exists obvious numerical difference between the proposed robust method and its non-robust version although they both get consistent significant results. However, the other methods miss at least one significant covariate. Like the findings in the simulation studies, the IPW method is sometimes instable and may result in larger standard errors due to the weighting and the CC method which ignore the dropouts generally produce inconsistent estimates. Table 4 shows the estimate for the correlation matrix based on the estimated covariance matrix. The estimated correlation matrix by the proposed robust method indicates higher correlation among the five observations within each women.
In summary, the proposed method is helpful to provide more reliable and accurate results due to its strength in dealing with outliers and missing data, and enables us to gain more useful information inherited in the data for its ability to achieve estimation of both mean and covariance.

Discussions and future work
In this paper, we develop a robust approach for estimation of both mean and covariance components for longitudinal data with dropouts. The proposed method is against outliers in the data, and is able to achieve consistent estimates of the parameters of interest in the case of dropouts without need to specify the model for the missing data process. To implement the proposed method, the values of each covariate at the first observation are required to be different. To handle this issue, we can use the estimate proposed in Qu et al. [16] to get an initial estimate of β 0 instead of β f . In this way, the final estimate is still consistent and is expected to possess some robustness. However, the method proposed by Qu et al. [16] is not robust which may influence the prediction of ij . Therefore, the performance of this alterative method needs to be assessed and can be investigated in the future study.
We here assume that all the subjects share the same covariance matrix which can be generalized to more general cases, where the covariance matrix depends on some covariates using the same idea through modeling the relationship between the covariance parameters and some covariates. Further work and additional efforts are necessary to study this topic.
The breakdown properties of the proposed robust estimator is yet another interesting topic. Zhang and Li [25] studied the breakdown properties of location M-estimators for independent data. However, it may be more complicated here since we study the robust estimation of the mean and covariance simultaneously for longitudinal data with missing values. This requires further careful investigation as a future study.