Regression Analysis of Additive Hazards Model With Latent Variables

We propose an additive hazards model with latent variables to investigate the observed and latent risk factors of the failure time of interest. Each latent risk factor is characterized by correlated observed variables through a confirmatory factor analysis model. We develop a hybrid procedure that combines the expectation–maximization (EM) algorithm and the borrow-strength estimation approach to estimate the model parameters. We establish the consistency and asymptotic normality of the parameter estimators. Various nice features, including finite sample performance of the proposed methodology, are demonstrated by simulation studies. Our model is applied to a study concerning the risk factors of chronic kidney disease for Type 2 diabetic patients. Supplementary materials for this article are available online.


INTRODUCTION
The proportional and the additive hazards (AH) models are two popular frameworks in biomedical studies for investigating the association between risk factors and disease occurrence or death (Cox 1972;Cox and Oakes 1984;Aalen 1989;Huffer and McKeague 1991;Lin and Ying 1994). The AH model provides a characterization of the effects of risk factors different from the proportional hazards (PH) model, and has remarkable features that are not found in the latter model. In particular, the AH model pertains to the risk difference, which is especially relevant and informative in epidemiological and clinical studies. Similar to the PH model, the AH model also has sound biological and empirical bases (Breslow and Day 1987).
There are many situations in medical, behavioral, and psychological research settings when outcomes or potential risk factors cannot be measured by a single observed variable. Instead, they should be characterized by several observed variables from different perspectives. Examples include glycemia measured by glycated hemoglobin (HbA1c) and fasting plasma glucose (FPG); lipid characterized by total cholesterol (TC), high-density lipoprotein cholesterol (HDL-C), and triglycerides (TG); and behavioral problem assessed by antisocial, anxious, dependent, headstrong, and hyperactive behavior. The conventional PH and AH models manage latent factors by incorporating one or more of their surrogates into the regression analysis. However, problems may exist with the separate inclusion of partial surrogates of the latent factors. First, important attributes of Deng Pan is Lecturer, School of Mathematics and Statistics, Huazhong University of Science and Technology, Wuhan, China (E-mail: pand.whu@gmail.com). Haijin He (E-mail: s115500892@sta.cuhk.edu.hk) is Post-doctoral Fellow, and Xinyuan Song (E-mail: xysong@sta.cuhk.edu.hk) is Associate Professor, Department of Statistics, The Chinese University of Hong Kong, Hong Kong, China. Liuquan Sun is Professor, Institute of Applied Mathematics, Chinese Academy of Science (CAS), Beijing 100190, China (E-mail: slq@amt.ac.cn). This research was supported by GRF 404711 from the Research Grant Council of the Hong Kong Special Administration Region, the National Natural Science Foundation of China Grants (no. 11471277, 11231010, 11171330, and 11021161), Key Laboratory of RCSDS, CAS (no. 2008DP173182), and BCMIIS. The authors are thankful to the editor, the associate editor, two anonymous reviewers, and Professor Sik-Yum Lee for their valuable comments and suggestions that improved the article substantially, to Dr. Feng C. for pointing out an error in the proof and providing methods to correct it, and to Professor Juliana Chan from the Department of Medicine and Therapeutics, and Prince of Wales Hospital of the Chinese University of Hong Kong, for providing the data in the real example. latent factors may be lost because of incomplete measurement, leading to bias or misleading results. For instance, systolic blood pressure (SBP) and diastolic blood pressure (DBP) are usually used together to characterize blood pressure. Although highly correlated, SBP and DBP reflect blood pressure from different angles and do not replace each other. If one only uses SBP (or DBP) to measure blood pressure, then the insufficient information cannot reveal the danger of too high (low) DBP (or SBP), leading to incorrect conclusion for people with normal SBP (or DBP) but abnormal DBP (or SBP). Second, multiple surrogates of a latent factor are highly correlated because they share certain similarity elicited by this latent factor. Such high correlation may cause the multicollinearity problem, resulting in incorrect estimation or divergence of the computer program. The latter example (see Table 4) shows that incorporating the individual highly correlated observed variables WAIST, BMI, SBP, DBP, HbA1c, FPG, TC, HDL-C, and TG in a standard analysis causes multicollinearity and produces confounding results. Finally, interpreting the effect of each surrogate separately may be tedious and incomprehensible. For instance, when examining the effect of lipid control on the development of chronic kidney disease (CKD), a standard analysis only tells how the individual observed variables TC, HDL-C, and TG influence the hazards of CKD but never addresses the major concern on the overall impact of lipid on CKD. Further, the diverse effects caused by the multicollinearity (Table 4) makes the interpretation even harder. By contrast, the proposed joint analysis groups TC, HDL-C, and TG together into a latent variable "lipid" and provides scientific evidence for the overall effect of this latent variable and thus makes the interpretation conceptually simple and comprehensible. Simultaneously, the dimension of covariates is also reduced substantially.
A number of studies have assessed the interdependence, causations, and correlations among multiple observed variables and latent factors. The commonly used techniques include factor analysis (Lawley and Maxwell 1971) and latent variable models (Bollen 1989;Jöreskog and Sörbom 1996;Lee 2007;Song and Lee 2012). The latent variable analysis originated in the field of psychology to relate various mental test results to latent traits, such as intelligence. Additionally, it has recently been used in medical studies to investigate the relationships between covariates (e.g., age) and a medical trait (e.g., overall treatment effect), or between the endpoint of interest (e.g., kidney failure) and various potential latent risk factors (e.g., glycemia and lipid). A typical modeling framework for such latent variable analysis consists of two models. The first model is the so-called confirmatory factor analysis (CFA) model (Bollen 1989, chap. 7;Shi and Lee 2000;Lee 2007, chap. 2) for characterizing latent factors through multiple correlated observed variables, whereas the other one is a regression type model for relating observed and latent risk factors to quantities of interest. Sammel and Ryan (1996) measured the overall severity of birth defects using multiple adverse effects and then regressed it on the exposure to anticonvulsants when analyzing the effects of anticonvulsant medication during pregnancy. Roy andLin (2000, 2002) summarized the effectiveness of treatment practices using three longitudinal outcomes and related the overall treatment effect to covariates through a linear mixed model when examining the effectiveness of methadone treatment practice in reducing illicit drug use.
In this study, we propose a joint model to analyze the survival data with latent variables. The proposed model consists of two components: a CFA model for characterizing the latent risk factors, and an AH model to relate the observed and latent risk factors to the hazard function of interest. To analyze such kinds of joint models with latent variables, published reports (e.g., Sammel and Ryan 1996;Roy and Lin 2002) have mainly used the expectation-maximization (EM) algorithm, in which the latent variables and/or random effects were treated as missing data, and the maximum likelihood estimates (MLEs) of the model parameters were obtained by maximizing the expectation of the complete-data log-likelihood function. In our proposed model, however, the AH model is nonlinear with an unspecified baseline hazard function, which makes the complete-data loglikelihood function highly intractable. Directly applying the EM algorithm to analyze the proposed model is challenging. Thus, to address such difficulty, we develop a borrow-strength estimation procedure in this study. The procedure is a two-stage approach. At the first stage, latent variables are characterized by the CFA model and estimated using the EM algorithm. At the second stage, the AH model with the estimated latent variables is analyzed using the corrected pseudo-score method (Carroll, Ruppert, and Stefanski 1995;Kulich and Lin 2000;Song and Huang 2006). To the best of our knowledge, no study has analyzed the proposed joint survival model with latent variables.
The present research is motivated by a study based on the Hong Kong Diabetes Registry, which was established in 1995 as part of a continuous quality improvement program at the Prince of Wales Hospital in Hong Kong. A 10-year prospective cohort of 3586 Chinese Type 2 diabetic patients who may have experienced CKD, one of the most common diabetic complications, was analyzed. A primary interest of this study is to investigate the potential risk factors of CKD to prevent diabetic complications and improve the quality of life for diabetic patients. However, some risk factors, such as obesity, blood pressure, glycemia, and lipid, are latent traits that cannot be measured by a single observed variable. Thus, how these latent risk factors affect the overall development of CKD is of great interest in medical research. To address this problem, we propose the current joint model with latent variables. The empirical analysis (Section 5) shows that our model provides important insights into the overall effects of latent risk factors. Such kinds of overall effects cannot be revealed by conventional AH models.
The rest of the article is organized as follows. In Section 2, we describe the proposed joint model and present the borrowstrength estimation procedure for regression parameters of interest. The asymptotic properties of the proposed estimators are established in Section 3. Section 4 presents simulation studies to evaluate the empirical performance of the proposed methods. An application to the CKD data is reported in Section 5. Section 6 concludes the article with discussion. Proofs are provided in the Appendix and other details are given in supplementary material.

Joint Models
We propose a joint model that involves two major components. The first component is the CFA model and the second component is the AH model. For i = 1, . . . , n, let V i be a p × 1 vector of the observed variables and ξ i be a q × 1 vector of latent variables (factors), where p > q. The CFA model for characterizing latent variables through multiple observed variables is defined as where B is a p × q matrix of factor loadings, ξ i ∼ N (0, ), and i is a p × 1 vector of random errors independent of ξ i and distributed as N (0, ) with a diagonal . The elements of B reflect the associations between the observed variables and their corresponding latent variables. In a CFA model, any element of B can be fixed at preassigned values. This setup allows great flexibility in selecting an appropriate structure of B to achieve a clear interpretation of the latent variables. In substantive research, we usually already have a clear idea of the number of latent variables and the structure of the factor loading matrix. For example, in the CKD study, obesity, blood pressure, glycemia, and lipid are important medical traits for assessing the health profile of diabetic patients. Our main objective is to examine their effects on the risk of CKD. Thus, the number of latent variables in this study is clearly four. Experts in diabetes and the existing medical literature (e.g., Song et al. 2009;Wang et al. 2014) suggested appropriate observed variables for these latent variables. Thus, the number of latent variables and the structure of B can be naturally decided based on the study objectives, the meaning of the observed variables suggested by subject experts, and/or the existing literature. Moreover, an exploratory factor analysis (EFA) can be used to cross validate the decision. Well-known software in structural equation modeling, such as LISREL (Jöreskog and Sörbom 1996), EQS (Bentler 2002), and Mplus (Muthén andMuthén 1998-2007), can be used to conduct the EFA. In the analysis of CKD data, we conducted an EFA using LISREL to verify the decision on the number of factors and the structure of B (see details in Section 5).
To investigate the potential risk factors of a failure time of interest, the latent variables characterized by the CFA model, along with some observed covariates, are further related to the failure time through an AH model with latent variables. Let T i be the failure time of interest, Z i be a s × 1 vector of covariates, and C i be the censoring time assumed to be conditionally independent of T i given Z i and ξ i . Let X i = min(T i , C i ) be the observed time and i = I (T i ≤ C i ) be the failure indicator, where I (·) is the indicator function. The AH model specifies that, given Z i and ξ i , the hazard function of T i takes the form where λ 0 (t) is an unspecified baseline hazard function, and β and γ are the s × 1 and q × 1 vectors of unknown regression parameters. Given that ξ i plays a role of risk factors in model (2), the observed variables in model (1) are actually multiple risk factors that are related to a latent variable (another risk factor). Thus, the proposed joint model is different from a typical one in joint modeling literature, in which several post-exposure outcome variables of interest exist. Song and Huang (2006) developed a joint analysis to accommodate longitudinal covariates measured with error in an AH model. Their first model is essentially a growth curve model with measurement error. It evaluates a time-dependent covariate through a single surrogate with measurement error at intermittent occasions. By contrast, our first model emphasizes the use of "latent variables" to assess latent traits that should be measured by several (usually highly correlated) observed variables. For example, the latent variable "blood pressure" is measured by observed variables SBP and DBP, and the latent variable "lipid" is measured by observed variables TC, HDL-C, and TG. This kind of latent variables contributes an integral part in structural equation models and has been widely used in the medical, behavioral, and social sciences (see Bollen 1989;Lee 2007;Song and Lee 2012, and references therein). Our first model is a CFA model that is used to form the latent variables in ξ i via grouping appropriate observed variables through a factor loading matrix B. In the CFA analysis, the unknown parameters in B, which can be regarded as data-driven weights to reflect how the observed variables contribute to the characterization of a latent variable, are estimated along with the correlations of the latent variables and the error variances. Clearly, Song and Huang's model and other measurement error models do not involve such kind of latent variables and the factor loading matrix. Thus, the CFA model in (1) is quite different from the growth curve measurement error model of Song and Huang (2006). In formulating the AH model, Song and Huang (2006) directly incorporated measurement error covariates into the model. By contrast, we propose the use of latent variables obtained from the CFA, rather than a large number of covariates, to model the hazard function of interest.
While our model and that of Song and Huang (2006) are different in terms of scientific goal, modeling of latent variables, and scope of applications, they share certain similarities. First, the method proposed by Song and Huang (2006) assumed mixed effects model for time-dependent covariates. Time-independent covariates with repeated measurements can be viewed as a special case with the slope equal to 0. Thus, when the B matrix is block diagonal with each block being all 1's vector, the CFA model can be viewed similarly as a measurement error model. Second, given that both methods involve variables measured with errors, they encounter similar difficulty induced by unob-servable variables and bear the same spirit of using corrected pseudo-score approach in statistical inference.
The proposed model subsumes several statistical models as its special cases. For instance, if p = q = 0, then the proposed model reduces to a simple AH model with only fixed effects. If p = q and B = I (identity matrix), then the proposed model reduces to an AH model with covariates measured with error. If one only considers model (2) and fixes the elements of γ at 1, then the proposed model reduces to a frailty AH model with both fixed and random effects. Compared with the conventional AH model or frailty AH models Cai and Zeng 2011), the proposed joint model has several appealing features. First, the analytic power of the joint model is greatly enhanced using the information from multiple indicators. Second, the joint model provides statistical evidence and attractive interpretation for the effects of latent traits that are well known but hard to measure directly. The frailty terms in conventional AH models only address the dependence among clustered or longitudinal observations, whereas the latent variables in the proposed model represent medical traits with specific meanings. These latent variables not only address the possible heterogeneity in the data but also provide insights into their impacts on the failure time of interest. Third, the multicollinearity problem in the conventional regression analysis can be eliminated in the proposed joint model because the highly correlated variables have been grouped into relatively independent latent factors through the CFA model. Finally, given that the number of latent variables, q, is often much smaller than the number of observed variables, p, the dimension of the risk factors in model (2) can be reduced significantly. (1) is not identifiable because for any nonsingular matrix L, we have

Remark 1. (Model Identification). Model
, indicating that parameters B and are not simultaneously estimable without imposing an identifiability constraint. In this study, we follow a common practice in the literature (Bollen 1989;Lee 2007) to (i) fix the diagonal elements of to 1, such that is a correlation matrix. This constraint unifies the scale of the latent variables. (ii) Fix additional elements in B at preassigned values, such that the only possible nonsingular matrix L that makes B * satisfy the constraints is the identity matrix.
Remark 2. In model (1), we make a generic assumption Let θ be a vector of the unknown parameters in model (1), andθ be the MLE of θ . Given that model (1) is a standard CFA model, the estimation of θ is relatively simple (Bollen 1989;Shi and Lee 2000;Lee and Song 2004). We first focus on the parameter estimation procedure for model (2), and then outline the EM algorithm to obtainθ in model (1).

Borrow-Strength Estimation
We define the counting process If the latent variables ξ i are observable, α = (β T , γ T ) T can be estimated based on the pseudo-score functions using the approach of Lin and Ying (1994): . However, the latent variables ξ i are unobservable in the proposed joint model. Consequently, the above pseudo-score functions are not directly applicable. Thus, to address this problem, we first estimate ξ i based on model (1). For a given θ, an estimator of ξ i , denoted byξ i (θ ), can be written as (Lee 2007, chap in the estimating functionsŨ 1 (α) andŨ 2 (α) would lead to biased estimators because the resulting estimating functions do not have zero mean because function of θ and a ⊗2 = aa T for a vector a. Thus, for a given θ, we propose the corrected pseudo-score function U(α; θ ) = (U 1 (α; θ ) T , U 2 (α; θ ) T ) T for α as follows: . For givenθ , we estimate α by solving U(α;θ ) = 0. The resulting estimator has an explicit expression: Let 0 (t) = t 0 λ 0 (u)du denote the baseline cumulative hazard function. The corresponding estimator of 0 (t) is given byˆ

Inference of CFA Model
Let V = (V 1 , . . . , V n ) and ξ = (ξ 1 , . . . , ξ n ). Given that ξ i 's are unobservable, we employed the EM algorithm to obtainθ in model (1) (e.g., Shi and Lee 2000;Lee and Song 2004). The EM algorithm consists of two steps: an E-step for calculating the conditional expectation of the complete-data log-likelihood function with respect to the distribution of ξ i given the observed data and the current value of θ , as well as an M-step for updating θ by maximizing the conditional expectation. The complete-data log-likelihood function of model (1) is Theθ can be obtained by iterating the E-step and M-step as follows. At the rth iteration with the current value θ (r) : The conditional expectations of the sufficient statistics required in the following M-step have explicit forms

M-step:
We maximize Q(θ|θ (r) ) with respect to θ to obtain a new value of θ. Let B T j be the jth row of B, φ kj be the entry in the kth row and jth column of , ψ j be the jth diagonal element of , and V ij be the jth element of V i . The vector of unknown elements in B j is denoted by B j 1 and the vector of fixed ones is denoted by B j 2 . The subvectors of ξ i corresponding to B j 1 and B j 2 are denoted by ξ i,j 1 and ξ i,j 2 , respectively. Simple calculations show that the updated estimates of B j 1 and ψ j take the following explicit forms: The updated estimates of unknown parameters in do not have closed forms, so we use the Newton-Raphson algorithm to calculate them iteratively. Let 1 be a vector of the unknown parameters in , The entries in ∂Q(θ|θ (r) ) ∂ 1 and ∂ 2 Q(θ |θ (r) ) can be calculated as The expectations required in this step are calculated in the E-step. The convergence of the EM algorithm can be monitored using a commonly used stopping criterion where δ is the predetermined small value (e.g., 0.001) and l(θ |V) is the observed-data log-likelihood function The asymptotic covariance matrix ofθ can be calculated by the inverse of the observed information matrix Given that the observed information matrix contains complicated matrix differentiation, we approximate it numerically. Let l(θ) = l(θ |V), and let θ j denote the jth element of θ , then where e j is the canonical basis vector that is one at the jth coordinate, and m j is a constant that is small relative toθ j (e.g., θ j × 10 −3 ).
Remark 3. Shi and Lee (2000) and Lee and Song (2004) also used the EM algorithm to perform the statistical inference for the CFA model. The EM algorithm they employed is actually a Monte Carlo EM (MCEM) algorithm, in which the E-step approximates the conditional expectation via the simulated latent variables and other unknowns via the Gibbs sampling. In the current study, the Monte Carlo step is unnecessary because the conditional expectation can be expressed in a closed form of the parameters. Compared with those in Shi and Lee (2000) and Lee and Song (2004), the inference for the CFA model in the current study is more efficient.

ASYMPTOTIC PROPERTIES
Given that the asymptotic properties ofθ have been well studied in the literature (Amemiya, Fuller, and Pantula 1987;Anderson and Amemiya 1988;Lee 2007, pp. 41-43), we analyze the asymptotic properties ofα, which is of our primary interest in the proposed joint model. Let , where ⊗ denotes the Kronecker product of two matrices, vec denotes the operation that converts a matrix into a column vector by stacking the rows sequentially,˙ (θ) = ∂(vec (θ)) T /∂θ ,˙ s (θ ) (s = 1, . . . , p) denote the derivatives of the sth column of (θ) with respect to θ T , andḊ r (θ ) (r = 1, . . . , q) denote the derivatives of the rth column of D(θ ) with respect to θ T . Define Let α 0 and θ 0 be the true values of α and θ . Using the consistency ofθ, we obtain the consistency of (θ ) and D(θ) (Lemma A.1 of the Appendix). Then, we can establish the asymptotic properties ofα. The results are summarized in the following theorem with the proof given in the Appendix.
In implementing the EM algorithm to estimateθ , the convergence was monitored and claimed with δ = 0.001 (see (4)). Thê α was then obtained using (3). The results presented below are based on 1000 replications with sample sizes n = 500 and 1000. Table 1 presents the simulation results on the estimate of α = (β, γ T ) T , whereas Table 2 presents the simulation results on the estimate of θ. In these tables, Bias is the sampling mean of the estimate minus the true value, SE is the sampling standard error of the estimate, SEE is the sampling mean of standard error estimate, and CP is the 95% empirical coverage probability for the parameter based on the normal approximation. The Table 2  results indicate that our proposed method performed satisfactorily for all the situations under consideration. Specifically, the estimators produced were virtually unbiased. Good agreement exists between the estimated and the empirical standard errors. The coverage probabilities of the 95% confidence intervals were close to the nominal level. As expected, the performance of the proposed estimators was improved when the sample size was increased from n = 500 to 1000.
To investigate the performance of the proposed method under the situation when the factor loading matrix B has an overlapping structure, we reconducted the above simulation by using an overlapping B as follows: where the elements with asterisk are fixed and b jk 's are unknown factor loadings whose true population values were set as 0.8. The overlapping structure of B implies that latent variables ξ 1 and ξ 2 are measured by {V 1 , V 2 , V 3 , V 4 } and {V 1 , V 4 , V 5 , V 6 }, respectively. Using the proposed estimation procedure, we obtained the results under different types of λ 0 (t) and sample sizes. The results with Type (I) λ 0 (t) and n = 500 are presented in Table  S1 of supplementary material. Those obtained under other scenarios are similar and therefore not reported. The results show that the performance of parameter estimates does not change regardless of the structure of B.
We also consider a setup that mimics the real data example of Section 5 to check the robustness of the proposed method to misspecification of B. The model and parameter values are the same as those of the CKD study except b 81 = −0.245. Based on the data generated from this new model, we conduct the analysis with a misspecified B, wherein nonzero b 81 is misspecified as zero. Table S2 of supplementary material shows that the estimation results related to the AH model are quite robust, but those related to the CFA model are slightly influenced under the case considered. Nevertheless, it is worthwhile to note that the statistical inference for the CFA is sensitive to misspecification of the number of latent factors. When fitting a wrong model with extra (nonexistent) or missing (existent) latent factors, we found that the computer algorithm always diverged.
Further, we examined the empirical performance of the proposed method in which the error variance increases. We again used the previous simulation setting but increased the error variance ψ j , j = 1, . . . , 6 from 0.3 to 1.0. The results obtained with Type (1) λ 0 (t) and n = 500 are reported in Table S3 of supplementary material. Except for slightly increased standard error estimates, the parameter estimates are still unbiased and close to the ones obtained under smaller error variance.
The results with Type (I) λ 0 (t) and n = 500 under Cases (1)-(6) are presented in Tables S4 and S5 of supplementary material. Most of the parameter estimates, especially those related to the AH model, are quite robust to the nonnormality of ξ i and i . While nonnormal ξ i mainly influence the standard error estimates of the factor loadings in B, nonnormal i mostly impact on those of the error variances in . We also checked the sensitivity of the estimation results when ξ i and i follow other nonnormal distributions. Similarly, the inference related to the AH model is very robust to the nonnormality of ξ i and i , whereas the standard errors of some parameters in the CFA model are overestimated/underestimated. Considering that the primary goal of the current study is to investigate the potential risk factors of hazards of interest, the inaccurate standard error estimates of error variances in the CFA model would not cause a problematic inference of the AH model.

APPLICATION TO CKD DATA
The proposed methodology was applied to a study of CKD for Type 2 diabetic patients as discussed in Section 1. The main objective is to investigate the potential risk factors that affect the hazards of CKD. The failure time of CKD was calculated as the periods from enrollment to the date of the first clinical endpoint or January 31, 2009, whichever came first. The failure (clinical endpoint) of CKD was defined by diabetic nephropathy (DNP) plus follow-up estimate glomerular filtration rate (eGFR) < 60 (Wang et al. 2014). The CR of CKD is approximately 73%. Possible risk factors include those relevant to the patients' characteristics, such as age at enrollment, duration of diabetes, sex (1 = female, 0 = male), waist circumference (WAIST), body mass index (BMI), SBP, DBP, HbA1c, FPG, TC, HDL-C, and TG. Based on the medical meaning of these observed variables and the existing literature (Song et al. 2009;Wang et al. 2014), grouping WAIST and BMI into a latent variable "obesity," SBP and DBP into a latent variable "blood pressure," HbA1c and FPG into a latent variable "glycemia," and TC, HDL-C, and TG into a latent variable "lipid" is natural and acceptable. Hence, in building a CFA model, we propose the number of latent variables (factors) to be four, a nonoverlapping factor loading matrix B, and a clear interpretation of the latent variables. To crossvalidate the decision, we conducted an EFA using LISREL. The estimated number of factors is four, and the estimated factor loadings reported in Table 3 clearly show that Factor 1 can be interpreted as "obesity" (i.e., large loadings associated with WAIST and BMI, and small loadings associated with others). Meanwhile, Factors 2, 3, and 4 can be interpreted as "blood pressure," "glycemia," and "lipid," respectively. In determining the factor loading matrix, we fixed the parameters associated with small loadings at zero, consequently obtaining absolutely clear interpretations of the latent variables. Therefore, a joint model with latent variables was proposed to examine how the observed risk factors, such as age, duration of diabetes, and sex, as well as latent risk factors stipulated above, influence the hazards of CKD.
Let V = (V 1 , V 2 , . . . , V 9 ) T = (WAIST, BMI, SBP, DBP, HbA1c, FPG, TC, HDL-C, TG) T , each of which is standardized prior to the analysis, ξ = (ξ 1 , ξ 2 , ξ 3 , ξ 4 ) T = (obesity, blood pressure, glycemia, lipid) T , and Z = (Z 1 , Z 2 , Z 3 ) T = (age at enrollment, duration of diabetes, sex) T . We employed models (1) and (2) with p = 9, q = 4, s = 3, and the following nonoverlapping factor loading matrix to conduct the analysis: where the zeros are fixed elements. In estimatingθ , we set δ = 0.001. Theθ andα were then obtained based on the EM algorithm and (3), and their standard error estimates were obtained based on (5), (6), and Theorem 1. The proposed joint model and the analytical results are reported in Figure 1. For clarity, the less important parameters in and are not reported. Several findings are obtained from Figure 1. First, the age at enrollment has significantly positive effect on the hazards of CKD, indicating that older patients have higher risks of developing CKD. Second, the duration of diabetes also has significantly positive effect on the hazards of CKD. Patients with longer duration of diabetes are more predisposed to CKD. Third, different from age and duration of diabetes, the effect of sex on CKD is not significant. Thus, male and female diabetic patients have equal risks of suffering this complication. Fourth, all latent risk factors have  significantly positive effects on the hazards of CKD, indicating that obesity, hypertension, higher level of blood glucose, and bad control of blood fat increase the risk of CKD. Finally, all the estimated factor loadings are highly significant, indicating strong associations between each latent variable and its corresponding observed variables. These findings have public health implications, especially for the aggressive control of risk factors to prevent CKD or other complications for Type 2 diabetic patients and to improve their quality of life. For comparison, we conducted a standard analysis by regarding the observed variables, V 1 , . . . , V 9 , as independent covariates in the AH model. The results are reported in Table 4, which shows the diverse effects of the observed indicators of a latent risk factor. For instance, the effect on the hazards of CKD is significant for WAIST but nonsignificant for BMI, positive for SBP but negative for DBP, significant for HbA1c but nonsignificant for FPG, and negative for TC, nonsignificant for HDL-C, but positive for TG. A further check reveals high correlations among the observed variables. For instance, the sample correlation between WAIST and BMI,SBP and DBP,and Hb1Ac and FPG,are 0.827,0.615,and 0.687,respectively. These high correlations elicit multicollinearity and result in diverse effects that are confounding and hard to interpret. By contrast, the proposed joint analysis avoids the multicollinearity problem and provides important insights for the latent risk factors of CKD.
Given that there is a slight disagreement between the LIS-REL analysis (b 81 = −0.245) and our specification (b 81 = 0) for the CFA model, we reanalyzed the dataset under the same setup except allowing b 81 = 0. The parameter estimates, along with those obtained under b 81 = 0, are reported in Table S6 of supplementary material. The two analyses yield very similar results. In particular, the parameter estimates related to the AH model are identical up to the first three decimals. A sensitivity analysis in Simulation 1 also demonstrated the robustness of the proposed method to such a misspecification. Although we adopted a nonoverlapping structure of B based on the suggestion of medical experts and the existing medical literature in this study, the use of overlapping or nonoverlapping B in substantive research should be decided on a problem-to-problem basis.

DISCUSSION
In this article, we propose a joint model to investigate how observed and latent risk factors influence the hazard rate of interest. The proposed model is novel and advantageous over conventional AH models in terms of both analytic power and interpretability. A hybrid procedure that combines the EM algorithm and a borrow-strength estimation method is proposed to conduct the statistical inference. The consistency and asymptotic normality of the parameter estimators are established. Simulation results demonstrate that the proposed method performed well. The novel model and methodology were applied to a real problem of identifying potential risk factors of diabetic kidney disease. Important insights for preventing such complication were obtained.
The borrow-strength estimation procedure developed in this article is essentially a two-stage approach. The first stage estimates the latent variables and the parameters of the CFA model by the EM algorithm, whereas the second stage estimates the regression coefficients of the AH model by using the estimated latent variables with corrected pseudo-score functions. A possible alternative estimation method is to implement the EM algorithm based on the complete-data likelihood function of the proposed joint model. However, this completedata likelihood function involves high-dimensional integration and unspecified baseline hazard function. The nonlinearity of the AH model makes the E-step extremely challenging because f (ξ i |V i , X i , i , θ , α, λ 0 (t)) is very complicated, and thus sampling becomes difficult. In addition, the unspecified baseline hazard function leads to a tedious maximization in the M-step. Thus, the feasibility of such complete-data likelihood-based method is uncertain and requires further investigation.
Our approach can be regarded as a generalization of the conventional AH model to a broader class of models through the incorporation of latent variables. To the best of our knowledge, our study is the first to introduce latent variables into the survival modeling framework. Further, our approach can be extended to more generalizable models. First, the current study requires the normal assumption of the latent variables and the measurement error. However, officially assessing this assumption is impossible because the latent variables are not directly observable. Although the previous simulation demonstrates that the inference of the AH model is robust to the violation of this normal assumption, it is of interest to develop novel methods that do not require such assumption. One promising direction in relaxing this assumption is considering an asymptotically distribution-free approach (Browne 1984) to obtain the parameter estimation of the CFA model and then modifying the corrected pseudo-score method accordingly to perform the statistical inference for the AH model. Second, our model can also be extended to a longitudinal setting, in which the CFA model forms time-dependent latent variables, and the AH model reveals the dynamic effects of observed covariates and latent risk factors on the failure time of interest. Third, the current univariate AH model can be generalized to a multivariate version to accommodate multivariate failure time data. The estimation procedure proposed in this article can be extended to analyze the multivariate AH model with latent variables in a straightforward manner.
Finally, the proposed joint analysis can be applied to other model frameworks, such as the PH model where λ 0 (t), β, γ , Z i , and ξ i are defined similarly as in the CFA-AH joint model. A standard analysis of PH model is based However,Ũ p (α) is not applicable because of the existence of latent variables. Again, a simple replacement of ξ i byξ i (θ) inŨ p (α) will lead to biased estimation. Here, the correction ofŨ p (α) pertaining to exponential functions of ξ i is more difficult than that corresponding to quadratic functions of ξ i . Inspired by the previous work (e.g., Nakumura 1992; Kong and Gu 1999) on PH model with covariate measurement error, we propose corrected estimating function The estimator of (α, θ ) can be obtained similarly. As expected, the derivation of asymptotic properties of the parameter estimates for CFA-PH joint model will be more technically involved. The details are not reported here. The same rationale is applicable to the additive-multiplicative hazards model that subsumes both PH and AH models as special cases. As suggested by the associate editor, considering linear transforma-tion models (Zeng, Yin, and Ibrahim 2005;Zeng, Lin, and Lin 2008) is a promising attempt for formulating the approach in a more general way, but it requires new theoretical and computational developments. The feasibility of obtaining a corrected estimating function analytically under the transformation model framework requires further investigation.

APPENDIX: PROOFS OF ASYMPTOTIC RESULTS
We use the same notation defined earlier and take all limits at n → ∞. Let s (1) z (t), s (1) ξ (t), and s (0) (t) be the limits of S (1) z (t), S (1) ξ (t), and S (0) (t), respectively. To study the asymptotic properties of the proposed estimators, we need the following regularity conditions: (C1) The matrix (θ 0 ) is positive definite; all partial derivatives of the first three orders of (θ) with respect to the elements of θ are continuous and bounded in a neighborhood of θ 0 ; the matrix˙ (θ ) is of full rank in a neighborhood of θ 0 . (C2) 0 (τ ) < ∞, P (X i ≥ τ ) > 0, and Z i is bounded almost surely. (C3) A is nonsingular, and Note that under model (1), V i is distributed as N (0, (θ)). Let and theθ is the maximizer of L(θ|V). Some straightforward calculation yields thatθ is the minimizer of the following discrepancy function: To prove Theorems 1 and 2, we need the following lemma.
Proof of Theorem 2. Note that Then using the functional central limit theorem (Pollard 1990), together with (A.2), (A.10), and (A.11), we have that uniformly on [0, τ ], In view of the consistency ofα, from the uniform law of large numbers (Pollard 1990) and the multivariate central limit theorem,