Variable Selection in a Log–Linear Birnbaum–Saunders Regression Model for High-Dimensional Survival Data via the Elastic-Net and Stochastic EM

The Birnbaum–Saunders (BS) distribution is broadly used to model failure times in reliability and survival analysis. In this article, we propose a simultaneous parameter estimation and variable selection procedure in a log–linear BS regression model for high-dimensional survival data. To deal with censored survival data, we iteratively run a combination of the stochastic EM (SEM) algorithm and a variable selection procedure to generate pseudo-complete data and select variables until convergence. Treating the pseudo-complete data as uncensored data via SEM makes it possible to incorporate iterative penalized least squares and simplify computation. We demonstrate the efficacy of our method using simulated and real datasets.


INTRODUCTION
Fatigue failure happens when a material is subjected to fluctuating stresses and strains. Statistical models for the fatigue process describe the randomly varying failure times of fatiguing materials. Birnbaum and Saunders (1969) derived a model, now known as the BS model, under the premise that fatigue failure is caused by the development and growth of cracks from cyclic loading. Desmond (1985) provided a more general derivation based on a biological model. There are many advantages to using the BS model in reliability analysis. For example, the model fits well in the extremes of the distribution, even when little lifetime data are available.
Let T be a nonnegative random variable. The distribution of T is a Birnbaum-Saunders distribution with shape parameter α and scale parameter β, that is, T ∼ BS(α, β), if the cumulative distribution function (c.d.f.) of T is given by

F(t) = Φ[(1/α)(√(t/β) − √(β/t))], t > 0,

where Φ is the standard normal c.d.f. The probability density function (p.d.f.) of T is given by

f(t) = [1/(2αβ√(2π))] [(β/t)^(1/2) + (β/t)^(3/2)] exp{−(1/(2α²))(t/β + β/t − 2)},

where t > 0, α > 0, β > 0. This p.d.f. is unimodal and its median is β.
For complete data without censoring, Saunders and Birnbaum (1968) derived maximum likelihood estimators (MLEs) of α and β and their asymptotic distribution was obtained by Engelhardt, Bain, and Wright (1981). Although the MLEs have several optimal properties, it is difficult to get the closed form for the estimator of β. For this reason, Ng, Kundu, and Balakrishnan (2003) proposed modified moment estimators (MMEs) for α and β and a bias correction technique. They used a jackknife procedure to reduce the bias of the MLEs and MMEs. For censored data, Desmond (1982) studied MLE for both Type I and Type II censored data from the BS distribution. Wang, Desmond, and Lu (2006) considered parameter estimation for the BS distribution under random censoring.
The log-linear form of the BS regression model considered in this article is

y_i = log(T_i) = x_i'θ + ε_i, i = 1, . . . , n,   (3)

where x_i is a p-dimensional covariate vector, θ is the vector of regression coefficients, and ε_i ∼ SN(α, 0, 2) follows the sinh-normal distribution of Rieck and Nedelman (1991). In fact, model (3) can be expressed as a BS regression model as follows:

T_i = exp(x_i'θ) δ_i,   (4)

where δ_i = exp(ε_i) ∼ BS(α, β = 1). Therefore, models (3) and (4) are equivalent. But model (3) is linear with respect to x and is easier to interpret than model (4). Desmond, Rodríguez-Yam, and Lu (2008) implemented an EM algorithm to fit a log-linear regression model with censored data when failure times follow the BS distribution.
In reliability and survival analysis, there are many examples of data where the number of covariates is greater than the sample size, the so-called small-n, big-p problem. For example, a dataset from electron-probe X-ray microanalysis of archaeological glass vessels in Janssens et al. (1998) has a sample of 180 glass vessels and 1920 spectrum frequencies as predictors. The breast cancer dataset in Sørlie et al. (2003) contains 115 female patients, including 77 patients whose survival times were censored during follow-up, and 549 gene expression measurements as predictors.
We now turn to variable selection methods for these high-dimensional problems with p ≫ n. Most variable selection criteria are closely related to penalized least squares and penalized log-likelihoods. Traditional variable selection criteria such as AIC and BIC, which require subset selection, have some drawbacks: they ignore stochastic errors inherited from the stage of variable selection and also lack stability (Breiman 1996). To retain important variables and to avoid the instability of subset selection, Tibshirani (1996) proposed the LASSO with the L1 penalty. Efron et al. (2004) further provided deep insights into the procedures of the LASSO and least angle regression. Hoerl and Kennard (1970) considered the L2 penalty, which yields ridge regression to handle an ill-conditioned design matrix. Zou and Hastie (2005) combined the L1 penalty and the L2 penalty into the elastic-net penalty, which thus enjoys the flexibility of being a more LASSO-like or more ridge-like solution, depending on the choice of tuning parameters. Fan and Li (2001) suggested the smoothly clipped absolute deviation (SCAD) penalty, which is an improvement over the L0, L1, and L2 penalties.
Variable selection has been widely used in various survival regression models. To name a few, Huang, Ma, and Xie (2006) considered the LASSO and the threshold gradient directed regularization for simultaneous variable selection and estimation in the accelerated failure time model with high-dimensional covariates, based on Stute's weighted least-squares method. Hu and Lian (2013) used the LASSO and SCAD penalties for variable selection in a partially linear proportional hazards model. These variable selection methods can identify the important variables among all the covariates. We are motivated to conduct this research by a few reliability and survival datasets, such as the two aforementioned, and by recent developments of the Birnbaum-Saunders model. To the best of our knowledge, there is no work on variable selection in Birnbaum-Saunders regression models in the literature. The contribution of this article is to extend the elastic-net penalty to the BS regression model.
Censoring is a common problem in survival data analysis. Broadly speaking, censoring occurs when some lifetimes are known to have occurred only within certain intervals. Here, we consider only right censoring. In this case, the data can be represented by pairs of random variables (Y_i, δ_i), i = 1, . . . , n, where Y_i = min(T_i, C_i); δ_i indicates whether lifetime T_i corresponds to an event (δ_i = 1) or is censored (δ_i = 0), and C_i represents the censoring time.
The expectation-maximization (EM) algorithm formalized by Dempster, Laird, and Rubin (1977) is a well-known tool for dealing with censored data. The EM algorithm seeks the MLE of the marginal likelihood by iteratively applying the expectation and maximization steps. However, calculating the conditional expectation required in the E-step may be infeasible, especially when there are multi-dimensional expectations over the censoring ranges. There are also problems in obtaining maxima: the converging point is sensitive to the initial value and sometimes leads to local maxima or saddle points of the log-likelihood function. To overcome these limitations of EM, Diebolt, Ip, and Olkin (1994) developed the stochastic EM (SEM) algorithm. The idea of SEM is that the expectation can be estimated by simulation. Since SEM was proposed, several authors have applied it to solve their problems. Tregouet and Tiret (2004) applied SEM to estimate haplotype effects in censored data analysis using a standard Cox proportional hazards formulation. Chauveau (1995) also extended SEM to censored mixture models and studied its asymptotic behavior. Using SEM, we can draw pseudo-samples from their conditional density given the observed data and current estimates of the parameters. Replacing the censored data by these pseudo-samples yields pseudo-complete data, and the problem can then be treated as in the uncensored-data case.
In this article, we consider simultaneous estimation and variable selection in model (3) for high-dimensional survival data, where the survival or failure time T may be right censored and the number of covariates p is greater than the sample size n. To the best of our knowledge, simultaneous estimation and variable selection in model (3) with p ≫ n has not been studied in the literature, for either censored or uncensored data. Our research aims to fill this gap.
The remainder of this article is organized as follows. In Section 2, we first review the existing methods for estimation with uncensored data when dimensionality is low. Then, we study the methodology for simultaneous parameter estimation and variable selection. The coordinate descent algorithm and the stochastic EM algorithm for the implementation of the proposed methodology are described in Section 3. Simulation studies and real data examples are presented in Sections 4 and 5, respectively. Some concluding remarks are provided in Section 6. Finally, some additional materials and the R codes for the simulation studies and data analysis are given in the online supplementary material.

Parameter Estimation for Uncensored Data When Dimensionality Is Low
In Section 1, we introduced the BS regression model and its equivalent: the log-linear model (3). Here we first discuss the estimation of the parameters θ and α in the log-linear model when dimensionality of x is low and fixed, that is, p < n and data are uncensored.
Let {(y_1, x_1), (y_2, x_2), . . . , (y_n, x_n)} be n independent observations from model (3), where y_i = log(t_i) and the t_i are uncensored failure times. The log-likelihood function that Rieck and Nedelman (1991) considered for this model is

l_0(θ, α) = Σ_{i=1}^n log ξ_{i2} − (1/2) Σ_{i=1}^n ξ_{i1}² + constant,

where ξ_{i1} = (2/α) sinh((y_i − x_i'θ)/2) and ξ_{i2} = (2/α) cosh((y_i − x_i'θ)/2). Here sinh and cosh are the hyperbolic sine and cosine functions, defined as sinh(x) = (e^x − e^{−x})/2 and cosh(x) = (e^x + e^{−x})/2. To obtain the maximum likelihood estimator, we differentiate l_0 with respect to θ_j and α and set the derivatives equal to zero. Thus, we obtain the following equations:

Σ_{i=1}^n x_{ij} ξ_{i1} (ξ_{i2} − 1/ξ_{i2}) = 0, j = 1, . . . , p,   (8)

where p is the number of covariates, and

−n/α + (4/α³) Σ_{i=1}^n sinh²((y_i − x_i'θ)/2) = 0.   (9)

So, the expression for the MLE of α² in terms of the MLE θ̂ of θ, obtained by solving Equation (9), is

α̂² = (4/n) Σ_{i=1}^n sinh²((y_i − x_i'θ̂)/2).   (10)

Since there is no closed-form solution to Equation (8), we compute the estimate of θ using ordinary least squares (OLS). Rieck and Nedelman (1991) proved that the least-squares estimator of θ with uncensored data is unbiased for θ and highly efficient for small values of α. They showed that the success of least-squares regression is due to the approximate normality of the sinh-normal distribution when α is small, and that the efficiency decreases as α increases. Under many commonly occurring conditions, however, the value of α can be expected to be less than 1. For model (3), E(ε_i) = 0 and var(ε_i) = 4ω(α), i = 1, . . . , n. There is no closed-form expression for ω(α), but it can be estimated from data. If the observations y_1, . . . , y_n are independent, then cov(ε_i, ε_j) = 0 for i ≠ j (i, j = 1, . . . , n). Under these conditions on the ε_i's, the OLS estimator

θ̂ = (X'X)^{−1} X'Y,   (11)

is the best linear unbiased estimator of θ, where Y is the column vector of the y_i's and X = (x_1, . . . , x_n)'. An unbiased estimator of ω(α) is ω̂(α) = (Y − Xθ̂)'(Y − Xθ̂)/{4(n − p)}. To improve efficiency, we can use a numerical procedure to calculate the MLE, taking the least-squares estimate of θ as the initial value and the initial value of α from (10).
Here, we show the procedure using Fisher's method of scoring, which is equivalent to an iterative ordinary least-squares procedure. Rieck and Nedelman (1991) obtained a system of equations at the mth iteration: a least-squares equation (12) for θ, based on quantities R and an adjusted response Y* computed from the current estimates, together with an iterative equation (13) for α. Using these equations, the iterative least-squares method for obtaining the MLEs of (θ, α) proceeds as follows:
1. Obtain the initial value θ(0) from Equation (11). Obtain α(0) from θ(0) using the square root of Equation (10).
2. Set m = 1.
3. Calculate R and Y* using θ(m−1) and α(m−1).
4. Calculate θ(m) and α(m) by Equations (12) and (13), respectively.
5. If the change in the estimates from the previous iteration is sufficiently small, stop the iteration and proceed to Step 7. Otherwise, go to Step 6.
6. Add 1 to m and go to Step 3. Continue until convergence is attained.
7. θ(m) and α(m) are the MLEs of (θ, α).
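Since the scoring quantities R and Y* are given in Rieck and Nedelman (1991) and not reproduced here, the sketch below illustrates the same idea in Python by maximizing the sinh-normal log-likelihood l_0 directly, starting from the OLS estimate of θ and the square root of Equation (10) for α. The function names and the choice of BFGS are our own assumptions; the article's actual implementation is in R.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, X, y):
    # Negative sinh-normal log-likelihood l_0 of Rieck and Nedelman (1991):
    # l_0 = sum_i [log xi_i2 - xi_i1^2 / 2] + constant, with
    # xi_i1 = (2/alpha) sinh((y_i - x_i'theta)/2), xi_i2 = (2/alpha) cosh(...).
    theta, alpha = params[:-1], np.exp(params[-1])  # log-alpha keeps alpha > 0
    z = (y - X @ theta) / 2.0
    xi1 = (2.0 / alpha) * np.sinh(z)
    xi2 = (2.0 / alpha) * np.cosh(z)
    return -np.sum(np.log(xi2) - 0.5 * xi1 ** 2)

def fit_bs_loglinear(X, y):
    # OLS start for theta; alpha start from the square root of Equation (10)
    theta0 = np.linalg.lstsq(X, y, rcond=None)[0]
    alpha0 = np.sqrt(4.0 * np.mean(np.sinh((y - X @ theta0) / 2.0) ** 2))
    start = np.append(theta0, np.log(alpha0))
    res = minimize(neg_loglik, start, args=(X, y), method="BFGS")
    return res.x[:-1], np.exp(res.x[-1])
```

With a small α, the OLS start is already close to the maximizer, so the numerical refinement typically converges in a few iterations.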
The above procedure works well for uncensored data with n > p. But for high-dimensional data (i.e., n < p) and censored data, the existing methods are not directly applicable. Instead we propose a regularization method based on the elastic-net penalty for simultaneous variable selection and parameter estimation.

Elastic-Net Penalty and Regularization Method
In general, parameter estimation and variable selection can be implemented via the penalized log-likelihood

l_0(θ, α) − n λ Σ_{j=1}^p P_τ(θ_j),

where P_τ(·) is a penalty function and λ ≥ 0 is a tuning parameter controlling the degree of regularization. In this article, we consider the elastic-net penalty

P_τ(θ_j) = (1 − τ) θ_j²/2 + τ |θ_j|,

where τ is another tuning parameter, 0 ≤ τ ≤ 1. The elastic-net penalty provides a compromise between the ridge penalty (τ = 0) and the LASSO penalty (τ = 1). The first term ensures democracy among groups of correlated variables, whereas the second term enforces the sparsity of the solution. By choosing the tuning parameter τ, we can tune the elastic-net to be more ridge-like or more LASSO-like. The elastic-net thus incorporates the advantages of the LASSO and ridge regression and does group selection. It is particularly useful in the p ≫ n situation, or in any situation with many correlated predictor variables. Zou and Hastie (2005) used simulation results to demonstrate the good performance of the elastic-net and its superiority over the LASSO: the elastic-net produces a sparse model with good prediction accuracy, while encouraging a grouping effect. However, computation based on the above penalized log-likelihood is very complicated. A solution is to iteratively use penalized least squares instead of the penalized log-likelihood. This idea was used by Rieck and Nedelman (1991) in the case of p < n and uncensored data, where they did not consider any penalty or variable selection. Specifically, our new method uses the iterative least-squares procedure to construct the penalized least-squares objective function

(1/2n) ||Y* − Xθ||² + λ Σ_{j=1}^p P_τ(θ_j),   (15)

where Y* is the adjusted response of the iterative least-squares step. By minimizing Equation (15), we obtain a regularized estimate of θ at the mth iteration. Then, we also update α by using the estimate of θ, as we did via (13) in Section 2.1.
Since it is a problem of penalized least squares at each iteration, the coordinate descent algorithm can be easily used to obtain a minimizer of Equation (15), for example, by using the R package glmnet. We also want to select the optimal tuning parameter λ to minimize risk in estimation. All these are discussed in the online supplementary material.
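As a concrete illustration of the coordinate descent step (a simplified stand-in for the R package glmnet used in the article), the following Python sketch performs cyclic coordinate descent on a penalized least-squares objective of the form (1/2n)||y − Xθ||² + λ[τ||θ||₁ + (1 − τ)||θ||²/2]. The function names, the penalty scaling, and the fixed sweep count are our own assumptions for illustration.

```python
import numpy as np

def soft_threshold(z, g):
    # Soft-thresholding operator arising from the L1 part of the penalty
    return np.sign(z) * max(abs(z) - g, 0.0)

def elastic_net_cd(X, y, lam, tau, n_sweeps=200):
    # Cyclic coordinate descent for
    # (1/2n)||y - X theta||^2 + lam * [tau ||theta||_1 + (1-tau)/2 ||theta||^2]
    n, p = X.shape
    theta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n        # per-coordinate quadratic term
    r = y.copy()                              # residual y - X theta
    for _ in range(n_sweeps):
        for j in range(p):
            # partial-residual correlation for coordinate j
            rho = X[:, j] @ r / n + col_ss[j] * theta[j]
            new = soft_threshold(rho, lam * tau) / (col_ss[j] + lam * (1.0 - tau))
            r += X[:, j] * (theta[j] - new)   # keep residual in sync
            theta[j] = new
    return theta
```

Each coordinate update has a closed form because the objective is quadratic plus a separable penalty in θ_j, which is what makes coordinate descent so efficient here.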

Stochastic EM Algorithm for Censored Data
For censored data, following the definition given in Section 1, we treat the problem as a typical missing data problem. Let T = {T_1, . . . , T_r, T_{r+1}, . . . , T_n} denote the complete data, where, without loss of generality, the first r lifetimes are observed and the last n − r are censored at times C = {C_{r+1}, . . . , C_n}; write Y = {T_1, . . . , T_r} for the observed lifetimes and Z = {Z_{r+1}, . . . , Z_n} for the unobserved lifetimes. The observed data likelihood function is

Π_{i=1}^r f(T_i; x_i, θ, α) Π_{i=r+1}^n S(C_i; x_i, θ, α),   (16)

where f and S denote the density and survival functions of the lifetime. The complete data likelihood function often admits a closed-form expression or a simple estimating equation for the corresponding MLE, θ_ML(Y, Z), while the observed data likelihood function given by (16) does not. The information about the missing data is completely embedded in the conditional density of Z given C, X, and θ(m) (Diebolt, Ip, and Olkin 1994):

f(z_i | Z_i > C_i, x_i, θ(m)) = f(z_i; x_i, θ(m)) / S(C_i; x_i, θ(m)), z_i > C_i.   (17)

The main idea of the stochastic EM is to "fill in" the missing data with a single draw from Equation (17), that is, the conditional density given the observed data, including the covariates X and θ(m), a current guess of the parameter. Specifically, in each iteration we generate data from (17) to replace the censored observations, producing pseudo-complete data; this is called the S-step. Once we have a pseudo-complete sample, we can directly maximize its complete data log-likelihood to obtain an updated maximizer θ(m+1); this is called the M-step. In summary, the steps for estimating θ and selecting variables using SEM and iterative penalized least squares with the elastic-net penalty are as follows.
1. First, treat the observed data as complete data and use the iterative penalized least-squares estimation procedure to get the penalized MLE of (θ, α). Use this as the initial values θ(0) and α(0).
2. Set m = 1.
3. For each censored observation, draw a pseudo-value Z_i(m) from the conditional density (17) given the current estimates θ(m−1) and α(m−1).
4. With the pseudo-complete sample {T_1, . . . , T_r, Z_{r+1}(m), . . . , Z_n(m)}, update θ(m−1) and α(m−1) to θ(m) and α(m) by the iterative penalized least-squares method.
5. Add 1 to m and repeat Steps 3 and 4.
6. After a burn-in period of m_0 iterations, the sequences θ(m) and α(m) are expected to be close to their stationary regime; stop after a sufficiently large number of iterations M.
7. Calculate the average of the (M − m_0) post-burn-in estimates, denoted by θ̄ and ᾱ, which are the final estimates of θ and α.
In our numerical studies, we take M = 300 and m_0 = 100. According to the properties of an ergodic Markov chain, the averaged estimates converge to the true parameter values with probability one as M → ∞. The final output from the stochastic EM is a sample from a stationary distribution whose mean is close to the MLE and whose variance reflects the information loss due to missing data. In addition to the iterative penalized least squares, the stochastic EM algorithm for censored data involves another layer of iteration and requires more computing time. However, the algorithm is quite simple and avoids a problematic regularized minimization of the penalized log-likelihood function: the whole problem is treated as a complete-data case using penalized least squares, which is easily implemented, for example, with the R package glmnet.
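To make the S-step concrete: under model (3), ε = 2 arcsinh(αW/2) with W standard normal, so truncating the log-lifetime at log C_i is equivalent to truncating W. The article samples from (17) by acceptance-rejection; the Python sketch below instead uses this monotone transformation with inverse-CDF sampling, an alternative we assume purely for illustration (the function name and arguments are hypothetical).

```python
import numpy as np
from scipy.stats import norm

def draw_censored_logtime(x, theta, alpha, log_c, rng):
    # Draw a pseudo log-lifetime y > log_c under model (3):
    # y = x'theta + eps, eps = 2*arcsinh(alpha*W/2), W ~ N(0, 1).
    # Truncating y at log_c is equivalent to truncating W at b below,
    # so an exact draw follows by inverse-CDF sampling on (Phi(b), 1).
    mu = x @ theta
    b = (2.0 / alpha) * np.sinh((log_c - mu) / 2.0)
    u = rng.uniform(norm.cdf(b), 1.0)
    w = norm.ppf(u)
    return mu + 2.0 * np.arcsinh(alpha * w / 2.0)
```

Because the map from W to y is strictly increasing, this one-dimensional inverse-CDF draw is an exact sample from the truncated conditional density, with no rejections.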

Variable Selection by Multiple Testing in Stochastic EM
In our numerical analysis with censored data, each simulation run involving the SEM algorithm yields M − m_0 = 200 sets of coefficient estimates from 200 iterations, which may give different variable selection results. To select a final set of variables from all these results, we adopt the idea of a hypothesis test for a proportion. For each simulation setting, we calculate the proportion of nonzero estimates for each variable over the 200 iterations and denote the proportions by π̂ = (π̂_1, . . . , π̂_p).
For a one-proportion test of the null hypothesis H_0: π = π_0 versus the alternative hypothesis H_a: π < π_0, the one-proportion z-test statistic is

Z = (π̂ − π_0)√n / √(π_0(1 − π_0)),   (18)

where π̂ is the estimated proportion and n is the number of trials. In our case, we treat all variables whose proportions of nonzero estimates are statistically less than 0.5 as nonsignificant variables, so the hypotheses are H_0: π = 0.5 versus H_a: π < 0.5. For each individual test, we calculate the value of the test statistic from (18), which equals Z_calc = (π̂_i − 0.5)√(M − m_0)/0.5, and the p-value p_i = P(Z ≤ Z_calc), Z ∼ N(0, 1). We treat the whole procedure as a multiple testing problem and use the method of Benjamini and Hochberg (1995) to control the false discovery rate, calculating adjusted p-values p̃_i; this can be implemented with the R function p.adjust. We set the test size at α = 0.05: when p̃_i < 0.05, we reject the null hypothesis and conclude that the ith variable is not significant; otherwise, it is significant. Using this method, we can decide how many and which variables to select in each simulation. After m simulations, we average the number of selected variables and round it to the nearest integer S, which is taken as the final number of selected variables. Then, we calculate the proportion of nonzero coefficients for each variable over the m simulations; ranking the proportions from highest to lowest and selecting the first S gives the S variables that are significant on average. For complete data without censoring, we do not need SEM, so the final result is based on a single cycle of iterative penalized least squares. This method of determining significant variables is used in both the simulation and the real data analyses.
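The proportion-test-plus-FDR selection rule can be sketched as follows, in Python rather than the article's R; the Benjamini-Hochberg step-up adjustment below mirrors what R's p.adjust(method = "BH") computes, and the function name is our own.

```python
import numpy as np
from scipy.stats import norm

def select_variables(pi_hat, n_iter, pi0=0.5, level=0.05):
    # One-sided one-proportion z-tests (H_a: pi < pi0) with
    # Benjamini-Hochberg step-up adjustment of the p-values.
    pi_hat = np.asarray(pi_hat, dtype=float)
    z = (pi_hat - pi0) * np.sqrt(n_iter) / np.sqrt(pi0 * (1.0 - pi0))
    pvals = norm.cdf(z)                       # lower-tail p-values
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order] * m / np.arange(1, m + 1)
    adj_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.minimum(adj_sorted, 1.0)
    # Rejecting H0 (adjusted p < level) declares a variable NOT significant,
    # so the variables kept are those with adjusted p >= level.
    return adj >= level
```

Note the inverted logic relative to most testing problems: here rejecting H_0 drops a variable, so the retained set corresponds to adjusted p-values at or above the test size.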

Residual Analysis for Model Checking
To study departures from the error assumptions as well as the presence of outlying observations in the BS regression model, Leiva et al. (2007) considered two kinds of residuals: standardized deviance component residuals and standardized martingale-type residuals. We use their method to check whether the BS regression model is appropriate in a specific application. We demonstrate this using two real-data examples in Section 5.

Design of Simulation Study
The purpose of the simulation study is to test the performance of simultaneous parameter estimation and variable selection in the Birnbaum-Saunders regression for high-dimensional uncensored and censored data, respectively. We also simulate grouped or correlated data to see if the elastic-net is a better variable selection procedure than LASSO when covariates are grouped or correlated.
We simulate data from model (3). To generate ε ∼ SN(α, 0, 2), following Rieck (1989), we first generate T ∼ BS(α, 1) and then let ε = ln(T). Following Ng, Kundu, and Balakrishnan (2003), we generate T from a normal random variable by using the transformation

T = β[1 + 2V² + 2V(1 + V²)^{1/2}],

where V is normally distributed with mean zero and variance α²/4. In generating pseudo-responses from the conditional density (17), we use the acceptance-rejection method. We present the following three examples to illustrate our method under different design matrices X and different censoring rates. We first generate high-dimensional uncensored and censored datasets in Example 1; in this example, we use the LASSO for variable selection. To observe the difference between the LASSO and the elastic-net, we generate data with grouping effects in Example 2 and data with correlated covariates in Example 3.
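The generation of BS lifetimes and model (3) responses described above can be sketched as follows (Python for illustration, while the article's code is in R; the function names are our own):

```python
import numpy as np

def rbs(alpha, beta, size, rng):
    # T ~ BS(alpha, beta) via the transformation of Ng, Kundu, and
    # Balakrishnan (2003): V ~ N(0, alpha^2/4),
    # T = beta * (1 + 2 V^2 + 2 V sqrt(1 + V^2)).
    v = rng.normal(scale=alpha / 2.0, size=size)
    return beta * (1.0 + 2.0 * v ** 2 + 2.0 * v * np.sqrt(1.0 + v ** 2))

def simulate_model3(X, theta, alpha, rng):
    # Responses from model (3): y = x'theta + eps, eps = log T, T ~ BS(alpha, 1)
    eps = np.log(rbs(alpha, 1.0, X.shape[0], rng))
    return X @ theta + eps
```

Two quick sanity checks follow from the properties stated earlier: the sample median of the generated T values should be close to β, and the generated errors ε should have mean close to zero.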
In Section 4.2, we present the simulation results based on 100 simulation runs for Examples 1 and 2. It is natural to ask whether the proposed method remains reliable when the model and its assumptions are violated. We conducted simulations to assess the effect of model misspecification on the inference; the results indicate that the proposed method is still reliable when the model assumptions are violated. This part and the results of Example 3 are included in the online supplementary material.

Examples for Parameter Estimation and Variable Selection

Example 1: Parameter Estimation and Variable Selection With Uncensored and Censored Data.
We test the performance under different censoring rates, cr = 0% (i.e., uncensored data), 16%, 29%, and 41%, when α = 0.5 and α = 1.5, respectively. We use five measures to evaluate the variable selection performance:
• #S: the average number of selected variables. We want #S to be close to the true number of nonzero parameters.
• #FN: the average number of incorrectly excluded variables; an incorrectly excluded variable is one that should be selected but is declared nonsignificant by our procedure.
• L1-loss: the L1 norm of the difference between the target value θ_0 and the estimate θ̂, that is, ||θ̂ − θ_0||_1.
• L2-loss: the L2 norm of the difference, ||θ̂ − θ_0||_2.
• Mean squared error (MSE): Σ_{i=1}^p (θ̂_i − θ_{0i})²/p. Each simulation run yields one MSE; we record the median of the MSEs over all runs.
For #FN, L1-loss, L2-loss, and the median of MSE, we want the values to be as small as possible.
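For concreteness, the five measures for a single simulation run can be computed as follows (a Python sketch with hypothetical function names; #S and #FN are then averaged, and the median MSE taken, across runs):

```python
import numpy as np

def selection_metrics(theta_hat, theta0):
    # The five measures for one simulation run; coefficient estimates equal
    # to zero are treated as "not selected".
    sel = theta_hat != 0
    true_sel = theta0 != 0
    diff = theta_hat - theta0
    return {
        "S": int(sel.sum()),                 # number of selected variables
        "FN": int((true_sel & ~sel).sum()),  # incorrectly excluded variables
        "L1": float(np.abs(diff).sum()),     # ||theta_hat - theta0||_1
        "L2": float(np.sqrt((diff ** 2).sum())),
        "MSE": float((diff ** 2).mean()),
    }
```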
Tables 1-3 show the results obtained from Example 1. In Table 1, we show only the estimates of α, the intercept θ_1, and the nonzero parameters θ_2 to θ_5. Table 2 shows the five measures under different settings; values reported in brackets are standard deviations (SD). Rounding all the #S values to the nearest integer, we report the IDs of the selected variables, based on the rank of the percentages of being selected under each setting, in Table 3.
The estimation for uncensored data (cr = 0%) when α = 0.5 is quite good for both α and θ. When data are censored, the estimation bias and standard deviation are larger than those for uncensored data, and increase when the censoring rate increases. When α = 1.5, the estimation for α has a large bias. This is reasonable since we mentioned in Section 2.1 that the efficiency of least-squares estimator decreases as α increases. The poor estimate of α also affects the estimate of θ. Rieck and Nedelman (1991) observed the same phenomenon in the case of low-dimensional data.
When α = 0.5, the number of selected variables is around 12, while the number of truly nonzero variables is 4. According to Table 3, when we rank the variables by the magnitude of their proportions of being selected from largest to smallest, the truly significant variables #2 to #5 have the highest ranks. Thus, although more variables are selected, the truly significant variables are the most likely to be selected and rank highest. As reported in Table 2, all five measures except #S are small and have small standard deviations, which shows that our method does well in parameter estimation and variable selection. L1-loss and L2-loss increase as the censoring rate increases; this pattern is consistent with the bias shown in Table 1.
When α = 0.5 is increased to 1.5, more variables are selected and the bias becomes larger. Unlike the case of parameter estimation, variable selection does not produce noticeably different results compared to that under α = 0.5. In fact, all the true significant variables are still selected with high probability and have the highest rank as shown in Table 3.
Comparing variable selection for uncensored and censored data, that is, the results for cr = 0% against those for the other censoring rates, we do not see big differences in terms of #S and #FN. This shows that our method of handling censored data is reasonable and effective; the performance based on the generated pseudo-complete data is close to that based on the uncensored data.

Example 2: Variable Selection for Censored Data With Grouping Effects.
To compare variable selection with grouping effects using the LASSO and the elastic-net, we set the tuning parameter τ = 1 (LASSO) and leave τ unspecified (elastic-net), respectively, and then use cross-validation to select λ and any unspecified τ. Here, we restrict the tuning parameter τ to be greater than 0.5: when τ is less than 0.5, the penalty function is more ridge-like, which selects many variables with large standard deviations. The shape parameter α is fixed at 0.5.
In this example, there are 15 truly significant variables (excluding the intercept). Table 4 shows the selected variables. Both the LASSO and the elastic-net select all the significant variables in the highest ranks; in addition, the elastic-net tends to select more variables than the LASSO. In terms of L1-loss, L2-loss, and the median of MSE shown in Table 5, the elastic-net is slightly better than the LASSO, which may be attributed to the elastic-net selecting more variables. Zou and Hastie (2005) conducted a similar simulation study with complete data and obtained analogous results.
In conclusion, our method can provide good parameter estimation and variable selection simultaneously in model (3) for high-dimensional censored data. In particular, when α is smaller than 1, both variable selection and parameter estimation perform well. The LASSO and the elastic-net can both select all the significant variables with high probability, but tend to select more variables than the number of true variables. Beyond this, we see no obvious difference between the LASSO and the elastic-net in our simulation studies.

REAL-DATA ANALYSIS
In this section, we apply our method to two datasets. These datasets represent industrial data and medical survival data, respectively. The first one is the vessel data, which is a high-dimensional uncensored dataset from chemical analysis. We apply the proposed elastic-net method to select important variables. The second one is a high-dimensional censored dataset from DNA microarray experiments. For the censored data, we first generate pseudo-complete data by using the stochastic EM algorithm, then select important variables based on the proposed method for uncensored data analysis.

Vessel Data
Variable selection is particularly useful in the field of chemometrics, where hundreds or even thousands of spectra need to be analyzed. We examine a situation in which archaeological glass vessels from the 16th and 17th centuries were investigated through chemical analysis. The aim of this study was to learn more about the production of these vessels, particularly their origin and possible trade connections between known producers. Maronna (2011) proposed a robust ridge regression for high-dimensional data and analyzed this dataset to demonstrate the advantage of his approach, but he did not consider variable selection. Following Maronna (2011), we consider the dataset corresponding to electron-probe X-ray microanalysis of archaeological glass vessels (Janssens et al. 1998), where each of n = 180 vessels is represented by a spectrum of 1920 frequencies.
For each vessel, the contents of 13 chemical compounds are registered. Since the frequencies below 15 and above 500 have almost all null values of x_ij, we keep frequencies 15-500, so that p = 486. This dataset does not contain censored observations. We choose the content of the 11th chemical compound as the response T; we can treat T as an event time since it is positive. According to our model, the response Y is a logarithmic transformation of the original chemical compound T. Our preliminary analysis shows that the frequency covariates in the dataset are highly correlated; hence, it is more desirable to use the elastic-net penalty than the LASSO penalty to select correlated variables. We first use the cv.glmnet function in the R package glmnet to select the tuning parameter τ based on a linear regression model and obtain τ = 0.1. This small value of τ may be a result of the high correlation between the frequency covariates. Using this τ, we apply the proposed penalized iterative least-squares procedure for variable selection. In total, 145 variables are selected from the 486 spectra. This large number of selected variables is probably due to the highly correlated covariates, which are selected by the elastic-net in groups. But the number of selected variables is less than the sample size n = 180, so it remains feasible to use a variety of standard statistical methods. Figure 1 shows the normal probability plots for the residuals calculated by the method of Leiva et al. (2007) based on the selected covariates; it suggests that the BS model is a good fit to the vessel data. Compared with other approaches to analyzing this dataset, Maronna's (2011) robust ridge regression does not reduce the dimension of the covariate space, while our method selects important variables and estimates parameters simultaneously. Consequently, it is preferred in applications.

Breast Cancer Gene Expression Data
This dataset was studied by Sørlie et al. (2003) and contains 549 gene expression measurements, exit time, and exit status in a study of breast cancer among 115 women. Seventy-seven observations are censored, giving about a 67% censoring rate. Using the proposed methodology, we implement the SEM algorithm and variable selection by the elastic-net simultaneously. We use the cv.glmnet function in the R package glmnet to select the tuning parameter in each of the 300 SEM iterations, and obtain an average τ = 0.998 after removing 100 burn-in samples, which indicates that the LASSO is preferred for variable selection in this dataset. Using the multiple testing method given in Section 3.2, we select 71 significant genes out of 549; the estimate of the shape parameter is α̂ = 0.070. By the method of Leiva et al. (2007) based on the selected covariates, the normal probability plots for the residuals shown in Figure 2 indicate that the BS regression model is an adequate fit to the data. Gorst-Rasmussen and Scheike (2011) analyzed this dataset using the LASSO in the additive hazards model and selected only four variables. Interestingly, we have no selected variables in common. When we fitted a BS regression model with those four covariates, the fit was poor and a large amount of variation in the response variable could not be explained by the four covariates. Our method selects more variables, improves the goodness of fit, and increases model predictability. Hence, it is an adequate approach for analyzing this dataset.

DISCUSSION AND CONCLUSION
We have proposed a regularization method for simultaneous parameter estimation and variable selection in the log-linear Birnbaum-Saunders regression model. When survival times are censored, we impute the censored values by the stochastic EM algorithm. Our methodology shows good performance in both simulation and data analysis: the dimension can be efficiently reduced, and the parameters can be adequately estimated. The approach can be broadly applied to high-dimensional industrial and survival data that may contain a censored survival response. The elastic-net variable selection method can handle data with or without grouping effects. Moreover, the simulation studies suggest that the proposed method is robust to model misspecification and improves efficiency and accuracy in estimation when the BS model is correctly specified.
In future research, we can improve our method from three perspectives. (i) From the methodological perspective, we can bring in additional variable selection penalty functions and compare them with the elastic-net and the LASSO, such as the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li 2001) and the smooth integration of counting and absolute deviation (SICA) penalty (Lv and Fan 2009). These existing methods, including the LASSO and the elastic-net, are designed either for variable selection or for shrinking the regression coefficients of highly correlated covariates in a group, but not truly for group selection. In the context of uncensored data, Huang et al. (2009) proposed a group bridge approach that is capable of simultaneous selection at both the group level and the within-group individual variable level; their approach uses a specially designed group bridge penalty. We speculate that for variable selection in the BS model with censored data, the group bridge has superior group and individual variable selection performance to the LASSO and elastic-net methods. For some commonly used survival models this has been shown to be true, for example, in the Cox proportional hazards model investigated by Huang et al. (2014). We will investigate this issue for the BS model in our future research. (ii) From the computational perspective, we may find alternative methods to deal with censored data other than the SEM algorithm. Since the SEM computation involves many iterations and can be time-consuming, more efficient approaches would be desirable. The ascent-based EM or ascent-based MCEM algorithm of Caffo, Jank, and Jones (2005) seems an ideal approach. However, like generalized EM or MCEM, the ascent-based EM or MCEM algorithm operates on the so-called Q-function or an approximate Q-function, which requires calculating the expected value of a complicated log-likelihood function, conditioning on the observed data and the current estimates of the parameters.
In that case, variable selection cannot be accomplished directly with existing software packages such as glmnet. In our future research, we will investigate the ascent-based EM or MCEM algorithm based on the Q-function of the BS model; a new computing method needs to be developed for this. (iii) From our simulated examples, we did not see a big difference between the elastic-net and the LASSO. Zou and Hastie (2005) included a typical example showing the important differences between these two methods: they generated groups of covariates with different levels of correlation, obtaining variables with high within-group correlations and low between-group correlations. Their simulation results showed that the elastic-net performed better than the LASSO and behaved with "oracle"-like properties. There may also be analogous examples that distinguish the elastic-net and the LASSO in our situation. All of these issues are worthy of further investigation in our future research.
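The Zou-Hastie grouped-covariate design mentioned above can be sketched as follows. The group sizes, correlation level, and penalty values here are hypothetical choices for illustration; the point is only the construction of high within-group and low between-group correlation, under which the elastic-net tends to retain whole correlated groups while the LASSO tends to pick single representatives.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(2)
n, groups, per = 100, 3, 5
# One latent factor per group; each group's covariates are noisy
# copies of its factor, giving high within-group correlation and
# low between-group correlation
Z = rng.normal(size=(n, groups))
X = np.repeat(Z, per, axis=1) + 0.05 * rng.normal(size=(n, groups * per))
beta = np.concatenate([np.full(per, 1.0), np.zeros(2 * per)])  # group 1 active
y = X @ beta + rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.3).fit(X, y)
print("lasso nonzero:", np.count_nonzero(lasso.coef_))
print("enet  nonzero:", np.count_nonzero(enet.coef_))
```

An analogous design with a censored BS response, run through the SEM loop, would be one way to probe whether the elastic-net's grouping advantage carries over to our setting.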