A resample-replace lasso procedure for combining high-dimensional markers with limit of detection

In disease screening, a combination of multiple markers tends to have higher sensitivity than an individual marker. Parametric methods for marker combination rely on the inverse of covariance matrices, which is often non-trivial to compute for high-dimensional data generated by modern high-throughput technologies. Another common problem in disease diagnosis is the limit of detection (LOD) of an instrument: when a biomarker's value falls below the limit, it cannot be observed and is assigned an NA value. To handle these two challenges in combining high-dimensional biomarkers in the presence of an LOD, we propose a resample-replace lasso procedure. We first impute the values below the LOD and then use the graphical lasso method to estimate the means and precision matrices of the high-dimensional biomarkers. Simulation results show that our method outperforms alternative approaches that substitute NA values with LOD values or remove observations containing NA values. A real-data analysis of a protein profiling study of glioblastoma patients and their survival status indicates that the biomarker combination obtained through the proposed method is more accurate in distinguishing between the two groups.


Introduction
Protein-based assays are a major component in developing cancer screening markers, and their expression levels can be rapidly measured via modern high-throughput technologies. Since different proteins may indicate changes in different aspects of a cancer, a composite marker that combines all relevant proteins can integrate these changes and thus may improve diagnostic accuracy. The proteomics literature has shown that a combination of multiple protein profiles tends to have a much higher sensitivity than an individual protein profile [3,8,31], and finding the optimal combination of different biomarkers has been a topic of continued interest. A widely used tool to measure the diagnostic performance of a biomarker combination is the receiver operating characteristic (ROC) curve and the area under this curve (AUC). A larger AUC usually indicates better accuracy in distinguishing cases from controls.
To combine protein profiles, both parametric and nonparametric methods have been considered. Parametric methods for the linear combination of relevant biomarkers mainly rely on the multivariate normal distribution assumption [9,17,[19][20][21]25,27,32]. These methods are intuitive and are closely related to the linear discriminant analysis [27]. Nonparametric methods usually rely on numerical search for the best combination, and thus are computationally demanding when the data are generated from high-throughput techniques, which limits their range of application.
The existing parametric methods face two major challenges when dealing with high-dimensional data generated by modern high-throughput technologies. The first challenge is the presence of a limit of detection (LOD) when measuring proteomic markers. That is, it is difficult to measure markers at relatively low levels due to the limited detection ability of the equipment or instrument. If a marker value of a subject is below the corresponding detection range, it cannot be observed and is assigned an NA (or zero) value [22,34]. Ultrasensitive immunosensors and nanotechnology-based arrays have been developed to address the LOD problem; however, in point-of-care applications, the existence of an LOD remains a challenging issue when using existing statistical methods to evaluate diagnostic accuracy [24]. For example, the LOD issue arises in a protein profiling study of glioblastoma patients and their survival status at the Center for Applied Proteomics and Molecular Medicine at George Mason University. In this study, protein profiles were measured on the glioblastoma tissues of the patients, and many of them have observations below the LOD. To handle this issue, one may impute the NA values with some constants to obtain a complete dataset [13]. Another approach is to remove observations with NA values and use only the complete ones for estimation. However, both methods fail to make full use of the information contained in the data and may lead to biased parameter estimators [9,20].
The second challenge arises when the number of proteins is larger than the sample size. The traditional setting assumes that a fixed number of proteins are profiled on a large sample. To estimate the mean vectors and covariance matrices of multiple biomarkers, maximum likelihood estimation (MLE) can be used [9,29]. Additionally, Tomassi et al. [28] recently proposed a likelihood-based sufficient dimension reduction method for analyzing multiple correlated markers with an LOD. However, these methods cannot be applied to high-dimensional proteomic biomarkers since the sample covariance matrices are singular [5,6,15,16,33], and thus the covariance matrices cannot be estimated by maximizing the likelihood function, which involves the matrix inverse. To overcome this issue, the literature has imposed additional assumptions on the covariance matrix. For example, to ensure positive definiteness, a diagonal covariance matrix has been used in Fisher's discriminant problem [6,10,30]. However, the assumption of independence among all the protein profiles is overly simplistic and does not represent the true correlation structure. An alternative is to add penalty terms to the likelihood function, such as a lasso-type penalty. Moreover, the lasso-type penalty imposes a sparsity constraint on the covariance/precision matrix, which is a common assumption for high-dimensional data.
Methods dealing with these two challenges in case-control studies have been scarce. A recent paper [1] considered an EM algorithm with the Gaussian graphical lasso for censored data. However, the emphasis of their method is on the relationships between gene biomarkers in a one-sample problem, and they did not consider biomarker combination for improving diagnostic accuracy. To deal with both challenges when searching for the optimal linear combination coefficients, we propose a novel resample-replace lasso procedure that estimates the coefficients so that the AUC of the biomarker combination is maximized. In our procedure, we first impute the values below the LOD using regression predictions, and then apply a graphical lasso approach [11] to the fully imputed dataset to obtain mean vector and precision matrix estimates for cases and controls. Finally, the estimates are plugged into the expression for the optimal linear combination coefficients for biomarkers [27] to obtain the optimal biomarker combination.
The rest of the paper is organized as follows. Section 2 gives the details of the proposed procedure, where a resample-replace imputation procedure is proposed and the graphical lasso method is then used to estimate the mean vectors and covariance matrices. In Section 3, simulation studies are conducted to compare the performance of the proposed method with that of alternative methods that replace NA values with LOD values or remove observations with NA values. An application to a protein profiling study of glioblastoma patients and their survival status is given in Section 4. The proof is provided in the Appendix.

Method
We introduce notation in Section 2.1 and propose a novel method to combine high-dimensional markers in the presence of the LOD in Section 2.2. The consistency of data imputation through the proposed method is given in the Appendix.

Notation
The conventional assumption in this article is that, after some known transformation, biomarker observations follow multivariate normal distributions. This distributional assumption has been widely used in the biostatistics literature (see, for example, [9,20,21,25,29]). Suppose that m + n subjects are enrolled in a study and p biomarkers are measured on each of them, where m subjects are sampled from the case population and n subjects are sampled from the control population. Assume that the detection limit of the kth biomarker is d_k (d_k > 0), k = 1, . . . , p. That is, when the value of the kth biomarker on a subject is larger than d_k, it can be detected; otherwise it is missing, and NA is used to denote the missing value. Denote the value of the kth biomarker on the ith subject by x_ik in the case population, and that on the jth subject by y_jk in the control population. Case observations are denoted by X = (x_ik)_{m×p} and control observations are denoted by Y = (y_jk)_{n×p}.
For the kth column of X (Y), let m_k (n_k) be the number of missing values among the m case (n control) subjects, and denote the index set of observed values of the kth biomarker by M_k for cases and N_k for controls, respectively. Let μ = (μ_1, . . . , μ_p)^τ and ψ = (ψ_1, . . . , ψ_p)^τ be the population mean vectors of the case and control populations, respectively, and let V = (v_kl)_{p×p} and W = (w_kl)_{p×p} be the corresponding covariance matrices, where the superscript τ denotes the transpose of a vector or matrix. Let 1_L be an L-dimensional column vector with all entries equal to 1.
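To make the notation concrete, the following Python sketch (our illustration, not code from the paper; the means, correlation, and LOD value loosely echo the simulation settings of Section 3) generates case and control matrices X and Y and masks values below the detection limit as NA:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_lod_data(m, n, p, mu, psi, rho, d):
    """Simulate case (m x p) and control (n x p) biomarker matrices with an
    exchangeable correlation rho; entries below the detection limit d[k] of
    the kth biomarker are set to NaN (the paper's 'NA')."""
    cov = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    X = rng.multivariate_normal(np.full(p, mu), cov, size=m)
    Y = rng.multivariate_normal(np.full(p, psi), cov, size=n)
    for Z in (X, Y):
        Z[Z < d] = np.nan  # below-LOD values are unobservable
    return X, Y

# case mean 11, control mean 10, common LOD 8.5 for every marker
X, Y = simulate_lod_data(m=100, n=100, p=5, mu=11.0, psi=10.0, rho=0.5,
                         d=np.full(5, 8.5))
```

With these settings most censoring occurs in the control group, since its mean lies closer to the detection limit.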

The resample-replace imputation procedure with the graphical lasso
Our procedure starts by arranging all biomarkers in ascending order of their percentages of NA values, from left to right in the data matrix. The imputation is then carried out from left to right. If the first biomarker has NA values, its observed values are used to estimate the mean and variance of a truncated normal distribution. Random values following the estimated normal distribution are generated, and those less than the biomarker's LOD value are used to impute its NA values, so that the first biomarker has 'complete' observations. In many proteomics studies there is at least one biomarker with complete observations, in which case this step is not needed. Each subsequent biomarker toward the right is then imputed in turn via a regression-predict step: its observed values are regressed on the corresponding values of all the biomarkers on its left, and its NA values are replaced by the predicted values plus random noise, both obtained from the regression model. After finishing the imputation, we estimate the mean vectors and precision matrices (inverses of the covariance matrices) for cases and controls through the graphical lasso method. The graphical lasso [11] minimizes the negative log-likelihood with an L_1 penalty on the entries of the precision matrix. The estimated mean vectors and precision matrices are then used to obtain the optimal biomarker combination according to (1).

The first step of our procedure is to rearrange the columns of X and Y before imputing missing values. Since the imputation steps are carried out separately for the case and control data, the biomarker orderings for the two datasets after rearrangement need not be the same. We rearrange the biomarkers in increasing order of their missing-data percentages, and for convenience we still use X and Y to denote the data after rearrangement. That is, we have 0 ≤ m_1 ≤ m_2 ≤ · · · ≤ m_p < m and 0 ≤ n_1 ≤ n_2 ≤ · · · ≤ n_p < n.
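The column-rearrangement step can be sketched as follows (a minimal Python illustration; the function name is ours). Keeping the inverse permutation makes it easy to restore the original column order after imputation:

```python
import numpy as np

def reorder_by_missing(Z):
    """Sort the columns of Z in ascending order of missing (NaN) fraction,
    as the procedure requires; also return the permutation that restores
    the original column order after imputation."""
    miss = np.isnan(Z).mean(axis=0)
    order = np.argsort(miss, kind="stable")
    return Z[:, order], np.argsort(order)

Z = np.array([[1.0, np.nan, 3.0],
              [np.nan, np.nan, 4.0],
              [2.0, 5.0, 6.0]])
Z_sorted, restore = reorder_by_missing(Z)  # the complete column moves first
```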
Let a and b be the indices such that m a > 0 and m a−1 = 0, and n b > 0 and n b−1 = 0. For notational convenience, always let m 0 = 0 and n 0 = 0.
Starting with the cases, if a > 1, we have at least one complete biomarker and can directly carry out regression analysis to impute the missing values sequentially. If a = 1, we first use a truncated normal distribution to model the m − m_1 observed values of the first biomarker, with log-likelihood function

l(μ_1, σ_1) = Σ_{i∈M_1} { log φ((x_i1 − μ_1)/σ_1) − log σ_1 − log[1 − Φ((d_1 − μ_1)/σ_1)] },

where φ(·) and Φ(·) denote the probability density function and the cumulative distribution function of the standard normal distribution, respectively. Maximizing this log-likelihood gives the estimates μ̂_1 and σ̂_1² of the mean and variance of the first biomarker.
Then we generate m_1 random numbers, denoted x̃_11, . . . , x̃_{m_1 1}, from the estimated truncated distribution with density function

f(x) = φ((x − μ̂_1)/σ̂_1) / [σ̂_1 Φ((d_1 − μ̂_1)/σ̂_1)], for x < d_1,

and use them to replace the NA values of the first biomarker. This estimate-sample step ensures that there is at least one complete biomarker to initiate the regression procedure.
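The estimate-sample step for the first biomarker can be sketched as follows (a Python illustration on simulated data; the paper does not prescribe a particular optimizer, and the helper names and the parameterization through log σ are ours):

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)

def fit_left_truncated(obs, d):
    """MLE of (mu, sigma) for a normal sample in which only values above
    the detection limit d are observed."""
    def nll(theta):
        mu, log_s = theta
        s = np.exp(log_s)  # log-parameterization keeps sigma positive
        # log-density of N(mu, s^2) truncated to (d, inf)
        return -np.sum(stats.norm.logpdf(obs, mu, s)
                       - stats.norm.logsf(d, mu, s))
    res = optimize.minimize(nll, x0=[obs.mean(), np.log(obs.std())])
    return res.x[0], np.exp(res.x[1])

def resample_below_lod(mu, sigma, d, size):
    """Draw from the fitted normal truncated to (-inf, d): the estimated
    distribution of the unobserved below-LOD values."""
    b = (d - mu) / sigma
    return stats.truncnorm.rvs(-np.inf, b, loc=mu, scale=sigma,
                               size=size, random_state=rng)

full = rng.normal(10.0, 1.0, size=2000)   # hypothetical first biomarker
d1 = 9.5                                  # its LOD
obs = full[full >= d1]                    # only these are observed
mu_hat, s_hat = fit_left_truncated(obs, d1)
imputed = resample_below_lod(mu_hat, s_hat, d1, size=(full < d1).sum())
```

All imputed draws land below the LOD, and the truncated-normal MLE recovers the underlying mean and standard deviation despite roughly 30% censoring.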
The regression analysis is conducted on the remaining biomarkers of X. We let k start from a if a > 1, or from a + 1 if a = 1. The kth biomarker is regressed on all the biomarkers on its left. To be specific, the responses are {x_ik, i ∈ M_k}, the predictor variables are {(x_i1, . . . , x_{i,k−1}), i ∈ M_k}, and the model is

x_ik = β_k0 + Σ_{l=1}^{k−1} β_kl x_il + ε_ik, i ∈ M_k,

where β_k0 is the intercept term, β_kl, l = 1, . . . , k − 1, are coefficients, and ε_ik is the normal error term with mean 0 and unknown variance σ_k². Let Z_k denote the (m − m_k) × k design matrix whose rows are (1, x_i1, . . . , x_{i,k−1}), i ∈ M_k. When Z_k is not of full column rank (so that Z_k^τ Z_k is singular), we apply principal component analysis to Z_k and select the top-ranked principal components that explain 80% of the variance of Z_k (80% is a hyperparameter here, the choice of which is studied in the supplemental material), using them to replace Z_k in the regression analysis. The least squares estimate of β_k = (β_k0, . . . , β_{k,k−1})^τ is β̂_k = (Z_k^τ Z_k)^{−1} Z_k^τ x_k, where x_k = (x_ik, i ∈ M_k)^τ. Each NA value of the kth biomarker is then replaced by

x̃_ik = β̂_k0 + Σ_{l=1}^{k−1} β̂_kl x_il + ε̃_ik, i ∉ M_k,

where ε̃_ik is generated from the normal distribution N(0, σ̂_k²). The above regression process is repeated from k = a (or a + 1) to k = p. After all the missing values in X are imputed, we rearrange its columns back to the original order and denote the complete version as X̃.
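A hedged sketch of the regression-predict step for one column, including a principal-component fallback for a rank-deficient design, might look as follows (Python; the function name, rank test, and the 200-observation example are ours, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(2)

def impute_column(Z, k, var_explained=0.80):
    """Impute the NaNs in column k of Z by regressing its observed entries
    on columns 0..k-1 (already complete), then adding noise drawn from the
    fitted residual variance. When the design is rank deficient, fall back
    to the top principal components explaining `var_explained` of the
    variance, mirroring the paper's 80% rule."""
    obs = ~np.isnan(Z[:, k])
    P = Z[:, :k]
    if np.linalg.matrix_rank(P[obs]) < P.shape[1]:
        Pc = P - P.mean(axis=0)
        U, s, Vt = np.linalg.svd(Pc, full_matrices=False)
        keep = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2),
                                   var_explained)) + 1
        P = Pc @ Vt[:keep].T
    D = np.column_stack([np.ones(Z.shape[0]), P])   # add intercept
    beta, *_ = np.linalg.lstsq(D[obs], Z[obs, k], rcond=None)
    resid = Z[obs, k] - D[obs] @ beta
    miss = ~obs
    # prediction plus regression noise, as in the resample-replace step
    Z[miss, k] = D[miss] @ beta + rng.normal(0.0, resid.std(),
                                             size=miss.sum())
    return Z

# three correlated markers; the third is censored below a LOD of 9
cov = np.array([[1.0, 0.8, 0.8], [0.8, 1.0, 0.8], [0.8, 0.8, 1.0]])
Z = rng.multivariate_normal([10.0, 10.0, 10.0], cov, size=200)
Z[Z[:, 2] < 9.0, 2] = np.nan
Z = impute_column(Z, 2)
```

Because the predictors are strongly correlated with the censored marker, the regression predictions pull the imputed entries toward plausible below-LOD values rather than toward the overall mean.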
The next step of our procedure is to apply the graphical lasso method [11] to the fully imputed case data X̃ to estimate the mean vector and the precision matrix. Denote x̃_i = (x̃_i1, . . . , x̃_ip)^τ, representing the ith subject, and let Θ = V^{−1} = (θ_kl)_{p×p}. Then, up to a constant, the penalized log-likelihood is

l(μ, Θ) = (m/2) log det Θ − (1/2) Σ_{i=1}^{m} (x̃_i − μ)^τ Θ (x̃_i − μ) − λ ‖Θ‖_1,

where λ > 0 is the tuning parameter and ‖Θ‖_1 = Σ_{k,l} |θ_kl|. Setting ∂l/∂μ = 0 gives μ̂ = m^{−1} Σ_{i=1}^{m} x̃_i. After plugging in μ̂, maximizing l(μ̂, Θ) over Θ is equivalent to maximizing

log det Θ − tr(SΘ) − (2λ/m) ‖Θ‖_1,

where S = m^{−1} Σ_{i=1}^{m} (x̃_i − μ̂)(x̃_i − μ̂)^τ is the sample covariance matrix. We use the R package glasso to obtain the estimate of V, denoted V̂.
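The paper fits this criterion with the R package glasso; an equivalent Python sketch uses scikit-learn's GraphicalLasso, which maximizes the same L1-penalized log-likelihood (the AR(1) covariance generating the data below is our choice, for illustration only):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(3)

# a fully imputed case matrix: p = 10 markers, m = 200 subjects
p = 10
idx = np.arange(p)
cov = 0.5 ** np.abs(np.subtract.outer(idx, idx))  # AR(1) covariance
X_imp = rng.multivariate_normal(np.zeros(p), cov, size=200)

# alpha plays the role of the tuning parameter lambda
gl = GraphicalLasso(alpha=0.1).fit(X_imp)
mu_hat = X_imp.mean(axis=0)      # MLE of the mean vector
Theta_hat = gl.precision_        # sparse precision-matrix estimate
V_hat = gl.covariance_           # corresponding covariance estimate
```

The L1 penalty keeps the estimated precision matrix positive definite even when p exceeds the sample size, which is exactly the regime the paper targets.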
Repeating the same imputation-estimation steps on the control data Y, we obtain the estimates of ψ and W, denoted ψ̂ and Ŵ, respectively. Plugging the estimates μ̂, ψ̂, V̂, Ŵ into (1) gives the optimal linear coefficients. The AUC and pAUC estimates of the optimal biomarker combination are obtained by plugging the mean and covariance matrix estimates into (2) and (3), respectively. Note that in many cases we adopt the assumption that the case and control populations have the same covariance matrix, that is, V = W. Under this condition, with Θ = V^{−1} and ỹ_j = (ỹ_j1, . . . , ỹ_jp)^τ denoting the jth imputed control observation, the mean vectors μ and ψ and the common covariance matrix V can be estimated using the penalized log-likelihood

l(μ, ψ, Θ) = ((m + n)/2) log det Θ − (1/2) [ Σ_{i=1}^{m} (x̃_i − μ)^τ Θ (x̃_i − μ) + Σ_{j=1}^{n} (ỹ_j − ψ)^τ Θ (ỹ_j − ψ) ] − λ ‖Θ‖_1,

and the subsequent calculation is the same as described above.
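If (1) is the standard binormal result for the best linear combination (coefficients proportional to (V + W)^{−1}(μ − ψ), with a closed-form AUC, as in [27]), it can be sketched as follows; this is a sketch under that assumption, with our own function name, not the paper's exact code:

```python
import numpy as np
from scipy.stats import norm

def optimal_combination(mu, psi, V, W):
    """Best linear combination under the binormal model: coefficients
    proportional to (V + W)^{-1}(mu - psi), with closed-form
    AUC = Phi( sqrt( (mu - psi)' (V + W)^{-1} (mu - psi) ) )."""
    delta = np.asarray(mu) - np.asarray(psi)
    coef = np.linalg.solve(V + W, delta)
    auc = norm.cdf(np.sqrt(delta @ coef))
    return coef, auc

# univariate sanity check: delta = 1 and V + W = 1 give AUC = Phi(1)
coef1, auc1 = optimal_combination([1.0], [0.0],
                                  np.array([[0.5]]), np.array([[0.5]]))

# a two-marker example with correlated markers
coef2, auc2 = optimal_combination([1.0, 0.5], [0.0, 0.0],
                                  np.array([[1.0, 0.3], [0.3, 1.0]]),
                                  np.array([[1.0, 0.3], [0.3, 1.0]]))
```

In practice the plug-in estimates μ̂, ψ̂, V̂, Ŵ from the graphical lasso step would be passed in place of the true parameters.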
The aforementioned imputation-estimation procedure is repeated r times to generate r AUC estimators AUC_t and r pAUC estimators pAUC_t, t = 1, . . . , r. The final AUC and pAUC estimates are then the averages AUC = r^{−1} Σ_{t=1}^{r} AUC_t and pAUC = r^{−1} Σ_{t=1}^{r} pAUC_t, respectively.
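Under the binormal model, each repetition's pAUC can be computed by numerically integrating the ROC curve over the low false-positive range, and the r values are then averaged; the following is a sketch with our own helper name, assuming the combined score is normal in both groups:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def pauc_binormal(mu1, s1, mu0, s0, alpha):
    """pAUC over false-positive rates [0, alpha] for a combined score that
    is N(mu1, s1^2) in cases and N(mu0, s0^2) in controls. The binormal
    ROC curve is ROC(t) = Phi(a + b * Phi^{-1}(t)) with
    a = (mu1 - mu0)/s1 and b = s0/s1."""
    a, b = (mu1 - mu0) / s1, s0 / s1
    val, _ = quad(lambda t: norm.cdf(a + b * norm.ppf(t)), 0.0, alpha)
    return val

# each of the r imputation-estimation repetitions yields a pAUC estimate;
# the final estimate is their average (repetitions are identical here
# only for brevity)
paucs = [pauc_binormal(1.0, 1.0, 0.0, 1.0, 0.1) for _ in range(5)]
pauc_bar = float(np.mean(paucs))
```

A useful check: an uninformative score (identical case and control distributions) has ROC(t) = t, so pAUC(α) = α²/2.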

Numerical studies
In this section, we present a series of simulation studies to demonstrate the performance of the proposed resample-replace imputation graphical lasso method (abbreviated as RIG) against alternatives that substitute NA values with LOD values (SNL) or ignore observations with NA values (IGN). The ratios of the MSEs of SNL over RIG in estimating pAUCs are reported in Tables 1 and 2: Table 1 shows the ratios calculated from estimates of pAUCs at α = 0.05 and 0.1, and Table 2 shows the values at α = 0.2 and 1. It can be seen that in all settings the proposed RIG method performs better than the SNL method, since all the ratios are larger than 1. From both tables, we observe the following. (1) For the same sample size combination, the ratios increase as ρ increases. This is expected since the RIG method is based on the idea that the conditional mean of a biomarker is a linear function of the other biomarkers; when the correlation ρ is larger, the mean of the data to be imputed can be estimated more precisely. For example, when (m, n) = (100, 100) and the LOD is 7.5, the ratio for pAUC(0.05) under ρ = 0.2 is 1.013, while it increases to 1.413 under ρ = 0.8. (2) For the same correlation ρ, the ratio becomes larger as the sample sizes increase, and the results in the tables indicate that the estimates of the proposed RIG method are closer to their true values. For example, when ρ = 0.5 and the LOD is 8.5, the corresponding ratios for pAUC(0.05) are 1.003, 1.047 and 1.124 for the sample sizes (m, n) = (50, 50), (75, 75) and (100, 100), respectively.
The ratios of the MSEs of IGN over RIG are also given in Tables 1 and 2. It can be seen that in most scenarios the proposed RIG method performs better than the IGN method. Both tables contain some NA entries. This is because the IGN method may remove all the samples when the LOD level is relatively high and the sample size is small, making it infeasible to obtain parameter estimates; in that case, we record the result as NA in the table. These tables show the following patterns. (1) When the sample sizes remain the same, the larger the correlation ρ is, the better the proposed RIG method performs. For example, when (m, n) = (75, 75) and the LOD is 8.5, the ratios for pAUC(0.1) change from 0.987 to 1.189 as ρ increases from 0.2 to 0.8. (2) When the LOD level is high, the proposed RIG method does not always perform better; the ratios can be less than 1 when (m, n) = (50, 50). But the RIG method still outperforms the IGN method when the sample sizes are not very small. For example, when ρ = 0.2 and the LOD is 8.5, the respective ratios for pAUC(0.05) are NA, 0.993 and 1.01 for the sample sizes (50, 50), (75, 75) and (100, 100). This is because the estimates of the mean vectors and covariance matrices based on the IGN method are biased when sample sizes are small. The ratios get larger as the sample size increases. It is more reasonable to compare the IGN and RIG methods under larger sample sizes, since the percentages of NA values among the 1000 pAUC estimates obtained from the IGN method are then smaller, and their influence on estimation accuracy becomes smaller as well. When the sample sizes are larger, the ratios are generally larger than 1, indicating that the proposed RIG method performs better.
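For reference, the two baselines compared against RIG can be expressed in a few lines (a Python sketch; the function names are ours):

```python
import numpy as np

def snl(Z, d):
    """SNL baseline: substitute every NA with the marker's LOD value."""
    Z = Z.copy()
    for k in range(Z.shape[1]):
        Z[np.isnan(Z[:, k]), k] = d[k]
    return Z

def ign(Z):
    """IGN baseline: drop every observation (row) containing an NA. With a
    high LOD and a small sample this can delete the entire dataset, which
    is why some table entries are reported as NA."""
    return Z[~np.isnan(Z).any(axis=1)]

d = np.array([9.0, 9.0])
Z = np.array([[10.0, np.nan],
              [np.nan, 9.5],
              [10.5, 11.0]])
Z_snl = snl(Z, d)   # NAs become 9.0
Z_ign = ign(Z)      # only the last row survives
```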
In practice, not all of the biomarkers provide useful information for distinguishing the case and control groups. To take this fact into consideration, we designed a new simulation setup: we set the first ten elements of μ to 11 and the rest to 10, keeping all the other parameter settings unchanged. The results are presented in Tables 3 and 4 and show the same pattern as those in Tables 1 and 2. We note that some ratios are much larger than one, so we denote ratios larger than 5 by '> 5' in the tables. In most of the scenarios considered here, the proposed RIG method performs the best among the three methods. Despite its overall superiority, RIG has larger MSEs than IGN when m = n = 75, ρ = 0.2, the LOD is 8.5 and α = 0.05, 0.1, 0.2. The reason is that the correlation between variables is weak (only 0.2) and the missing rate is high due to the high LOD value; under this scenario, the resample-replace procedure in RIG is less accurate than in other scenarios and has larger MSEs.
When several biomarkers have the same missing percentage, the order of imputation for these biomarkers affects the outcome of RIG. This problem can be addressed in three ways. The first strategy, suggested by the reviewers, is to permute the imputation order and average the multiple estimates of the mean vectors and covariance matrices. The second strategy is to set the imputation order randomly and estimate the unknown parameters. A third alternative is to go through all possible imputation orders and choose the one that produces the largest AUC as the final result. To strike a balance between accuracy and efficiency, we employ the first strategy, and denote the resulting variant of RIG as pRIG. As for handling data with missing values, in addition to the two methods mentioned above (replacing NAs with LOD values and ignoring observations with NAs), there are two other commonly used imputation strategies: replacing NAs with LOD/2 or LOD/√2 values, and imputing NAs via multiple imputation; these are denoted SNhL, SNshL and MI, respectively. We compare the performance of pRIG, SNhL, SNshL and MI through a series of simulation studies. For pRIG, the number of permutations for each group of biomarkers with equal missing rates is 10. The sample sizes are set to 15, 50 and 100, and the correlation ρ takes the values 0.2 and 0.8.
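The tie-permutation idea behind pRIG can be sketched as follows (Python; our helper name, with a small example of four markers in which two share a missing rate):

```python
import random
from itertools import groupby

def tie_permuted_orders(miss_frac, n_perm, seed=4):
    """Generate n_perm imputation orders that sort markers by missing
    fraction but randomly permute markers with tied fractions; pRIG runs
    the imputation-estimation steps once per order and averages the
    resulting parameter estimates."""
    rng = random.Random(seed)
    idx = sorted(range(len(miss_frac)), key=lambda k: miss_frac[k])
    orders = []
    for _ in range(n_perm):
        order = []
        for _, grp in groupby(idx, key=lambda k: miss_frac[k]):
            grp = list(grp)
            rng.shuffle(grp)   # break ties at random
            order.extend(grp)
        orders.append(order)
    return orders

# markers 1 and 2 share a missing rate, so only their relative
# order varies across the ten permutations
orders = tie_permuted_orders([0.0, 0.2, 0.2, 0.1], n_perm=10)
```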
The simulation results are provided in Table 5 in the form of MSE ratios, i.e. the ratios of the MSEs in pAUC estimation between SNhL and pRIG, SNshL and pRIG, and MI and pRIG. It can be seen that pRIG has markedly smaller MSEs than SNhL and MI in almost all the scenarios considered here. Specifically, when the sample sizes remain the same, the relative performance of pRIG improves as ρ increases. For example, in the scenarios where n = m = 100 and the LOD is 7.5, the ratios are about 1.3 in general when ρ = 0.2, while when ρ = 0.8 the ratios are near 1.8 for SNhL/pRIG, more than 8 for SNshL/pRIG, and also near 1.8 for MI/pRIG. This is because pRIG is more accurate when the variables are more correlated. When the sample sizes are very small (n = m = 15), the ratios are higher, which can be explained by the large variation due to the small sample size. Compared with SNhL, SNshL and MI, pRIG thus tends to have smaller MSEs and performs better in pAUC estimation.
Summing up, the method pRIG is a variation of the proposed RIG that can deal with situations when there are several biomarkers having the same missing percentage. Based on the simulations above, we can see that RIG, together with pRIG, performs the best in pAUC estimations in most of the considered simulations.

A protein profiling study
The proposed RIG method is motivated by a protein profiling study of glioblastoma patients and their survival status at the Center for Applied Proteomics and Molecular Medicine at George Mason University. Glioblastoma is one of the most aggressive brain tumors in adults, with poor prognosis. Even with major advances in medical imaging techniques and cancer therapies, the prognosis has not improved according to [12]. Although the median survival of glioblastoma patients is less than 12 months, a small percentage of patients survive more than 36 months [14,18,26]. Reference [2] investigated the mRNA and protein expression profiles of glioblastoma tissues from patients who survived longer than 36 months (classified by the authors as long-term survival) and those who survived less than 6 months (classified as short-term survival). The authors identified a significant difference in the mRNA expression profiles of three signaling genes and two cellular acid-binding proteins between long-term and short-term survivors.
It is worth noting that the prevalence of glioblastoma is about 3 per 100,000 per year according to [12], and studies of this disease usually involve a small number of patients. The study in [2] has only 11 long-term survivors and 12 short-term survivors, with patient ages ranging from 33 to 86. All the short-term survivors were older than 50, while the long-term survivors were mostly younger, some around 40. Protein profiles were measured on the glioblastoma tissues of these patients. The main goal of the study is to identify the optimal linear combination of these protein profiles that distinguishes long-term from short-term survival status. Forty-five biomarker profiles with LOD are of particular interest for the combination; each of these profiles has less than 30% of observations below the LOD. We first estimated the mean vectors and covariance matrices; the corresponding results are shown in Figure 1. Using these estimates, we obtained the coefficients for the optimal linear combination based on (1). Since the investigators are particularly interested in partial AUC values at low specificities, we calculated the respective partial AUCs at α = 0.05, 0.1, and 0.2. The estimates of these pAUCs based on the proposed method are 0.05, 0.1, and 0.2, respectively. However, when combining the biomarkers after replacing the NA values with zeros, the corresponding pAUC estimates are 0.03989, 0.08372, and 0.17503, respectively. This indicates that simply replacing NA values with constants may underestimate the diagnostic accuracy of the combined protein biomarkers for distinguishing short-term and long-term survival status in glioblastoma patients.

Conclusion
We propose a novel method to combine biomarkers in the presence of a limit of detection. The method is motivated by a proteomic study in which part of the data is missing due to the LOD and the sample sizes are smaller than the number of biomarkers of interest. Since existing methods either replace missing data with constants or remove observations that have missing values, they lead to biased parameter estimates. Moreover, because the sample covariance matrices are singular in this setting, traditional estimation methods that rely on the inverse of sample covariance matrices are not applicable. To deal with these two challenges, we first propose a resample-replace imputation procedure to impute missing data based on the multivariate normal model and linear regression techniques, and then employ the graphical lasso method to estimate the mean vectors and covariance matrices. The simulation studies show that the proposed method outperforms two other widely used methods in estimating partial AUC values for a biomarker combination. Furthermore, a data analysis based on a glioblastoma study conducted by the Center for Applied Proteomics and Molecular Medicine shows that the biomarker combination obtained via the proposed method has higher diagnostic accuracy.
Our method arranges the biomarkers in increasing order of their missing percentages, and each biomarker is imputed by regressing it on the biomarkers with lower missing ratios. It requires that at least one biomarker has complete observations, a requirement satisfied in many proteomics studies, including the one used in this paper. Although each imputation uses only part of the available biomarkers, the estimated covariance matrices are still consistent after the graphical lasso step. It is feasible to impute the values below the LOD without first rearranging the biomarkers, but the variation of the parameter estimates will then increase.
In the imputation procedure, RIG applies principal component analysis in the regression model, selecting the top-ranked principal components that explain 80% of the variance. The value 80% is a hyperparameter (denoted q) that may affect the performance of RIG. To study the sensitivity of RIG to this choice, we compared its MSEs under different values of q and found that the other values of q commonly used in practice make little difference to the performance of RIG. Details are provided in the supplemental material.
Alternative versions of the sequential linear regression used in this article have been used to impute missing values in microarray data and survey data [7,23], where the authors reported satisfactory results for missing rates as large as 30% for a predictor variable. A thorough review of imputation using sequential regression is provided in [4]. However, our paper is the first to establish the statistical properties of such imputation methods.
A recent paper by Augugliaro et al. [1] also considered using the graphical lasso for high-dimensional censored data. Although both their work and ours can handle the LOD in the high-dimensional setting, their focus differs from ours in several respects. First, they consider only the one-sample problem, while ours is a two-sample problem. Second, they are interested in estimating the relationships between gene biomarkers, while we focus on the classification problem of distinguishing cases from controls.
A data configuration similar to data with an LOD is a mixture distribution with a point mass at 0 and a continuous component on positive values. The Tobit model is traditionally used to deal with this kind of data. Under certain conditions, the 0's in the data can be treated as a proxy for values smaller than 0. It is worth noting that the proposed method can also handle this type of data.
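Treating exact zeros as censored values so that the resample-replace procedure can impute them might look like the following minimal sketch (Python; the function name is ours, and a real analysis would also need to choose the effective detection limit for such data):

```python
import numpy as np

def zeros_to_censored(Z):
    """Tobit-style preprocessing: recode exact zeros as NA so that the
    resample-replace procedure treats them as censored (below-detection)
    values rather than genuine measurements."""
    Z = np.asarray(Z, dtype=float).copy()
    Z[Z == 0.0] = np.nan
    return Z

Z = zeros_to_censored([[0.0, 1.2], [3.4, 0.0]])
```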
Simulations in the main text and in the supplemental material show that the proposed resample-replace imputation strategy produces consistent estimates of pAUCs. We prove that, asymptotically, the imputed data have the same distribution as a sample from the corresponding (unknown) original population when p is fixed; the details are given in the supplemental material. However, proving consistency in high-dimensional settings is non-trivial, and that theory will be developed in future work. Moreover, when the number of imputed values grows with the sample size, the correlation structure among the imputed values becomes complicated and non-trivial to estimate, so more work is needed to prove consistency in this case. Also, as a referee pointed out, it might be possible to use the lasso rather than PCA for the imputation; appropriately implementing the lasso and comparing its performance with the PCA approach described here is left as a future topic.