Low-Rank Regression Models for Multiple Binary Responses and their Applications to Cancer Cell-Line Encyclopedia Data

Abstract In this article, we study high-dimensional multivariate logistic regression models in which a common set of covariates is used to predict multiple binary outcomes simultaneously. Our work is primarily motivated by biomedical studies with correlated multiple responses, such as the cancer cell-line encyclopedia project. We assume that the underlying regression coefficient matrix is simultaneously low-rank and row-wise sparse. We propose an intuitively appealing selection and estimation framework based on the marginal model likelihood, and we develop an efficient computational algorithm for inference. We establish a novel high-dimensional theory for this nonlinear multivariate regression. Our theory is general, allowing for potential correlations between the binary responses. We propose a new type of nuclear norm penalty using the smoothly clipped absolute deviation, filling a gap in the non-convex penalization literature. We theoretically demonstrate that the proposed approach improves estimation accuracy by considering multiple responses jointly when the underlying coefficient matrix is low-rank and row-wise sparse. In particular, we establish non-asymptotic error bounds, and both rank and row support consistency of the proposed method. Moreover, we develop a consistent rule to simultaneously select the rank and row dimension of the coefficient matrix. Furthermore, we extend the proposed methods and theory to a joint Ising model, which accounts for the dependence relationships. In our analysis of both simulated data and the cancer cell line encyclopedia data, the proposed methods outperform existing methods in predicting responses. Supplementary materials for this article are available online.


Introduction
Logistic regression is the standard approach to analyzing the relationship between a binary response variable and a set of explanatory variables. For high-dimensional data, where the number of covariates often exceeds the sample size, a number of penalized methods (Lee and Silvapulle 1988; Cessie and Houwelingen 1992; Meier, van de Geer, and Bühlmann 2008; Li et al. 2015) have been developed to estimate the model parameters. These methods have been applied to analyze biomedical data, such as cancer classification and gene interaction detection; see Zhu and Hastie (2004), Park and Hastie (2008), Hastie, Tibshirani, and Friedman (2009), and Hung and Wang (2012) for extensions.
Although most of the proposed methods consider one binary response, in an increasing number of applications multiple binary outcomes need to be predicted by a common set of predictors (Chen and Huang 2012; Martin et al. 2021). For multiple binary responses arising from applications such as cancer cell line data analysis, where the treatment responses from multiple drugs are measured on each cell line, standard logistic regression can be applied by combining the multiple binary outcomes into a single categorical response (Greenlund et al. 2005; Hayes et al. 2006). The main limitation of this approach is that it may hide potential outcome-specific covariate effects, leading to inefficient estimates. Another approach is to fit separate logistic models for each response (Koziol-McLain, Coates, and Lowenstein 2001; Cumyn, French, and Hechtman 2009). Although the effects of the covariates on each response can then be distinguished, this approach does not consider the potential relationship among the multiple responses. When analyzing the Cancer Cell Line Encyclopedia (CCLE) (Garnett et al. 2012) data, which is presented in detail in Section 4.1, the binary outcomes of multiple drugs tested on the same cell line are likely to be correlated within that cell line (Fitzmaurice et al. 1995; France and Velanovich 2009; Carpenter 2010). Because multiple drug responses are available for each individual cancer cell line, there could be an information gain when jointly considering all the drug responses for the same cell line. Thus, one could consider multiple response logistic regression models that exploit information on the common gene expression and the multiple drug response data derived from each cell line. Despite this potential advantage, multiple binary response regression based on a reduced set of predictors has not been rigorously investigated in the literature.
To address the limitations of the existing methods for multiple binary responses, we establish a high-dimensional multivariate logistic regression framework that estimates the coefficient matrix by imposing a structural assumption on this matrix. For the analysis of the CCLE data, it is reasonable to assume that the coefficient matrix has a certain structure; for example, specific genes could have similar effects on multiple responses. Because the dimension of the CCLE data is large, owing to the large number of genes as well as the number of drugs tested, standard statistical inference may be unstable. It is therefore important to develop methods that deal more efficiently with matrix parameter estimation. In this study, we assume that the underlying coefficient matrix is both row-wise sparse and low-rank to capture the potential similarities among the effects of different drugs and to better predict treatment responses.
Such low-rank and row-wise sparse assumptions are commonly used in a wide variety of matrix-related estimation problems to reduce the effective number of model parameters and to achieve better statistical performance. Among the numerous approaches for low-rank matrix estimation, nuclear norm regularization is a popular approach, with applications including multitask regression (Yuan et al. 2007; Bach 2008; Rohde and Tsybakov 2011; Negahban and Wainwright 2011; Chen, Dong, and Chan 2013; She and Chen 2017; Cho and Park 2022), matrix completion (Candès and Recht 2009; Keshavan, Montanari, and Oh 2010; Koltchinskii, Lounici, and Tsybakov 2011; Negahban and Wainwright 2012), and collaborative filtering (Srebro, Rennie, and Jaakkola 2005). The nuclear norm is the sum of the singular values of a matrix, so nuclear norm regularization encourages many zero singular values in the estimated matrix, that is, it leads to a low-rank estimate. Chen, Dong, and Chan (2013) proposed an adaptive nuclear norm penalization approach in high-dimensional multivariate linear regression.
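To make the low-rank mechanism concrete, the following sketch (our own illustration, not code from the article or its supplement) computes the nuclear norm and applies singular value thresholding, the proximal operator of the nuclear norm that underlies many algorithms for such penalties:

```python
import numpy as np

def nuclear_norm(B):
    """Nuclear norm: the sum of the singular values of B."""
    return np.linalg.svd(B, compute_uv=False).sum()

def svt(B, tau):
    """Proximal operator of tau * (nuclear norm): soft-threshold
    the singular values of B, zeroing out the small ones."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(0)
# a rank-2 matrix plus small noise
B = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 5))
B += 0.01 * rng.standard_normal((8, 5))
B_low = svt(B, tau=0.5)  # thresholding removes the tiny noise singular values
```

The SCAD-based variant proposed in this article shrinks large singular values less aggressively than this uniform soft-thresholding, which is the source of its improved selection behavior.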
Another line of work, called reduced-rank regression (RRR) (Izenman 1975; Reinsel and Velu 1998), is based on a restriction imposed on the rank of the regression coefficient matrix; it has proven to be effective for the prediction of multiple responses from common predictors. As for the estimation of this rank-deficient coefficient matrix parameter in RRR, minimax optimal estimation has been studied in the literature (Bunea, She, and Wegkamp 2011; Giraud 2011; Bunea, She, and Wegkamp 2012; She 2017; Bing and Wegkamp 2019; She and Tran 2019). Bunea, She, and Wegkamp (2011) proposed a rank selection criterion for selecting the optimal reduced rank estimator of the coefficient matrix in multivariate response regression models. She (2017) proposed selective RRR for constructing optimal explanatory factors. She and Tran (2019) developed a predictive information criterion to achieve minimax optimality when the coefficient matrix parameter is both reduced rank and row-wise sparse. For variable selection methods in RRR, see Turlach, Venables, and Wright (2005), Chun and Keles (2010), and Peng et al. (2010). Most related studies, including the papers cited above, have focused on linear regression type problems with continuous multivariate responses. For multivariate categorical responses, the class of vector generalized linear models has been widely used but examined only in computational and applied studies, for example, Yee (2015).
In this article, we establish high-dimensional multivariate logistic regression and general asymptotic theory under low-rank and row-wise sparse assumptions. This facilitates suitable statistical estimation and inference for multiple binary responses, such as in the CCLE data application. To the best of our knowledge, this study is one of the first to develop rigorous multivariate regression theory for discrete responses such as binary variables. We propose new nonconvex penalized estimation methods under both marginal and joint model set-ups, and we conduct novel theoretical analyses under weak dependencies of the binary responses. The proposed penalization uses the sums of the smoothly clipped absolute deviation (SCAD) (Fan and Li 2001) transformations of the singular values and of the Euclidean norms of the rows, respectively. The first part, that is, the sum of the SCADs of the singular values, is interpreted as a non-convex variant of nuclear norm penalization. Despite their remarkable popularity for parsimonious model selection, such nonconvex nuclear norm penalizations via SCAD have not been studied in the high-dimensional matrix literature. For an overview of the nonconvex penalization research, see Fan and Lv (2010) and references therein. It is worth noting that most existing literature on nonconvex penalties analyzes coordinate-separable penalty functions satisfying R(β) = Σ_j R(β_j); for example, see Loh and Wainwright (2014, 2017). In contrast, our penalty functions for rank selection are not coordinate-separable, which poses challenges in developing theoretical properties. By addressing the theoretical and computational issues due to the nonconvexity of the nuclear penalization, we prove that the proposed optimization has a unique stationary point, so that it avoids the potential multiple local minima problem in nonconvex optimizations. That is, the global minimizer B̂ is indeed the unique stationary point of the proposed nonconvex problem in multivariate logistic regression models.
We establish high-dimensional statistical theory for the unique stationary point B̂. Specifically, we show that B̂ enjoys a tight estimation error bound in terms of the operator norm, with a rate of ‖B̂ − B*‖_op ≲ {max(r_S, q)/n}^{1/2}, where S ⊆ {1, …, p} represents the index set of nonzero rows of the matrix B*, r_S = rank(X_{·,S}) ≤ s = |S|, and q is the number of responses. Additionally, the developed method can recover the true rank r ≤ min(p, q) of the underlying coefficient matrix B* ∈ R^{p×q} and the true row support set S at the same time. Moreover, we propose a new data-adaptive selection criterion to choose the rank and row dimension of the coefficient matrix simultaneously. We prove that it provides an estimate with rank consistency as well as row support consistency. Compared with the theoretical analyses in the existing multivariate regression literature in the high-dimensional setting (Chen and Huang 2012; Bunea, She, and Wegkamp 2012; She and Chen 2017; Bing and Wegkamp 2019; She and Tran 2019; Zou, Ke, and Zhang, in press), our results are new in that we consider discrete multiple responses, that is, multivariate logistic regression, and a nonconvex version of nuclear norm penalization, and we develop a new data-adaptive choice for consistent simultaneous selection of the rank and nonzero rows of a matrix parameter. For detailed comparisons to the existing literature, we refer to Sections 2.2.1 and 3.4. Further, we develop an alternating direction method of multipliers (ADMM) (Boyd et al. 2011) algorithm in our implementation to solve the proposed constrained nonconvex optimization.
The remainder of this article is organized as follows. In Section 2, we introduce the proposed marginal and joint models, respectively, to predict multiple binary responses simultaneously, and develop statistical procedures and theory. In Section 3, we perform simulation studies to evaluate the finite sample performance of the developed methods. In Section 4, we apply the developed methods to the CCLE data to illustrate the advantage of our method over the existing methods. All the technical proofs are deferred to the supplementary materials.

Marginal Logistic Regression Models
We consider a high-dimensional multivariate marginal logistic regression, where for units i = 1, …, n with a covariate vector x_i ∈ R^p, there exist q binary responses y_i^k, k = 1, …, q, modeled as

logit(p_i^k) = x_i^T (β*)^k, k = 1, …, q, (1)

where p_i^k = Pr(y_i^k = 1 | x_i) is the conditional probability of y_i^k = 1 given x_i, and both dimensions p and q are allowed to increase with n. Such datasets are widely encountered in an increasing number of biomedical studies. For example, for the CCLE data, the gene expression levels x_i and the treatment responses y_i^k corresponding to drugs k = 1, …, 24 were measured for each of the 482 cell lines i = 1, …, 482. As the 24 treatment outcomes are likely to be correlated within a cell line, we need a simultaneous logistic modeling of the multiple responses to exploit information on the common gene expression. We also allow dependence between responses, that is, the y_i^k for k = 1, …, q may satisfy cov(y_i^k, y_i^{k′} | x_i) ≠ 0 for k ≠ k′ given x_i. With multiple responses, the model parameters can be written in matrix form, namely B* = [(β*)^1, (β*)^2, …, (β*)^q] ∈ R^{p×q}, and we impose both low-rank and row-wise sparse constraints on this high-dimensional matrix B*. That is, we assume that the row support set of B* is S = {1, …, s} with s ≪ p, and rank(B*) = r ≤ min(s, q). These assumptions are plausible in the CCLE data application: some responses (drugs) may be similar, that is, their associations with the covariates (gene expression levels) are similar, and only a few variables (genes) are associated with the responses (drugs), that is, only the first s rows of B* can have nonzero values. The low-rank assumption can also be interpreted as the underlying coefficients (β*)^k lying inside a lower-dimensional space (Rohde and Tsybakov 2011; Bunea, She, and Wegkamp 2012; Chen and Ye 2014). In our theoretical analyses, the number of covariates p, the true row model size s, and the number of responses q are allowed to increase with the sample size n.
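As a concrete toy example of this data structure (the sizes and distributions below are our own choices, not the CCLE dimensions), one can simulate q binary responses from Model (1) with a low-rank, row-wise sparse B*:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q, s, r = 200, 50, 8, 5, 2   # toy sizes: s nonzero rows, true rank r

# low-rank, row-wise sparse coefficient matrix B*: only the first s rows
# are nonzero, and those rows span an r-dimensional space
B_star = np.zeros((p, q))
B_star[:s, :] = rng.standard_normal((s, r)) @ rng.standard_normal((r, q))

X = rng.standard_normal((n, p))            # covariates x_i as rows
P = 1.0 / (1.0 + np.exp(-X @ B_star))      # n x q matrix of p_i^k
Y = (rng.random((n, q)) < P).astype(int)   # q binary responses per unit
```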
For the matrix notation, define the binary response matrix Y = [y^1, …, y^q] ∈ R^{n×q} and the covariate matrix X = [x_1, …, x_n]^T ∈ R^{n×p}, and let ℓ(β^k | x_i, y_i^k) (k = 1, …, q) be the negative log-likelihood function for the conditional distribution y_i^k | x_i of each drug response y^k given the covariates x_i. Following Meier, van de Geer, and Bühlmann (2008), we assume that the probability p_i^k in Model (1) satisfies

‖XB*‖_max ≤ −logit(ε_n) (2)

for some ε_n ∈ (0, 0.5), so that we can consider settings where the probability p_i^k approaches zero or one by allowing ε_n → 0, where ‖A‖_max = max_{i,j} |A_{i,j}| for a matrix A. Because ‖XB*‖_max ≤ ‖X‖_max Σ_j ‖B*_{j,·}‖_2, where B*_{j,·} is the jth row of B*, the inequality Σ_j ‖B*_{j,·}‖_2 ≤ −logit(ε_n)/‖X‖_max is a sufficient condition for (2). Motivated by this, we consider the following feasible set: Θ := {B = [β^1, …, β^q] ∈ R^{p×q} : Σ_j ‖B_{j,·}‖_2 ≤ b} for some b > 0. One could estimate the coefficient matrix B* by using the likelihood principle, that is, minimizing the negative marginal likelihood Σ_{i=1}^n Σ_{k=1}^q ℓ(β^k | x_i, y_i^k), subject to low-rank and row-sparse space constraints on B, that is, B ∈ {[β^1, …, β^q] ∈ Θ : rank(B) ≤ r, B_{S^c,·} = 0}, where r ≤ min{p, q} and S ⊆ {1, 2, …, p} are pre-specified. Denote this estimator, that is, the constrained maximum likelihood estimator (MLE), by B̃_{r,S}. The negative logistic likelihood Σ_{i=1}^n Σ_{k=1}^q ℓ(β^k | x_i, y_i^k) is obtained under the assumption that the multiple binary variables y_i^k, k = 1, …, q, are independent. It is also used as the marginal negative likelihood for models allowing for correlated multiple binary responses, such as in the CCLE data application. Usually, it is quite complicated to model the dependence relationships between variables, in particular in high dimensions as in our CCLE application. If the (conditional) marginal distribution y^k | x for k = 1, …, q is of major interest, the marginal likelihood may be conveniently used; it has been employed in various problems with dependent variables (Wilson 1989; Welsh, Lin, and Carroll 2002; Jentsch, Lee, and Mammen 2021; Lee and Park 2021). Refer to Section 2.4 for the joint model setting of the multiple responses involving the dependence structures.
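A minimal, numerically stable implementation of this negative marginal likelihood (our own sketch; ℓ(β^k | x_i, y_i^k) is taken to be the usual Bernoulli negative log-likelihood) is:

```python
import numpy as np

def neg_marginal_loglik(B, X, Y):
    """Negative marginal log-likelihood sum_i sum_k l(beta^k | x_i, y_i^k):
    each term equals log(1 + exp(z)) - y*z with linear predictor
    z = x_i^T beta^k, treating the q responses as (working-)independent
    given x_i.  np.logaddexp avoids overflow for large |z|."""
    Z = X @ B                                  # n x q linear predictors
    return float(np.sum(np.logaddexp(0.0, Z) - Y * Z))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
Y = rng.integers(0, 2, size=(50, 3))
B0 = np.zeros((6, 3))
# at B = 0 every p_i^k = 1/2, so the value is n * q * log(2)
val = neg_marginal_loglik(B0, X, Y)
```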
Because it is computationally infeasible to choose a rank r and row space S together, especially in high dimensions, in this work we develop a novel penalized approach for the simultaneous choice of the matrix rank and row space and for coefficient matrix estimation. More specifically, we consider the following optimization with penalty functions P_{λ1}(·) and P_{λ2}(·):

B̂ = argmin_{B ∈ Θ} (1/n) Σ_{i=1}^n Σ_{k=1}^q ℓ(β^k | x_i, y_i^k) + Σ_{l=1}^{p∧q} P_{λ1}{σ_l(B)} + Σ_{j=1}^p P_{λ2}(‖B_{j,·}‖_2), (3)

where λ1, λ2 > 0 are regularization parameters, σ_j(B) is the jth largest singular value of B, and the P_{λk}(·) are SCAD functions (Fan and Li 2001) with regularization parameters λ1, λ2, a > 0:

P_λ(t) = λt if 0 ≤ t ≤ λ; {2aλt − t² − λ²}/{2(a − 1)} if λ < t ≤ aλ; λ²(a + 1)/2 otherwise.
We set a = 3.7 as suggested by Fan and Li (2001) and Tang, Park, and Zhao (2022), a choice known to perform well based on cross-validated empirical studies (Loh and Wainwright 2014). In (3), two different penalty terms are included. The first one, Σ_{l=1}^{p∧q} P_{λ1}{σ_l(B)}, aims to obtain a low-rank estimate. The second one, Σ_{j=1}^p P_{λ2}(‖B_{j,·}‖_2), aims to obtain a row-wise sparse estimate. Then, S and r are estimated by the row support set and rank of the estimate B̂, that is, Ŝ = row-support(B̂) and r̂ = rank(B̂). Note that the proposed optimization (3) is nonconvex due to the nonconvex SCAD penalty terms, and thus can involve multiple local minima. In Section 2.2, we prove that the proposed optimization (3) actually has a unique stationary point under some conditions, that is, the global minimum point is the unique stationary point. We also derive the estimation error bound of the proposed estimator (3) and show the rank consistency and row-selection consistency of the proposed estimator.
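A direct transcription of the SCAD function and of the two penalty terms in (3) (a sketch under our reading of the displayed formula; `penalty_terms` is a hypothetical helper name, not the authors' code):

```python
import numpy as np

def scad(t, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001), applied elementwise to |t|:
    linear up to lam, a quadratic blend up to a*lam, constant thereafter."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(
        t <= lam,
        lam * t,
        np.where(
            t <= a * lam,
            (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
            lam ** 2 * (a + 1) / 2,
        ),
    )

def penalty_terms(B, lam1, lam2, a=3.7):
    """The two penalty terms of (3): SCAD of the singular values
    (rank penalty) plus SCAD of the Euclidean row norms (row penalty)."""
    sv = np.linalg.svd(B, compute_uv=False)
    row_norms = np.linalg.norm(B, axis=1)
    return scad(sv, lam1, a).sum() + scad(row_norms, lam2, a).sum()
```

Unlike the Lasso-type penalty, `scad` is constant beyond a*lam, so large singular values and large rows incur no additional shrinkage.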

Assumptions and Results
Throughout the theoretical considerations, let vec(·) : R^{p×q} → R^{pq} be the transformation of a matrix into a vector, obtained by stacking its columns into a column vector, and let mat(·) : R^{pq} → R^{p×q} be its inverse operation. For a function f, let ∇f denote a gradient or subgradient, if it exists.
Throughout the article, ‖A‖_op represents the operator norm of a matrix A, that is, the maximum singular value of A, and λ_max(B) is the maximum eigenvalue of a symmetric matrix B. We write a ≲ b if a ≤ C_1 b for some positive constant C_1, write a ≍ b if a ≲ b and b ≲ a, and use a∨b and a∧b to denote max(a, b) and min(a, b), respectively. For a matrix A, let row-support(A) = {j : A_{j,·} ≠ 0} be the index set of nonzero rows of A, and support(A) = {(j, k) : A_{j,k} ≠ 0} be the index set of nonzero entries of A. For a vector x, let ‖x‖_0 be the number of nonzero elements of x. For a matrix A, let ‖A‖_F denote its Frobenius norm. For a matrix A and index sets of rows and columns, say S_1 and S_2, let A_{S_1,·} and A_{·,S_2} be the submatrices of A with rows in S_1 and columns in S_2, respectively. Let r_X ≤ n ∧ p and r_S ≤ n ∧ s be the ranks of X and X_{·,S}, respectively. A stationary point B̂ = [β̂^1, …, β̂^q] of the optimization (3) is any point in the feasible set satisfying the first-order optimality conditions. Throughout, Q*_k = Σ_{i=1}^n p_i^k (1 − p_i^k) x_i x_i^T / n represents the Fisher information matrix related to the kth response.
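These notational conventions translate directly into code; the following helpers (our own, purely illustrative) mirror vec(·), mat(·), the operator norm, and row-support(·):

```python
import numpy as np

def vec(A):
    """vec(.): stack the columns of A into one long vector."""
    return np.asarray(A).reshape(-1, order="F")

def mat(v, p, q):
    """mat(.): inverse of vec, reshaping a length-(p*q) vector to p x q."""
    return np.asarray(v).reshape(p, q, order="F")

def op_norm(A):
    """Operator norm: the maximum singular value of A."""
    return np.linalg.svd(A, compute_uv=False).max()

def row_support(A):
    """Index set of nonzero rows of A."""
    return np.flatnonzero(np.linalg.norm(A, axis=1) > 0)

A = np.array([[1.0, 0.0], [0.0, 0.0], [2.0, 3.0]])  # rows 0 and 2 nonzero
```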
We make the following assumptions to facilitate our technical derivations.
Assumption 2. [On the correlations between the drug responses] Denote the conditional covariance matrix of the multiple responses y_i = (y_i^1, …, y_i^q) given x_i = x by Σ_x := cov(y_i | x) ∈ R^{q×q}. Then, the largest eigenvalue of Σ_x is uniformly bounded, that is, sup_{x∈R^p} λ_max(Σ_x) ≤ q_y for some 0 < q_y < ∞.
Assumption 3. [On the coefficient matrix] The underlying coefficient matrix B* lies in the feasible set, and its nonzero singular values and the Euclidean norms of its nonzero rows are bounded away from zero at suitable rates (beta-min type conditions).

Assumption 4. [On the eigenvalues] The eigenvalues of X_{·,S}^T X_{·,S}/n are bounded from above by some constant Λ_max(S).

Assumption 5. [On the design correlations] The correlations between the relevant covariates X_{·,S} and the irrelevant covariates X_{·,S^c} are suitably restricted, as in Zheng, Peng, and He (2015).

Assumption 1 is referred to as a restricted strong convexity condition, which is also imposed in Raskutti, Wainwright, and Yu (2010) and Loh and Wainwright (2014, 2017), where α_k represents a curvature parameter and γ_k represents a tolerance parameter for k = 1, 2. It is related to the lower and upper bounds of the conditional probabilities p_i^k, and it holds when the p_i^k are uniformly bounded away from zero and one. Assumption 2 imposes dependence conditions on the conditional covariance matrix of the responses. Note that q_y ≤ q by simple calculations. If the responses y_i^k for k = 1, …, q are uncorrelated given x_i, then the assumption is satisfied, for example, by taking q_y = 1/4. More generally, it holds if the covariance matrix Σ_x is sparse, in the sense that |cov(y_i^k, y_i^{k′} | x)| decays suitably in |k − k′| for any i ∈ {1, …, n}, k, k′ ∈ {1, …, q}, and x. Thus, one easily sees that Assumption 2 is fulfilled for weakly dependent binary variables such as m-dependent and autoregressive logistic processes (e.g., Sim 1993). Assumption 3 guarantees the feasibility of the underlying B* and gives beta-min type conditions imposed on the nonzero singular values and nonzero rows of B*, respectively. A similar condition is also imposed in Bunea, She, and Wegkamp (2012), Chen, Dong, and Chan (2013), and She and Tran (2019). Assumption 4 is an eigenvalue condition, which is commonly assumed in the literature on high-dimensional models (Zheng, Peng, and He 2015). Assumption 5 is also imposed in Zheng, Peng, and He (2015), with a slight modification, to prove the oracle property of their estimate; it restricts the correlations between relevant and irrelevant covariates.
Note that the proposed method enjoys rank consistency without an irrepresentable condition. Rank consistency based on the conventional nuclear norm penalization is known to require an additional irrepresentable condition, as can be found in Bach (2008) and Kong et al. (2020). Our results show that the proposed method, based on a SCAD-type penalty on the singular values, guarantees rank consistency even in the absence of the irrepresentable condition, illustrating the theoretical advantage of the SCAD in the rank selection context. A similar advantage of SCAD over ℓ1-norm penalization (Lasso) has also been investigated in high-dimensional univariate regression (e.g., Loh and Wainwright 2014, 2017). The following theorem presents the theoretical properties of the proposed method.
Theorem 1. Suppose that Assumptions 1–5 hold, sq log p = o(n), and λ_max(Q*_S)/λ_min(Q*_S) = o({n/(sq)}^{1/2}). If λ1 and λ2 satisfy λ1 ≍ {(q ∨ r_S)/n}^{1/2} and λ2 ≍ {q log(q ∨ p)/n}^{1/2}, then the program (3) has a unique stationary point B̂, and it satisfies ‖B̂ − B*‖_F = O_p[{r(q ∨ r_S)/n}^{1/2}], rank(B̂) = r, and row-support(B̂) = S with probability tending to one.

Theorem 1 shows that if the penalty parameters are appropriately chosen, then the proposed method achieves the same convergence rate as the oracle estimator B̃_{r,S}, that is, the constrained MLE using the true rank r and row-support set S, in terms of the Frobenius norm ‖·‖_F. See Theorem S1 in the supplementary materials for the theoretical properties of the constrained MLE. Moreover, Theorem 1 implies consistent simultaneous selection of the rank r and the row space S of relevant covariates. One also sees that although the proposed optimization (3) is nonconvex, there exists a unique solution B̂; more specifically, the proposed estimate B̂ is the unique stationary point of the nonconvex problem (3). It is worth noting that the growth rate conditions sq log p = o(n) and λ_max(Q*_S)/λ_min(Q*_S) = o({n/(sq)}^{1/2}) are only required to prove the uniqueness of the stationary point, as in Kim and Kwon (2012) and Loh and Wainwright (2014, 2017), and rank consistency, as in Bach (2008) and Kong et al. (2020). A similar condition can also be found in Loh and Wainwright (2017) in high-dimensional univariate regression contexts. Without this growth rate condition, only local optimality is guaranteed for B̂. Such growth rate conditions have been assumed in many published studies on the global optimality of nonconvex penalized methods in other settings (Fan and Li 2001; Wang, Wu, and Li 2012; Loh and Wainwright 2014).

Comparisons with Related Results
Here, we provide detailed theoretical error rate comparisons to the related literature, including the existing multivariate reduced rank regression papers and extensions of some penalizations to our multiple logistic settings. We would like to note that a main goal of our study is to achieve selection consistency (of the rank and row support set), which generally requires stronger assumptions than attaining optimal estimation rates.
Even though our main interest, penalization techniques, and model set-ups are different, we consider estimation of XB, that is, ‖XB̂ − XB*‖_F, which is the main focus of some literature (Giraud 2011; Bunea, She, and Wegkamp 2012) in the multivariate response linear regression setting. Bunea, She, and Wegkamp (2012) achieved an error rate of (sr log p + qr)^{1/2}, and She (2017) and She and Tran (2019) achieved an error rate of (qr + sr + s log p)^{1/2} under the setting that the underlying coefficient matrix B* is low-rank and row-wise sparse. Theorem 1 stated above implies that our estimator B̂ satisfies ‖XB̂ − XB*‖_F = O_p[{r(q + r_S)}^{1/2}], which is tighter than the rates obtained in the aforementioned literature because Theorem 1 assumes stronger conditions. If we are interested in estimation consistency instead of selection consistency, a slight modification of the proposed method can lead to ‖XB̂ − XB*‖_F = O_p[(qr + sr + s log p)^{1/2}], as in She (2017) and She and Tran (2019), under milder conditions. We refer to Theorem S2 in the supplementary materials for more details. Under the setting that the underlying coefficient matrix B* is just low-rank, Bunea, She, and Wegkamp (2011) and Bing and Wegkamp (2019) achieved a minimax error rate of {r(q + r_X)}^{1/2}. It is worth noting that the obtained error bound ‖XB̂ − XB*‖_F = O_p[{r(q + r_S)}^{1/2}] corresponds to this minimax error rate when only the relevant covariates in X, that is, X_{·,S}, are included in the low-rank matrix estimation problem.
Theorem 1 also implies that the proposed method has a theoretical advantage over extensions of some existing penalizations to our multivariate response logistic regression setting: regular logistic regression with an ℓ1 penalty (RL) applied to individual responses, and multiple response logistic regression with group Lasso and nuclear norm penalties (MLL), which are also considered in the simulation analysis in Section 3. Using existing theory for the Lasso, we can prove that RL achieves ‖B̂ − B*‖_F = O_p{(sq/n)^{1/2}}. This implies that the proposed method provides a tighter estimation error bound, because it always holds that r(q ∨ r_S) ≤ sq. Specifically, if the rank r is small, that is, r ≪ q ∧ s, the proposed method can provide a substantial rate improvement over RL by considering multiple responses jointly through the low-rank and row-wise sparse structural assumption. Additionally, through a modification of the theory used in Theorem 1, it can be shown that MLL attains the same rate ‖B̂ − B*‖_F = O_p[{r(q ∨ r_S)/n}^{1/2}], but with an additional irrepresentable condition, as mentioned in Section 2.2. Refer to Section 3 for numerical comparisons with RL and MLL.

Simultaneous Selection of Rank and Row Space
We now propose a data-dependent way to select the regularization parameters λ1 and λ2 for the proposed method (3), and prove that the proposed selection criterion jointly chooses the underlying rank r and row support set S with probability tending to one. Recently, the generalized information criterion (GIC), including the Bayesian information criterion, has been used to select the tuning parameter of penalized methods for variable selection in high-dimensional models; for example, Chen and Chen (2008), Fan and Tang (2013), Lee, Noh, and Park (2014), and Zheng, Peng, and He (2015). Accordingly, we define an information criterion suitable for our simultaneous choice problem by modifying the GIC. See also Nishii (1984). Let B̂_{λ1,λ2} be the proposed estimate using penalty parameters λ1 and λ2. We propose to select λ1 and λ2 as the minimizers of the following information criterion:

GIC(λ1, λ2) = (1/n) Σ_{i=1}^n Σ_{k=1}^q ℓ(β̂^k_{λ1,λ2} | x_i, y_i^k) + r̂(ŝ + q) g_n^{(1)} + ŝq g_n^{(2)}, (4)

where ŝ = |row-support(B̂_{λ1,λ2})| and r̂ = rank(B̂_{λ1,λ2}). Here, r̂(ŝ + q) and ŝq are the degrees of freedom corresponding to the choices of the rank r and the row sparsity s, respectively. There are similar computations of the degrees of freedom in the literature, for example, Bai and Ng (2002) and Huang, Breheny, and Ma (2012).
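The criterion above can be sketched as follows (our own illustration; the 1/n scaling of the likelihood term and the helper name `gic` are assumptions, and the default g-sequences follow the specification used later in the article):

```python
import numpy as np

def gic(nll, n, p, q, s_hat, r_hat, g1=None, g2=None):
    """Sketch of the GIC in (4): scaled negative log-likelihood plus
    complexity penalties r_hat*(s_hat + q)*g1 and s_hat*q*g2.
    Defaults: g1 = g2 = log(log n) * log(p) / n."""
    if g1 is None:
        g1 = np.log(np.log(n)) * np.log(p) / n
    if g2 is None:
        g2 = g1
    return nll / n + r_hat * (s_hat + q) * g1 + s_hat * q * g2
```

With the likelihood term held fixed, a more complex model (larger r̂ or ŝ) pays a strictly larger complexity price, which is what drives the selection consistency discussed below.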
For selection of the row support set S, we set an upper bound on the size of the row support set, denoted by U(n), with s < U(n) < p; that is, we only consider B satisfying |row-support(B)| ≤ U(n). Similar assumptions restricting the model dimension have been widely used in related studies, for example, Fan and Tang (2013) and Zheng, Peng, and He (2015). For d ≤ p, define r_d = max_{A⊆{1,…,p}: |A|=d} rank(X_{·,A}). It holds that r_d ≤ r_X ∧ d. Following Zheng, Peng, and He (2015), let Λ_max be the uniform upper bound of the eigenvalues of X_{·,S̃}^T X_{·,S̃}/n over |S̃| ≤ U(n), that is, Λ_max := sup_{δ∈R^p, ‖δ‖_0 ≤ U(n), δ≠0} δ^T X^T X δ/(n‖δ‖²) < ∞. To discuss the selection consistency of the proposed information criterion (4), we introduce some notation for mis-specified models for (r, S), that is, overfitted and underfitted models: an overfitted model corresponds to (λ1, λ2) ∈ Z1 ∪ Z2, and an underfitted model to (λ1, λ2) ∈ Z3 ∪ Z4, where the sets Z1–Z4 collect the tuning parameters that overselect or underselect the rank and row support, respectively. Theorem 2 below implies that λ1 and λ2 chosen by the proposed information criterion provide an estimate with the correct rank r and the correct row support set S with high probability.
Theorem 2. Suppose that the conditions in Theorem 1 hold and min_{j∈S} ‖B*_{j,·}‖_2 / [{U(n) + s + q} q log p/n]^{1/2} → ∞. Let B̂_{λ*_1,λ*_2} be the proposed estimate using the optimal penalty parameters λ*_1 and λ*_2 satisfying the conditions in Theorem 1. If g_n^{(1)} → 0, g_n^{(2)} → 0, g_n^{(1)} n/log p → ∞, and g_n^{(2)} n/log p → ∞, then, as n, p, q → ∞, the minimizers (λ̂1, λ̂2) of the GIC in (4) satisfy Pr{rank(B̂_{λ̂1,λ̂2}) = r and row-support(B̂_{λ̂1,λ̂2}) = S} → 1.

Theorem 2 suggests that wide ranges of g_n^{(1)} and g_n^{(2)} work in theory. This includes the specification g_n^{(1)} = g_n^{(2)} = log(log n) log p/n as a special case. In our simulation and real data examples, we set g_n^{(1)} = g_n^{(2)} = log(log n) log p/n, and we observed that this choice performs well.
The beta-min (minimum signal strength) conditions and the restrictions on the rank r and the sparsity parameter s in Theorem 2 are essential for the rank and row selection consistency of the GIC-type penalty. Note that She (2017) and She and Tran (2019) proposed a new class of information criteria, called the PIC, which achieves the minimax optimal error rate without the beta-min conditions. However, beta-min conditions are still required to guarantee row support recovery via the PIC, as in Theorem 5 of She and Tran (2019). Among the different types of PIC, we choose a criterion involving two absolute constants A1, A2 > 0, which can be determined by Monte Carlo experiments, to compare with our GIC. Following She and Tran (2019), we set A1 = 2 and A2 = 1.8, and we checked that this specification works well in our numerical experiments.
For comparisons of the proposed GIC and PIC via simulation analyses, see Section 3.4.

Joint Ising Models for Multiple Binary Responses
In this section, we consider joint Ising models (Ising 1925) for multivariate binary data. These models account for the conditional dependency relationships between the outcome variables, in addition to the marginal structure modeled in Section 2.1. Following Cheng et al. (2014), we assume a sparse covariate-dependent Ising model to study both the conditional dependency within the binary data and its relationship with the additional covariates. Let y_i^{−k} denote the responses other than y_i^k. We assume that the log-odds depend on the covariates through

log[Pr(y_i^k = 1 | x_i, y_i^{−k}) / Pr(y_i^k = 0 | x_i, y_i^{−k})] = x_i^T (β*)^k + Σ_{l≠k} x_i^T (θ*)^{l,k} y_i^l. (6)

Compared to model (1), model (6) includes the additional term Σ_{l≠k} x_i^T (θ*)^{l,k} y_i^l, through which dependencies between responses are modeled. Let Θ* collect the interaction vectors (θ*)^{l,k} ∈ R^p. Specifically, the vector (θ*)^{k,l} being zero implies that y_i^k and y_i^l are conditionally independent given x_i and y_i^t for t ≠ k, l, and (θ*)^{k,l}_j = 0 implies that the conditional association between y_i^k and y_i^l does not depend on x_{ij} (Cheng et al. 2014). We assume that S = row-support(B*) = {1, …, s} with |S| = s ≪ p, rank(B*) = r ≤ min(s, q), and F = support(Θ*) with |F| = f ≪ p(q − 1)q. As in the marginal case, a bound on the row norms of the parameters is a sufficient condition for an analogue of (2); thus, we consider the corresponding feasible set for (B, Θ). We can then consider the following optimization: minimize, over this feasible set, the negative joint log-likelihood plus

Σ_{l=1}^{p∧q} P_{λ1}{σ_l(B)} + Σ_{j=1}^p P_{λ2}(‖B_{j,·}‖_2) + Σ_{j,k} P_{λ3}(|Θ_{jk}|), (7)

where λ1, λ2, λ3 > 0 are regularization parameters and Θ_{jk} is the (j, k)th component of the matrix Θ. For the penalty functions P_{λj}(·), we use SCAD functions. For the choice of λ = (λ1, λ2, λ3), we consider a GIC that generalizes the GIC in (5) by adding a complexity term for f̂, where ŝ, r̂, and f̂ are the estimates of s, r, and f obtained from B̂_λ and Θ̂_λ.
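To fix ideas, the conditional probability implied by model (6) can be sketched as follows (our own illustration; the (q, q, p) array storage for the theta vectors is a choice we make here, not the article's):

```python
import numpy as np

def ising_cond_prob(k, x, y, B, Theta):
    """Conditional probability Pr(y^k = 1 | x, y^{-k}) under the joint
    Ising model (6): the log-odds equal x^T beta^k plus the sum over
    l != k of (x^T theta^{l,k}) * y^l.  Theta is a (q, q, p) array with
    Theta[l, k] = theta^{l,k}; the diagonal entries are unused."""
    z = x @ B[:, k]
    q = B.shape[1]
    for l in range(q):
        if l != k:
            z += (x @ Theta[l, k]) * y[l]
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
p, q = 4, 3
x = rng.standard_normal(p)
y = np.array([1, 0, 1])
B = rng.standard_normal((p, q))
Theta = np.zeros((q, q, p))                      # no interactions ...
p0 = ising_cond_prob(0, x, y, B, Theta)
marginal = 1.0 / (1.0 + np.exp(-(x @ B[:, 0])))  # ... so (6) reduces to (1)
```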
The following theorem presents the theoretical properties of the proposed penalization incorporating the Ising models. We make a new assumption for the joint Ising models that replaces Assumption 1 of the marginal models. Assumption S1, presented in the supplementary materials, is the restricted strong convexity condition for the Ising models, corresponding to Assumption 1 for the proposed marginal models.
Here, the corresponding quantity divided by n represents the Fisher information matrix related to the kth response.
Theorem 3. Suppose Assumptions 2-5 and Assumption S1 hold, and the regularization parameters λ1, λ2, and λ3 satisfy the stated rate conditions with λ3 ≍ {log(p ∨ q)/n}^{1/2}. Then the program (7) has a unique stationary point (B̂, Θ̂), which satisfies the stated bounds. Theorem 3 shows that if the penalty parameters are appropriately chosen, then the proposed method incorporating the Ising models leads to consistent simultaneous selection of the rank r, the row space S, and the support set F. Although the optimization (7) is nonconvex, the proposed estimate (B̂, Θ̂) is the unique stationary point of the nonconvex problem (7).

Methods for Comparison
We conduct simulation studies to investigate the finite-sample performance of the proposed method. Specifically, to illustrate the advantages of the proposed estimator in the multivariate logistic regression framework, we compare it with two other logistic regression methods:
• Multiple-response Logistic regression with SCAD penalties (MLS; proposed)
• Multiple-response Logistic regression with group Lasso and nuclear norm penalties (MLL)
• Regular Logistic regression with ℓ1 penalty (RL)
Here, MLS is the proposed estimate in (3); MLL is the estimate using nuclear norm and group Lasso penalties, that is, the estimate of (3) with the penalty λ1 Σ_l σ_l(B) + λ2 Σ_j ‖B_{j,·}‖_2; and RL is logistic regression with the Lasso penalty, fitted to each response separately. We consider a setting where the underlying B* is both low-rank and row-wise sparse. For the regularization parameters in MLL and MLS, we choose λ1 and λ2 using the proposed generalized information criterion as in Section 2.3. We set U(n) = [n/log p] and b = −logit(0.01)/X_max, which satisfies (2) with the corresponding constant set to 0.01. Note that RL involves only one regularization parameter λ, which we choose using the Bayesian information criterion. This comparison allows us to assess the performance of each method with appropriately chosen regularization parameters.
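For reference, the SCAD penalty used by MLS, applied to both the singular values and the row norms of B, can be sketched as follows. The default a = 3.7 is the standard choice of Fan and Li (2001); the penalty assembly for a matrix B is an illustrative sketch of the structure in (3), not the authors' code.

```python
import numpy as np

def scad(t, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001): linear near zero, a quadratic
    transition, then constant, so large coefficients are not over-shrunk."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(
        t <= lam,
        lam * t,                                            # |t| <= lam
        np.where(
            t <= a * lam,                                   # lam < |t| <= a*lam
            (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
            lam**2 * (a + 1) / 2,                           # |t| > a*lam
        ),
    )

def mls_penalty(B, lam1, lam2, a=3.7):
    """Sketch of the MLS penalty: SCAD applied to the singular values of B
    (rank part) plus SCAD applied to the row l2-norms of B (sparsity part)."""
    sing = np.linalg.svd(B, compute_uv=False)
    rows = np.linalg.norm(B, axis=1)
    return scad(sing, lam1, a).sum() + scad(rows, lam2, a).sum()
```

Replacing SCAD on the singular values by the identity and SCAD on the row norms by the identity recovers the convex nuclear norm plus group Lasso penalty of MLL.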

Simulation Model
In our simulation study, we consider the multiple-response logistic model, where the underlying coefficient matrix is both low-rank and row-wise sparse: β*_0 is the intercept coefficient, the first s + 1 rows of the full coefficient matrix are nonzero, and the remaining p − s rows of B* are set to zero. We set β*_0 = (0.5, . . ., 0.5) ∈ R^q, and the nonzero entries in [(β*_1)^T, . . ., (β*_s)^T]^T are generated by singular value thresholding of a matrix B* ∈ R^{s×q}, so that the rank of B*, excluding β*_0, is r, where each entry of B* follows the mixture distribution (1/2) Uniform(0, 5) + (1/2) Uniform(−5, 0). Specifically, for the singular value decomposition B* = UDV^T, D ∈ R^{(s∧q)×(s∧q)} is a diagonal matrix whose diagonal components are the singular values of B* sorted in nonincreasing order. We set B*_r = U_r D_r V_r^T, where U_r and V_r are the first r columns of U and V, respectively, and D_r is the upper-left r-by-r submatrix of D. The dimensions are chosen from n ∈ {400, 800}, p ∈ {400, 800}, s ∈ {10, 20}, r ∈ {5, 10}, and q = 25. Note that we consider the datasets X̃ = [1_n, X] ∈ R^{n×(p+1)} and Y ∈ R^{n×q}. When generating X, each row of X follows a multivariate normal distribution MN(0_p, Σ), where Σ = (0.5^{|i−j|})_{1≤i,j≤p}. Thus, a total of 16 different simulation cases are considered, with varying data distributions and model parameters.
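The construction of the low-rank, row-sparse coefficient matrix above can be sketched as follows; the mixture draw and the rank-r truncation follow the description, but this is an illustrative sketch rather than the authors' simulation code.

```python
import numpy as np

rng = np.random.default_rng(0)
s, q, r, p = 10, 25, 5, 400

# Each entry: with prob. 1/2 from Uniform(0, 5), else from Uniform(-5, 0).
signs = rng.random((s, q)) < 0.5
B = np.where(signs, rng.uniform(0, 5, (s, q)), rng.uniform(-5, 0, (s, q)))

# Singular value thresholding: keep only the r leading singular triplets.
U, d, Vt = np.linalg.svd(B, full_matrices=False)
B_r = U[:, :r] @ np.diag(d[:r]) @ Vt[:r, :]   # rank-r nonzero block

# Embed as the first s rows of the (p x q) coefficient matrix;
# the remaining p - s rows stay zero (row-wise sparsity).
B_star = np.zeros((p, q))
B_star[:s] = B_r
```

The intercept row β*_0 = (0.5, . . ., 0.5) is handled separately and is excluded from the rank-r truncation, as stated above.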

Simulation Results
To compare the performance of the different methods, we consider the identification accuracy of the nonzero rows and the rank, and the estimation accuracy of the estimates, in terms of the following criteria:
• Relative estimation error ‖B̂ − B*‖_F / ‖B*‖_F between the underlying B* and the estimate B̂
• PUS: Percentage of Under-fitted models in terms of the row Support set, that is, S \ Ŝ ≠ ∅
• PCS: Percentage of Correctly fitted models in terms of the row Support set, that is, Ŝ = S
• POS: Percentage of Over-fitted models in terms of the row Support set, that is, S ⊊ Ŝ
• PCR: Percentage of Correctly fitted models in terms of the Rank, that is, r̂ = r
• Average of the estimated rank r̂
The performance measures are averaged over 100 replications for each specification. It is worth noting that, in all 100 replications, the true support set and true rank are included in the sequence of candidate models. The proposed information criterion is used to select the tuning parameters for MLS and MLL.
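The row-support criteria can be sketched as a small classifier over index sets, together with the relative error; these are hypothetical helpers mirroring the PUS/PCS/POS definitions above.

```python
import numpy as np

def support_status(S_hat, S_true):
    """Classify an estimated row support against the truth:
    'under'  : some true row is missed (S_true \\ S_hat nonempty)  -> PUS,
    'correct': exact recovery (S_hat == S_true)                    -> PCS,
    'over'   : S_true is a strict subset of S_hat                  -> POS."""
    S_hat, S_true = set(S_hat), set(S_true)
    if S_true - S_hat:
        return "under"
    if S_hat == S_true:
        return "correct"
    return "over"

def relative_error(B_hat, B_star):
    """Relative estimation error ||B_hat - B*||_F / ||B*||_F."""
    return np.linalg.norm(B_hat - B_star) / np.linalg.norm(B_star)
```

Averaging the indicator of each status over the 100 replications yields the reported percentages.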
Table S1 in the supplementary materials displays the average relative estimation errors, with the corresponding standard deviations in parentheses, for the three methods over the 16 cases. The proposed method MLS achieves smaller relative estimation errors than MLL and RL in most cases. This suggests that MLS with appropriately chosen regularization parameters can improve estimation accuracy over the other two methods when the underlying coefficient matrix is both row-wise sparse and low-rank. Table 1 and Table S2 in the supplementary materials record the row support identification performance of MLS, MLL, and RL for n = 400 and n = 800, respectively. RL tends to produce over-fitted models because it does not incorporate the multiple responses jointly. Overall, MLS performs best in row support estimation accuracy, which demonstrates that the reduced-rank SCAD penalty does not negatively affect the row-sparsity SCAD penalty. It also demonstrates that, by pooling information across responses and using SCAD penalties for row support selection, the proposed method can boost the power of correct model selection. Table 2 and Table S3 in the supplementary materials record the rank estimation performance of MLS and MLL for n = 400 and n = 800, respectively. For the average rank, we also record standard deviations in parentheses. Because RL does not involve rank-based regularization, it is not included in this comparison. MLS gives more accurate rank estimates on average than MLL in all cases, and it also outperforms MLL in correct-rank accuracy in all cases. This demonstrates that MLS with appropriately chosen regularization parameters improves rank accuracy.

Comparisons of GIC and PIC
In this section, we compare the proposed GIC with the existing PIC in terms of rank and row support selection. For the simulation settings, we generate data as in Section 3.2 and follow She and Tran (2019) in generating the underlying coefficient matrix B*. Specifically, we construct the coefficient matrix from factor matrices A_0 and A_1 whose entries are iid standard Gaussian. We consider the following six simulation settings, with varying model parameters: (a) n = 400, p = 200, q = 10, s = 10, r ∈ {3, 5}; (b) n = 400, p = 400, q = 10, s = 15, r ∈ {3, 5}; (c) n = 400, p = 600, q = 15, s = 20, r ∈ {3, 7}.
Tables S4 and S5 in the supplementary materials record the comparisons of GIC and PIC in terms of row support accuracy and rank estimation accuracy, respectively. Overall, both GIC and PIC recover the row support accurately, although PIC tends to overfit relative to GIC. Regarding rank estimation, GIC provides accurate results except when (n, p, q, s, r) = (400, 600, 15, 20, 7). In that case, GIC tends to underestimate the underlying rank r, which may be because the log p term in GIC over-penalizes the rank.

The Dataset
With the development of high-throughput technologies in recent decades, various studies have been conducted to build genomic predictors of drug response from large panels of cancer cell lines (Staunton et al. 2001; Barretina et al. 2012; Garnett et al. 2012). The CCLE dataset offers a comprehensive collection of gene expression and anti-cancer drug responses across cell lines (Barretina et al. 2012; Garnett et al. 2012). This dataset contains drug treatment responses for 24 drugs on 482 cancer cell lines, where the transcription profile of each cell line is characterized by the measured expression levels of 18,988 probes. These data play an important role in drug discovery for the screening of drug candidates (Blanco, Frigola, and Aloy 2018). They are widely used to understand cancer biology and to test the efficacy of novel therapies (Sharma, Haber, and Settleman 2010) due to their cost-effectiveness and unlimited replicative nature (Ferreira, Adega, and Chaves 2013).
For the cancer cell line data analysis, standard logistic regression can be applied by combining multiple binary outcomes into a single categorical response (Greenlund et al. 2005; Hayes et al. 2006). The main limitation of this approach is that it may hide potential outcome-specific covariate effects, leading to inefficient estimates. Another approach is to fit separate logistic models for each response (Koziol-McLain, Coates, and Lowenstein 2001; Cumyn, French, and Hechtman 2009). Although the effects of the covariates on each response can then be distinguished, this approach does not consider the potential relationships among the multiple responses. When analyzing the cancer cell line encyclopedia data, the binary outcomes of multiple drugs are likely to be correlated within the same cell line (Fitzmaurice et al. 1995; France and Velanovich 2009; Carpenter 2010). Because multiple drug responses are available for each individual cancer cell line, there could be an information gain from jointly considering all the drug responses for the same cell line. Thus, one could consider multiple-response logistic regression models that exploit information on the common gene expression and the multiple drug response data derived from each cell line. Despite this potential advantage, multiple-response regression based on a reduced set of predictors has not been investigated for logistic modeling, either theoretically or practically, in the literature.
In this analysis, we consider binary drug responses defined as sensitive or resistant for each pair of a drug and a treated cell line. We divide the cell lines into two groups based on their continuous growth inhibition values (IC50): cell lines with an IC50 value less than 0.5 are assigned to the "sensitive" group, and the remaining cell lines to the "resistant" group. We first select the 3000 probes with the largest variances across the 482 cell lines and then choose the 450 probes with the largest correlation coefficients with the IC50 values. Thus, we consider n = 482 cell lines, p = 450 features, and q = 24 drugs.
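The screening and binarization pipeline can be sketched as follows, with synthetic data and smaller cutoffs standing in for the 3000/450 probe filters. Ranking the second filter by the maximum absolute correlation across the q drugs is our reading of the description, not a detail stated in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p_full, q = 482, 1000, 24          # p_full shrunk for illustration
k1, k2 = 300, 50                       # stand-ins for 3000 and 450
X_all = rng.normal(size=(n, p_full))   # probe expression levels
ic50 = rng.normal(0.5, 1.0, size=(n, q))

# Step 1: keep the k1 probes with the largest variance across cell lines.
keep1 = np.argsort(X_all.var(axis=0))[::-1][:k1]
X1 = X_all[:, keep1]

# Step 2: keep the k2 probes most correlated with IC50
# (here: largest max |corr| over the q drugs -- an assumption).
C = np.corrcoef(np.hstack([X1, ic50]).T)[:k1, k1:]
keep2 = np.argsort(np.abs(C).max(axis=1))[::-1][:k2]
X = X1[:, keep2]

# Step 3: binarize -- IC50 < 0.5 is "sensitive" (coded 1).
Y = (ic50 < 0.5).astype(int)
```

With the paper's cutoffs (k1 = 3000, k2 = 450) and the real expression matrix, the same three steps yield the n = 482, p = 450, q = 24 design described above.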

Results
For prediction accuracy, we consider five methods: the three logistic regression methods considered above (MLS, MLL, and RL), a linear Support Vector Machine (SVM), and a Support Vector Machine with a Gaussian kernel (SVMG) using the optimal parameter. To evaluate the prediction accuracy of each method, we perform 10-fold CV. The cancer cell lines are randomly divided into 10 roughly equal-sized subsets; nine subsets are used as the training set to estimate the model, and the remaining subset is used as the test set to assess prediction accuracy. The performance of each method is summarized by the proportion of accurately predicted cell lines (PA) and the area under the ROC curve (AUC), the latter capturing the tradeoff between the true positive rate and the false positive rate. Figure 1 displays the overall ROC curves and the average AUC and PA values, with one standard deviation, for the five methods. Here, we include all the cancer cell lines and drugs when computing the performance measures. The proposed multiple logistic regression, MLS, has higher AUC and PA values than the other methods, suggesting better performance on average.
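The two performance measures can be sketched as plain NumPy helpers, computing the AUC via the rank (Mann-Whitney) identity so that no plotting library is needed; this mirrors the evaluation described above, not the authors' exact code.

```python
import numpy as np

def auc(y_true, scores):
    """Area under the ROC curve via the Mann-Whitney identity: the
    probability that a random positive scores above a random negative,
    counting ties as 1/2."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    gt = (pos[:, None] > neg[None, :]).sum()
    eq = (pos[:, None] == neg[None, :]).sum()
    return (gt + 0.5 * eq) / (len(pos) * len(neg))

def pa(y_true, y_pred):
    """PA: proportion of accurately predicted cell lines."""
    return np.mean(y_true == y_pred)

def kfold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and split into k roughly equal-sized folds."""
    perm = np.random.default_rng(seed).permutation(n)
    return np.array_split(perm, k)
```

For each fold, the model is fit on the other nine folds and auc/pa are evaluated on the held-out fold; the reported values are the averages over the 10 folds.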
Figure 2 illustrates the average prediction accuracy for each drug. Note that, when computing the AUC for each drug, a test dataset containing only a single response value for that drug is excluded because the ROC curve and AUC cannot be computed in that case. Based on the AUC, MLS and SVMG deliver superior predictions compared to the other methods for the majority of drugs. Specifically, MLS, SVMG, and SVM perform best for 13, 7, and 4 of the 24 drugs, respectively. Based on PA, MLS generally outperforms the other methods, suggesting that it may provide superior prediction for the extremely imbalanced test data excluded from the AUC calculation.
Next, we investigate the performance for each pair of drug and cancer type separately. In particular, we consider the following 16 cancer types that include at least 10 cell lines in the CCLE data: Kidney, Liver, Pancreas, Ovary, Lung, Skin, Lymphoid, Intestine, Stomach, Nervous, Esophagus, Bone, Breast, Endometrium, Ganglia, and Soft. For this study, we only consider PA because, for a specific cancer type and drug combination, the test data frequently include a single response value for a drug, so that the AUC cannot be computed. Figures 3 and S1-S3 in the supplementary materials display heatmaps of the average PA values for each pair of cancer type and drug. We observe that MLS achieves greater or similar PA values compared to the other methods in many cases, which suggests the benefit of jointly combining different drugs for prediction.
We observe that for certain compounds (e.g., 17-AAG, Topotecan, PD-0325901), all the classification methods consistently have smaller PA values across the 16 cancer types, suggesting that prediction accuracy may depend on the specific compound. Figure S4 displays the standard deviations of the PA values over the 16 cancer types. The standard deviation for "17-AAG," "Irinotecan," and "AZD6244" is near 0.1, whereas that of "PHA-665752," "Nutlin-3," and "PD-0332991" is near 0.01, for all the methods. These results imply that the variation of prediction accuracy across cancer types depends on the drug considered.
Among the drugs considered, "Lapatinib" is an orally active drug for breast cancer, and its treatment has been found to induce prevalent resistance (Liu et al. 2009; Wang et al. 2011). The proposed MLS provides a more accurate prediction for the breast cancer cell lines treated with "Lapatinib." "17-AAG" is another effective compound for breast cancer (Modi et al. 2011), and MLS likewise provides a higher PA for the breast cancer cell lines for this compound. In addition, "PD-0332991" can be used for the treatment of ER-positive and HER2-negative breast cancer; in this case, all the methods predict the response well for the breast cancer cell lines.
Regarding drugs for lung cancer treatment, "PF2341066," an anti-cancer drug acting as an ALK (anaplastic lymphoma kinase) and ROS1 (c-ros oncogene 1) inhibitor (Forde and Rudin 2012; Roberts 2013), is approved for lung cancer in the United States and other countries. "TAE684" also effectively inhibits cell proliferation and is used for the treatment of lung cancer. For the lung cancer cell lines, MLS and SVMG predict the responses to these two compounds well compared with the other methods. Regarding drugs related to other cancer types, "Erlotinib" is used for both lung cancer and pancreatic cancer; for the lung and pancreatic cancer cell lines, all the methods predict the response to "Erlotinib" well. As studied in Scotlandi et al. (2005), "AEW541" is a growth factor-I receptor kinase inhibitor applied to musculoskeletal tumors, such as cancers that develop in bone or soft tissue. For the bone and soft-tissue cancer cell lines, all the methods have similar prediction accuracy for the response to "AEW541." As demonstrated in Petrucci et al. (2012), "LBW242" can be used for the treatment of ovarian cancer; for the ovarian cancer cell lines, all the methods predict the response to "LBW242" similarly. For pancreatic cancer, the administration of "AZD6244" can significantly inhibit tumor growth in the pancreatic tumor xenograft model (Yeh et al. 2007). MLS and SVMG predict the response to "AZD6244" well compared with the other methods for the pancreatic cancer cell lines.
Note that some of the 24 drugs in the CCLE data have been used for multiple cancer types. For example, "Topotecan" is used to treat ovarian cancer when other treatments have failed and is also used to treat lung cancer. For the lung cancer cell lines, MLS predicts the response to "Topotecan" well compared with the other methods; for the ovarian cancer cell lines, SVMG predicts it best, with MLS second best. "Paclitaxel" is used for the treatment of breast, ovarian, lung, and other solid-tumor cancers. For the breast cancer cell lines, MLS predicts the response to "Paclitaxel" well; for the lung cancer cell lines, all the methods predict similarly; for the ovarian cancer cell lines, however, SVM predicts best. Overall, for drugs known to be effective for the treatment of certain cancer types, the proposed MLS predicts the drug response well in many cases, or comparably to the other methods.
Further, we apply the Ising model proposed in Section 2.4 to investigate the dependency structures among the drugs. Because the prediction accuracy of the proposed Ising model is similar to that of MLS, we focus on the conditional associations between drugs. Specifically, for each pair of drugs k and l, we record Σ_{j=1}^p (θ̂_j^{k,l})^2, which represents the degree of dependency between drugs k and l across all the genes. These values are shown as a heatmap in Figure 4, where white cells between two drugs indicate weak associations and black cells indicate strong associations.
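The per-pair dependency degree can be computed as below; we take Theta_hat to be a (q, q, p) array with Theta_hat[k, l] the estimated coefficient vector θ̂^{k,l} for the (k, l) drug pair, which is a hypothetical layout for illustration.

```python
import numpy as np

def dependency_degrees(Theta_hat):
    """Degree of conditional dependence between each pair of drugs:
    the sum over genes j of Theta_hat[k, l, j]**2, i.e. the quantity
    recorded in the Figure 4 heatmap."""
    return (Theta_hat ** 2).sum(axis=2)
```

A zero entry in the resulting (q, q) matrix corresponds to a white cell in the heatmap, that is, an absent conditional association between the two drugs.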
Specifically, TAE684 and PF2341066 have an estimated dependency of 0.97, the highest value. Both TAE684 and PF2341066 are kinase inhibitors, their mechanism of action is known to be ALK inhibition, and both were synthesized following published procedures (Galkin et al. 2007; Li et al. 2011). RAF265 and L-685458, both kinase inhibitors, have the second highest dependency value of 0.88. TAE684 and AEW541, also both kinase inhibitors, have the third highest dependency of 0.85. For ZD-6474, only Erlotinib has a significant nonzero value; both ZD-6474 and Erlotinib are EGFR inhibitors, and ZD-6474 is grouped only with Erlotinib in the results of Wei et al. (2019). For PD-0332991, only Nilotinib has a nonzero value, implying that PD-0332991 may be associated only with Nilotinib, which is consistent with the clustering results of Wei et al. (2019). For PHA-665752, only PF2341066 and Topotecan have nonzero values. Note that both PHA-665752 and PF2341066 are small-molecule inhibitors related to c-Met (Zillhardt, Christensen, and Lengyel 2010), and PHA-665752 and Topotecan are grouped together in Wei et al. (2019). For Lapatinib, AZD0530 has the highest value, 0.68; these drugs are EGFR inhibitors. For AZD0530, Nilotinib has the highest value, 0.82. Both Nilotinib and AZD0530 are related to BCR-ABL (Sabitha 2012): Nilotinib inhibits several tyrosine kinases including BCR-ABL, and AZD0530 targets BCR-ABL kinase activity and reduces leukaemic maintenance by BCR-ABL.

Discussion
Although recent years have seen the adoption of SCAD as a popular penalization approach for more parsimonious and consistent model selection in a wide range of high-dimensional settings, there has been little work on rank selection using SCAD in the high-dimensional matrix literature. In this article, we considered a SCAD-type penalization coupled with an information criterion to achieve both rank and row selection consistency. More specifically, we introduced a novel sparse reduced-rank regression with correlated multiple binary responses. We proposed a new penalized likelihood method using a SCAD version of the nuclear norm, developed the associated novel high-dimensional theory, and conducted comprehensive data analyses. We found that the SCAD penalty function may also benefit low-rank matrix estimation in other high-dimensional settings. Our theory requires milder conditions than the strong irrepresentable condition assumed for ordinary nuclear norm penalization (Bach 2008; Kong et al. 2020). To account for the dependencies among binary responses, we also developed a novel estimation procedure and theory for an Ising model in the multiple-response logistic regression setting.
As matrix responses are commonly encountered in a wide range of applications, as also seen in Kong et al. (2020), a natural extension of our work is to develop simultaneously low-rank and row-sparse procedures for high-dimensional matrix responses. It is also of interest to extend the sure independence screening procedure of Kong et al. (2020) to binary matrix response settings, which may facilitate ultra-high dimensional data analysis.

Figure 1. Performance of the five methods. All cancer cell lines and drugs are included when computing the performance measures.

Figure 2. Performance of the five classification methods. Responses for each drug across all cancer cell lines are included when computing the performance measures.

Figure 3. Performance of the five methods for the four selected cancer types: "Kidney," "Liver," "Pancreas," and "Ovary." For each pair of drug and cancer type, the corresponding cancer cell lines and drugs are included when computing the performance measures.

Table 1. Support set accuracy for the three methods when n = 400.

Table 2. Rank estimate accuracy for the two methods when n = 400.