A Regression Modeling Approach to Structured Shrinkage Estimation

Abstract Problems involving the simultaneous estimation of multiple parameters arise in many areas of theoretical and applied statistics. A canonical example is the estimation of a vector of normal means. Frequently, structural information about relationships between the parameters of interest is available. For example, in a gene expression denoising problem, genes with similar functions may have similar expression levels. Despite its importance, structural information has not been well-studied in the simultaneous estimation literature, perhaps in part because it poses challenges to the usual geometric or empirical Bayes shrinkage estimation paradigms. This article proposes that some of these challenges can be resolved by adopting an alternate paradigm, based on regression modeling. This approach can naturally incorporate structural information and also motivates new shrinkage estimation and inference procedures. As an illustration, this regression paradigm is used to develop a class of estimators with asymptotic risk optimality properties that perform well in simulations and in denoising gene expression data from a single cell RNA-sequencing experiment.


Introduction
This article studies shrinkage estimation problems where multiple parameters are simultaneously estimated under a risk function that aggregates the individual errors (Robbins 1951). In the typical formulation, the observed data consist of independent random variables Y_i, i = 1, . . . , n, with Y_i ∼ F_{θ_i}. The distributions F_{θ_i} are indexed by the parameters of interest θ_i, and the goal is to estimate the entire vector θ = (θ_1, . . . , θ_n)'. Let δ_i(Y) be a decision rule that uses the observed data Y = (Y_1, . . . , Y_n)' to estimate θ_i, and let L(θ_i, δ_i) be its associated loss. The problem is to find a decision rule δ = (δ_1, . . . , δ_n)' that minimizes the aggregate risk function

R_n(θ, δ) = n^{-1} Σ_{i=1}^n E{L(θ_i, δ_i(Y))},   (1)

where the θ_i are fixed constants and the expectation is taken over Y. A canonical example is the Gaussian sequence problem (Johnstone 2017), where Y_i ∼ N(θ_i, 1) and L(θ_i, δ_i) = (θ_i − δ_i)^2. Such problems arise in many areas of theoretical and applied statistics, including nonparametric function estimation (Tsybakov 2008; Cai 2012), multiple hypothesis testing (Sun and Cai 2007), and statistical genomics (Smyth 2005; Robinson, McCarthy, and Smyth 2010; Love, Huber, and Anders 2014).
Their key characteristic is that the decision rule that minimizes the individual risk of estimating each single θ_i may not be the one that minimizes the aggregate risk (Robbins 1951). For example, in the Gaussian sequence problem, Stein (1956) showed that the rule δ_i(Y) = Y_i, which is optimal for estimating a single θ_i, is inadmissible for the simultaneous estimation problem when n ≥ 3. This article focuses on settings where structural information about relationships between the θ_i is available. Under existing paradigms, as reviewed in Section 2.2, methods that can incorporate such information were not available until recently, when Greenshtein, Mantzura, and Ritov (2019) proposed a strategy for fitting structural information into the empirical Bayes approach.
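The inadmissibility phenomenon can be made concrete with a small simulation. The sketch below is illustrative and not taken from the article; the sample size and the spread of the means are arbitrary choices. It compares the aggregate squared-error risk of the rule δ_i(Y) = Y_i with that of the James-Stein estimator shrinking toward the origin:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 2000
theta = rng.normal(0.0, 1.0, size=n)  # fixed (nonrandom) parameter vector

mle_loss = js_loss = 0.0
for _ in range(reps):
    y = theta + rng.normal(size=n)                 # Y_i ~ N(theta_i, 1)
    js = (1.0 - (n - 2) / np.sum(y**2)) * y        # James-Stein, shrinking toward 0
    mle_loss += np.mean((y - theta) ** 2)
    js_loss += np.mean((js - theta) ** 2)

# The averaged James-Stein aggregate loss is noticeably smaller than that of Y
print(mle_loss / reps, js_loss / reps)
```

Both rules estimate each θ_i well individually, but only the nonseparable James-Stein rule exploits the aggregate structure of the risk.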
This article proposes that some of the challenges raised by structural information can be resolved by adopting a third, less studied paradigm of shrinkage estimation. Originally developed by Stigler (1990), this paradigm views simultaneous estimation as a type of regression problem. This perspective can naturally accommodate both structural and covariate information and provides an alternate interpretation of the empirical Bayes approach of Greenshtein, Mantzura, and Ritov (2019). Furthermore, the connection to regression motivates new shrinkage estimation and even inference procedures that are unattainable under the two more dominant paradigms.
Section 2 reviews relevant work, and the regression paradigm is introduced and illustrated in Section 3. Sections 4 and 5 present theoretical and numerical results, respectively, and Section 6 applies the proposed strategy to denoise single-cell RNA sequencing data. The article concludes with a discussion in Section 7; proofs can be found in the supplementary materials. All methods are implemented in the R package cole.

Common Simultaneous Estimation Paradigms
Specialized approaches for some classes of simultaneous estimation problems have been developed, especially for sparse θ (Donoho and Johnstone 1994, 1995; Castillo and van der Vaart 2012; Martin and Walker 2014; Zhang and Bhattacharya 2017). Otherwise, most existing shrinkage estimators are viewed under two paradigms: geometric shrinkage or empirical Bayes. The geometric paradigm (Stein 1956, 1962) views estimators as shrinking Y either toward the origin (James and Stein 1961) or toward a fixed subspace of R^n (Lindley 1962) believed to lie close to θ. These estimators dominate the usual maximum likelihood estimator Y for sufficiently large n. See Fourdrinier, Strawderman, and Wells (2018) for a review of the extensive literature under this paradigm.
The empirical Bayes paradigm (Robbins 1955, 1964) considers only separable decision rules, defined as δ_i(Y) = d(Y_i) for a fixed function d. These rules can only use the ith observation to estimate the ith parameter. Robbins (1951) observed that, under squared-error loss, the separable rule that minimizes the aggregate risk is the oracle rule

d(y) = ∫ t f_t(y) dG(t) / ∫ f_t(y) dG(t),

where f_t denotes the density of F_t and G(t) = n^{-1} Σ_{i=1}^n 1(θ_i ≤ t) is the empirical distribution of the θ_i. Though the θ_i are nonrandom, the empirical Bayes paradigm views G and d as a prior distribution and its corresponding Bayes rule, respectively. Empirical Bayes methods estimate the oracle d by estimating G from the observed Y, either parametrically or nonparametrically, and then calculating its Bayes rule. Some empirical Bayes estimators can asymptotically achieve nearly the same aggregate risk as the oracle separable rule (Brown and Greenshtein 2009; Jiang and Zhang 2009). Greenshtein and Ritov (2009) showed that under mild conditions, the risk of this oracle can closely approximate the risk of the optimal rule among the larger class of estimators that are invariant to permutations of the indices i = 1, . . . , n.
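For squared-error loss in the Gaussian sequence problem, the oracle separable rule is the posterior mean under the "prior" G, the empirical distribution of the θ_i. The following minimal sketch (with an arbitrary three-point configuration of means, not taken from the article) compares its aggregate loss to that of the rule δ_i = Y_i:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
theta = rng.choice([-2.0, 0.0, 2.0], size=n)   # fixed means
y = theta + rng.normal(size=n)                 # Y_i ~ N(theta_i, 1)

def oracle_rule(yv, theta):
    # Posterior mean under the "prior" G = empirical distribution of theta:
    # d(y) = sum_j theta_j phi(y - theta_j) / sum_j phi(y - theta_j)
    w = np.exp(-0.5 * (yv[:, None] - theta[None, :]) ** 2)
    return (w @ theta) / w.sum(axis=1)

d = oracle_rule(y, theta)
print(np.mean((y - theta) ** 2), np.mean((d - theta) ** 2))
```

The oracle rule is infeasible because it requires the θ_i; empirical Bayes methods estimate G from Y and plug the estimate into the same formula.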

Incorporating Covariate Information
Frequently, external covariates C_i1, . . . , C_ip are also observed for each i, and recent research has mostly focused on how these can be incorporated into shrinkage estimators. Approaches stemming from the geometric shrinkage paradigm use covariates to construct a p-dimensional subspace of R^n when p < n and shrink Y toward this subspace to estimate θ (Sclove, Morris, and Radhakrishnan 1972; Oman 1982; Tan et al. 2015; Tan 2016). Approaches stemming from the empirical Bayes paradigm extend the notion of a separable decision rule to include rules of the form δ_i = d(Y_i, C_i1, . . . , C_ip), so that estimation of θ_i can use not only Y_i but also the ith set of covariates. They target the oracle Bayes rule

d(y, c_1, . . . , c_p) = ∫ t f_t(y) dG(t | c_1, . . . , c_p) / ∫ f_t(y) dG(t | c_1, . . . , c_p).

This conditional prior G(t | c_1, . . . , c_p) can be modeled parametrically (Xie, Kou, and Brown 2012, 2016; Kou and Yang 2017; Ignatiadis and Wager 2019b), semiparametrically (Jiang and Zhang 2010; Cohen, Greenshtein, and Ritov 2013), or nonparametrically (Jing et al. 2016; Gu and Koenker 2017; Feng and Dicker 2018; Weinstein et al. 2018; Fu, Sun, and James 2019), and the corresponding d can be estimated accordingly. The semi- and nonparametric estimation methods may be infeasible for even moderately large p due to the curse of dimensionality, so parametric assumptions may be necessary.
Other approaches to incorporating covariate information exist as well. Banerjee, Mukherjee, and Sun (2019) adapted methods for estimating sparse θ. Zhao (2020) estimates d by directly minimizing an estimate of its aggregate risk, rather than estimating a prior distribution. In fact, several empirical Bayes methods also use direct risk minimization ideas (Xie, Kou, and Brown 2012, 2016; Kou and Yang 2017; Ignatiadis and Wager 2019b), though the forms of their estimators are motivated by parametric models for G(t | c_1, . . . , c_p).

Incorporating Structural Information
Structural information poses challenges to the geometric and empirical Bayes shrinkage estimation paradigms. Suppose it were known that θ_i and θ_{i+1} are similar for i = 1, . . . , n, where θ_{n+1} is defined to equal θ_1. This suggests that both Y_i and Y_{i+1} should be used to estimate θ_i. In the geometric paradigm, this implies that Y should be shrunk toward a subspace containing (Y_2, Y_3, . . . , Y_n, Y_1)', but this subspace is now random, which is not allowed by existing methods. In the empirical Bayes paradigm, this suggests that the decision rule should depend on both Y_i and Y_{i+1}, but such a rule is nonseparable.
There appear to be very few shrinkage estimators that can make use of structural information. An exception is the recent work of Greenshtein, Mantzura, and Ritov (2019), who transformed the structural information setting into one amenable to the empirical Bayes paradigm. For Y_i ∼ N(θ_i, 1), they first use structural information to develop an initial estimate (θ̃_1, . . . , θ̃_n)' of θ, where θ̃_i cannot depend on Y_i but can depend on the Y_{i'}, i' ≠ i. They then argued that if the θ̃_i are accurate enough, the transformed data Y_i − θ̃_i should be, in a sense, approximately exchangeable across the indices i, so that it is natural to estimate the means of these residuals using separable rules d(Y_i − θ̃_i) that do not depend on i. They performed this estimation using a nonparametric empirical Bayes procedure and then added the estimates to the θ̃_i to give final estimates of the θ_i. Section 3.2 shows that the approach of Greenshtein, Mantzura, and Ritov (2019) has a natural interpretation under the proposed regression paradigm. Furthermore, the regression paradigm leads to alternate estimators as well as novel inference procedures.
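The two-step idea can be caricatured in a few lines. The sketch below is not the authors' procedure: it replaces their nonparametric empirical Bayes step with a simple James-Stein-type linear shrinkage of the residuals, and it uses a leave-self-out neighbor average as the initial estimate θ̃_i:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
theta = 3 * np.sin(2 * np.pi * np.arange(n) / n)  # smooth: neighbors are informative
y = theta + rng.normal(size=n)

# Step 1: initial estimate from the two circular neighbors only (excludes Y_i)
tilde = 0.5 * (np.roll(y, 1) + np.roll(y, -1))

# Step 2: shrink the residuals Y_i - tilde_i toward zero. Their noise variance is
# 1 + 0.5 = 1.5 here, so a James-Stein-type factor stands in for the paper's
# nonparametric empirical Bayes fit.
r = y - tilde
c = max(0.0, 1.0 - 1.5 * (n - 2) / np.sum(r**2))
theta_hat = tilde + c * r

print(np.mean((y - theta) ** 2), np.mean((theta_hat - theta) ** 2))
```

Because θ̃_i does not involve Y_i, the residual Y_i − θ̃_i retains independent noise, which is what makes a separable rule on the residuals sensible.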

Problem Formulation
Let Y_i have mean θ_i and finite variance σ_i^2, and let Y_1, . . . , Y_n be independent. Assume that the σ_i^2 are either known or that unbiased estimates σ̂_i^2 are available. Further suppose that covariates C_i1, . . . , C_ip are observed for each i, where each C_ij is either fixed or independent of Y. Finally, suppose that structural information about θ is also available and can be expressed as neighborhoods N_i = {i_1, . . . , i_q} about each i, such that θ_{i_k} may be informative for θ_i for k = 1, . . . , q, with i ∉ N_i. For example, if θ_i represents the expression level of gene i, N_i can contain the indices of the q genes whose expression levels are thought to be most similar to that of gene i. The θ_i and σ_i^2 are assumed to be nonrandom constants. Throughout, p and q are assumed to be constant with respect to n.

Regression Paradigm
As discussed in Section 2.3, incorporating structural information into the geometric or empirical Bayes shrinkage estimation paradigms is not straightforward. This section proposes that structural information can be more naturally accommodated by a third, lesser-known paradigm. First articulated by Stigler (1990), this paradigm views a shrinkage estimator as a fitted regression model that predicts θ_i using the observed data. For example, given Y_i ∼ N(θ_i, 1), Stigler (1990) first imagined that both the θ_i and the Y_i are observed. A natural way to predict θ_i using Y_i would then be to fit the linear model β_0 + β_1 Y_i, where estimates of the regression coefficients could be calculated using ordinary least squares. This is not possible in practice because the θ_i are not known, but Stigler (1990) illustrated how the observed Y_i alone can be used to obtain reasonable estimates β̂_0 and β̂_1 of the coefficients β_0 and β_1. The resulting fitted linear model using β̂_0 and β̂_1 turns out to correspond to the classical Efron-Morris shrinkage estimator (Efron and Morris 1973), which also has both geometric and empirical Bayesian interpretations.

While this regression paradigm has mostly been used as a descriptive device to reinterpret existing shrinkage estimators, this article proposes to use it as a prescriptive device for developing new ones. One advantage is that it can naturally incorporate covariates and structural information by simply including them as predictors in the regression model. The general approach is exactly analogous to standard regression problems: a first model-building step specifies how the data, covariates, and structural information should be used to estimate θ_i, and a subsequent fitting step estimates the unknown regression parameters. This regression perspective also offers an instructive reinterpretation of the empirical Bayes approach of Greenshtein, Mantzura, and Ritov (2019) to the structural information problem.
For Y_i ∼ N(θ_i, 1), Greenshtein, Mantzura, and Ritov (2019) first defined an initial estimator θ̃_i of θ_i that makes use of structural information but cannot depend on Y_i itself. This can be viewed as fitting an initial regression model. Together with classical results on compound estimation (Zhang 2003), the approach of Greenshtein, Mantzura, and Ritov (2019) can therefore be reinterpreted as a fitted version of a regression model of the form (2), where f_1 and the ν_i are unknown regression parameters.
This interpretation illustrates some advantages of the regression paradigm. First, it suggests that instead of fitting f_1 and the empirical Bayes estimator in two separate steps, it could be beneficial to fit them jointly, as the latter involves the former. This could be accomplished, for example, by extending the approach of Zhao (2020), who fit a nonparametric empirical Bayes estimator by minimizing an estimate of its aggregate risk (1). Second, it suggests alternate regression models that can be simpler to fit and interpret. Finally, it suggests new estimation and even inference methods, borrowing from established regression methods. The second and third ideas are explored below.
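Stigler's construction described above can be mimicked numerically. In this sketch (with an arbitrary choice of spread τ² = 4 for the means, not taken from the article), the infeasible OLS slope of θ on Y is compared to an Efron-Morris-style plug-in that estimates the same slope from Y alone:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
theta = rng.normal(0.0, 2.0, size=n)   # spread tau^2 = 4
y = theta + rng.normal(size=n)         # Y_i ~ N(theta_i, 1)

# Infeasible OLS of theta on Y: the slope estimates tau^2 / (tau^2 + 1) = 0.8
b1 = np.cov(theta, y)[0, 1] / np.var(y)

# Feasible plug-in: replace Var(theta) by Var(Y) - 1, in the spirit of
# Efron-Morris-type rules
b1_hat = (np.var(y) - 1.0) / np.var(y)
print(b1, b1_hat)  # both are close to 0.8
```

The feasible slope recovers the infeasible regression fit without ever observing θ, which is the essence of the paradigm's prescriptive use.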

Methods
Specifying a good model in the model-building step of the regression paradigm can be difficult. Model (2) is reasonable but may not work well when the Y_i are not normally distributed. To demonstrate the regression paradigm, this section instead proposes simple shrinkage estimators based on linear models, the default model choice in standard regression problems. The estimators are similar to linear empirical Bayes shrinkage methods except that they incorporate structural information. More complex models, such as (2), can also be specified. Define the p-dimensional vectors

X_i = (C_Ci', Y_i C_Yi', Y_{i_1} C_1i', . . . , Y_{i_q} C_qi')',

where the p observed covariates have been partitioned into the vectors C_Ci, C_Yi, and C_ki, k = 1, . . . , q. This article models θ_i using the decision rule

δ_i = X_i'β.   (3)

The covariates in C_Ci can be interpreted as main effect terms. For example, if the σ_i^2 were known, as in the heteroscedastic normal means problem, C_Ci could be taken to be (1, σ_i^2)'. The C_Yi contain terms that interact with Y_i. For example, C_Yi = (1, 1/σ_i^2)' allows Y_i to exert more of an effect on the decision rule when σ_i^2 is smaller. This is sensible because Y_i is a better estimate of θ_i when σ_i^2 is small. Finally, the C_ki interact with the neighboring observations Y_{i_k}. See Section 1 in the supplementary materials for more examples.
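The construction of the design rows X_i can be illustrated with a hypothetical toy configuration, using C_Ci = (1, σ_i²)′, C_Yi = (1, 1/σ_i²)′, and scalar neighbor covariates C_ki = 1 as in the examples above:

```python
import numpy as np

def build_design(y, sigma2, neighbors):
    """Design matrix for the linear shrinkage model: each row stacks the
    main-effect covariates C_Ci, the Y_i interactions Y_i * C_Yi, and the
    neighbor interactions Y_{i_k} * C_ki (here each C_ki is simply 1)."""
    rows = []
    for i in range(len(y)):
        c_main = [1.0, sigma2[i]]            # C_Ci = (1, sigma_i^2)
        c_y = [y[i], y[i] / sigma2[i]]       # Y_i * C_Yi with C_Yi = (1, 1/sigma_i^2)
        c_nb = [y[k] for k in neighbors[i]]  # Y_{i_k} * C_ki with C_ki = 1
        rows.append(c_main + c_y + c_nb)
    return np.asarray(rows)

n = 6
y = np.arange(1.0, n + 1)
sigma2 = np.ones(n)
nbrs = [[(i - 1) % n, (i + 1) % n] for i in range(n)]  # circular neighborhoods
X = build_design(y, sigma2, nbrs)
print(X.shape)  # (6, 6)
```

Each row is p-dimensional, and the fitted rule δ_i = X_i′β is linear in Y_i and in the neighboring observations.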
The aggregate risk function (1) for model (3) becomes

R_n(β) = n^{-1} Σ_{i=1}^n E{(θ_i − X_i'β)^2}.   (4)

Model (3) is ideally fit by minimizing R_n(β) over β, but R_n(β) cannot be calculated because it depends on the unknown θ_i. Instead, define the empirical risk function

R̂_n(β) = n^{-1} [ Σ_{i=1}^n {(Y_i − X_i'β)^2 − σ̂_i^2} + 2β'Z ],   (5)

where σ̂_i^2 is an unbiased estimate of σ_i^2 and Z = Σ_i σ̂_i^2 ∂X_i/∂Y_i. When the σ_i^2 are known, as in the heteroscedastic Gaussian sequence problem, σ̂_i^2 can be taken to equal σ_i^2. On the other hand, for certain distributions of Y_i it may not be sensible to assume known σ_i^2.

Proposition 1. The empirical risk function satisfies E R̂_n(β) = R_n(β).
Proposition 1 shows that R̂_n(β) is an unbiased estimate of R_n(β). If the Y_i were normally distributed, this would be a direct consequence of Stein's lemma (Stein 1981), but Proposition 1 holds for nonnormal Y_i thanks to the simplicity of regression model (3). It is also possible to extend this risk estimate to more complex models. For example, it is straightforward to show that the risk of a regression model that is polynomial in Y_i with degree d can be estimated as long as unbiased estimates of all central moments of Y_i up to the (d + 1)th are available (Biscarri 2019).
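The unbiasedness in Proposition 1 can be checked by simulation. The sketch below uses arbitrary toy values: the rule is δ_i = β_0 + β_1 Y_i with known σ_i² = 1, so that Z = (0, n)′, and the Monte Carlo average of the empirical risk is compared to the exact risk:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 4000
theta = rng.normal(0.0, 2.0, size=n)   # fixed means
beta = np.array([0.1, 0.6])            # fixed coefficients: delta_i = b0 + b1 * Y_i

# Exact aggregate risk: E(theta_i - b0 - b1*Y_i)^2 = (theta_i - b0 - b1*theta_i)^2 + b1^2
true_risk = np.mean((theta - beta[0] - beta[1] * theta) ** 2 + beta[1] ** 2)

est = 0.0
for _ in range(reps):
    y = theta + rng.normal(size=n)
    X = np.column_stack([np.ones(n), y])
    Z = np.array([0.0, float(n)])      # Z = sum_i sigma_i^2 dX_i/dY_i with sigma_i^2 = 1
    est += (np.sum((y - X @ beta) ** 2) - n + 2 * beta @ Z) / n

print(true_risk, est / reps)  # the two values agree closely
```

The 2β′Z correction removes the bias caused by Y_i appearing both in the "response" and in the design row X_i.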
This article proposes to fit model (3) by minimizing the empirical risk (5) over β. Two procedures that accomplish this task are introduced here and implemented in the R package cole. Define the n × p design matrix X = (X_1, . . . , X_n)'. Because R̂_n(β) is quadratic in β, the proposed unconstrained estimator is

β̂_n = arg min_β R̂_n(β) = (X'X)^{-1}(X'Y − Z),   (6)

which exists if p < n and X'X is invertible. The corresponding estimate of θ is Xβ̂_n. The proposed constrained estimator is

β̂_n^M = arg min_{β ∈ M_n} R̂_n(β)   (7)

for a fixed compact set M_n ⊂ R^p, so that the corresponding estimate of θ is Xβ̂_n^M. This type of constrained estimator is not commonly found in the shrinkage literature, but is natural when the problem is viewed from the regression context. A useful choice for M_n is {β : ||β||_1 ≤ M_n} for a nonrandom M_n that can grow with n, which gives a lasso-type procedure. The constrained estimator can be extremely useful when the number of covariates exceeds n, as in the gene expression denoising example in Section 6. The M_n can be chosen using cross-validation, similar to what is done in the standard regression setting; this is described in detail in Section 5.1.
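A minimal sketch of the unconstrained fit follows, on hypothetical data with unit variances; the design stacks an intercept, Y_i, and its two circular neighbors, so that Z has a single nonzero entry:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
theta = np.repeat(rng.normal(0, 1, size=8), n // 8)  # piecewise constant means
y = theta + rng.normal(size=n)

# Design: intercept, Y_i, and the two circular neighbors of Y_i
X = np.column_stack([np.ones(n), y, np.roll(y, 1), np.roll(y, -1)])
# Z = sum_i sigma_i^2 dX_i/dY_i: only the Y_i column depends on Y_i (sigma_i^2 = 1)
Z = np.array([0.0, float(n), 0.0, 0.0])

# Unconstrained minimizer of the quadratic empirical risk
beta_hat = np.linalg.solve(X.T @ X, X.T @ y - Z)
theta_hat = X @ beta_hat
print(np.mean((y - theta) ** 2), np.mean((theta_hat - theta) ** 2))
```

Subtracting Z prevents the fit from simply reproducing Y, pulling weight away from the noisy Y_i column and toward the neighbors.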
One advantage of the regression paradigm is that it suggests inference procedures for shrinkage estimation, by analogy with inference in standard regression modeling. Such results have so far been rare in the literature. One application is to provide distributional results for β̂_n, which can help quantify the significance of the contribution of each component of X_i to estimating θ_i. See Section 4.2 for a more detailed discussion.

Proposed Constrained Estimator
Another advantage of the regression paradigm is that theoretical properties of the resulting shrinkage estimators can sometimes be established by extending analogous results from the standard regression setting. This is demonstrated for the methods proposed in Section 3.3. The constrained estimator is the simpler of the two proposed estimators to study, due to the compactness of its parameter space. Assumption 1, stated for regression model (3), is required to ensure that the empirical risk function is asymptotically well-behaved; it requires certain sums of moments and covariances involving the covariates and the Y_i to be O(n) for each j and k = 1, . . . , q. In a Bayesian interpretation where the covariates C_ij, the population-level parameters of the Y_i, and var(σ̂_i^2) are thought of as being drawn from a joint prior distribution, Assumption 1 controls their moments and covariances. This is similar to conditions (A) and (B) of Theorem 3.1 of Xie, Kou, and Brown (2012) and to conditions in Jiang and Zhang (2009) on the pth weak moment of the prior. Assumption 1c also considers covariances with moments of the neighboring Y_{i_k}.
To characterize the asymptotic risk of the constrained estimator, define the loss function

ℓ_n(β) = n^{-1} Σ_{i=1}^n (θ_i − X_i'β)^2,

so that E ℓ_n(β) = R_n(β), and the oracle constrained least-squares estimator β̃_n^M = arg min_{β ∈ M_n} ℓ_n(β).
Theorem 1 shows that the empirical risk function R̂_n(β) is close to the true loss function ℓ_n(β) uniformly over β ∈ M_n for the compact set M_n ⊂ R^p.

Theorem 1. If sup_{β ∈ M_n} ||β||_∞ = o(n^{1/2}), then under Assumption 1, sup_{β ∈ M_n} |R̂_n(β) − ℓ_n(β)| = o_P(1).

Theorem 1 suggests that minimizing R̂_n(β) over M_n should be asymptotically equivalent to minimizing ℓ_n(β), so the proposed constrained estimator β̂_n^M should behave similarly to the oracle β̃_n^M. Theorem 2 shows that β̂_n^M can achieve the same risk as the oracle ordinary least squares estimator of β calculated with complete knowledge of the θ_i. Similar properties have been shown for previously proposed shrinkage estimators without structural information (Xie, Kou, and Brown 2012, 2016; Kou and Yang 2017).

Proposed Unconstrained Estimator
The behavior of the unconstrained estimator β̂_n (6) is a bit more difficult to characterize than that of the constrained estimator β̂_n^M. Convergence of R̂_n(β) to ℓ_n(β) is not uniform over all β ∈ R^p, so the asymptotic risk of β̂_n cannot be studied using the above approach. However, the close relationship between the proposed method and regression modeling suggests that β̂_n can be studied using oracle inequalities for misspecified linear models (Rigollet and Hütter 2017, chap. 3).
An additional assumption, similar in spirit to Assumption 1, is necessary. Assumption 2, also stated for regression model (3), requires certain additional sums of moments and covariances to be O(n) for each j, j′ and k, k′ = 1, . . . , q. To derive the asymptotic risk optimality of the unconstrained estimator, define the oracle least-squares estimator

β̃_n = arg min_{β ∈ R^p} ℓ_n(β) = (X'X)^{-1} X'θ.   (10)

The following theorem shows that β̂_n − β̃_n converges to zero in probability under some conditions on the index neighborhoods N_i that encode the structural information. Define

D_i = |{i′ : i′ ∈ N_i or i ∈ N_{i′}}|,   (11)

so that D_i is the number of indices that either neighbor i or include i in their neighborhoods, and let D_n denote the maximum of the D_i.
Theorem 3. If E(X'X/n) converges to a positive-definite matrix and D_n = o(n), then under Assumptions 1 and 2, β̂_n − β̃_n = o_P(1).
Theorem 3 can be used to show that the loss of the decision rule using β̂_n is asymptotically as low as that of the rule using β̃_n, in probability. This result is not as strong as the analogous result in Theorem 2 for β̂_n^M, which establishes that the losses converge in expectation. Nevertheless, Theorem 3 still indicates that β̂_n is a useful surrogate for β̃_n.
Theorem 4. If n^{-1} Z converges to a constant vector, with Z defined as in (5), then under the conditions of Theorem 3, ℓ_n(β̂_n) − ℓ_n(β̃_n) = o_P(1).
Under the regression modeling perspective, it is natural to pursue distributional results for β̂_n. These can be derived under stronger conditions, particularly on D_n (11).

Theorem 5. If V_n (12) converges to a positive-definite matrix and D_n = O(1), then under the conditions of Theorem 3,

n^{1/2} Σ_n^{-1/2} (β̂_n − β̃_n) converges in distribution to N(0, I_p),

where I_p is the p × p identity matrix and Σ_n = (X'X/n)^{-1} V_n (X'X/n)^{-1}.
Section 2 of the supplementary materials presents an estimate V̂_n of V_n (12), which can be shown to be unbiased if certain central moments of Y_i are known.
Proposition 2. If σ_i^2, κ_3i, and κ_4i are known, where κ_3i and κ_4i are the third and fourth central moments of Y_i, respectively, then E V̂_n = V_n for V̂_n defined in Equation (1) of the supplementary materials.
The conditions of Proposition 2 are reasonable in several important settings. In the heteroscedastic Gaussian sequence problem with Y_i ∼ N(θ_i, σ_i^2) and known σ_i^2, κ_3i = 0 and κ_4i = 3σ_i^4. With a few additional assumptions similar to Assumptions 1 and 2, V̂_n can also be shown to be consistent. When assuming known higher central moments is not sensible, for example for Poisson-distributed Y_i, a consistent estimate of V_n (12) may still exist but requires a separate calculation.
Theorem 5 and Proposition 2 enable new inferential methods for shrinkage estimation. For example, they can be used to construct confidence intervals for the components of β̃_n. However, the interpretation of these intervals requires care, because they do not cover population-level parameters but rather components of the oracle estimator β̃_n (10). This is still informative, because β̃_n quantifies how useful the components of X_i are for estimating θ_i conditional on Y, and this conditional setting is usually the setting of interest in shrinkage estimation. Confidence intervals for β̃_n can be useful in several ways, for example to identify important covariates or structural information, or to aid in model-building. It is also possible to use these results to develop intervals for the estimated θ_i; such intervals were developed for empirical Bayes shrinkage estimators by Ignatiadis and Wager (2019a).

Methods
This section studies the performance of the proposed unconstrained (6) and constrained (7) estimators on simulated data. The constrained estimator was implemented using the lasso-type constraint set M_n = {β : ||β||_1 ≤ M_n}. The tuning parameter M_n was chosen using 3-fold cross-validation: the observed Y_i were successively split into training and testing folds, estimators were fit for 11 equally spaced choices of M_n between 0 and 5, and the performance of each fit was evaluated using the mean squared error between the predicted and observed Y_i in the testing fold. The optimal M_n was then used to refit the estimator on the entire data to produce estimates of the θ_i.
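The cross-validation mechanics can be sketched with a simpler one-parameter family of rules (an illustrative stand-in, not the lasso-constrained fit itself): each candidate rule blends a structural prediction, the circular neighbor average, with the training-fold mean, and the blend weight is chosen by 3-fold cross-validation against held-out Y_i:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
theta = np.repeat(rng.normal(0, 1, size=10), n // 10)
y = theta + rng.normal(size=n)
nb_avg = 0.5 * (np.roll(y, 1) + np.roll(y, -1))   # structural prediction for Y_i

folds = np.arange(n) % 3                          # neighbors land in other folds
grid = np.linspace(0.0, 1.0, 11)                  # 11 candidate tuning values
cv_err = []
for w in grid:
    err = 0.0
    for f in range(3):
        test = folds == f
        # Predict held-out Y_i without using Y_i itself
        pred = w * nb_avg[test] + (1 - w) * np.mean(y[~test])
        err += np.mean((pred - y[test]) ** 2)
    cv_err.append(err / 3)
best = grid[int(np.argmin(cv_err))]
```

Because the fold assignment is `i % 3`, the circular neighbors of each test index fall in the training folds, so the held-out prediction never uses the held-out observation.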
The proposed methods were compared to several alternatives: the methods of Weinstein et al. (2018) for heteroscedastic Gaussians, Xie, Kou, and Brown (2012, 2016) for heteroscedastic Y_i from an exponential family with a quadratic variance function, Kou and Yang (2017) for heteroscedastic Y_i with covariate information, and Greenshtein, Mantzura, and Ritov (2019) for homoscedastic Gaussians with structural information. The parametric method of Kou and Yang (2017) performed much better than the nonparametric version and is thus the only one reported. To adapt their method to heteroscedasticity, Greenshtein, Mantzura, and Ritov (2019) suggested standardizing the Y_i to unit variance after including the σ_i^2 as covariates; this was done here. Finally, the oracle unconstrained estimator β̃_n (10) was implemented as well.

Setting I
The θ_i were defined by first constructing a piecewise constant function of the index i, where the lengths of the pieces were randomly generated from Poisson(25) and the constant values were drawn from N(0, 1); see Figure 1. The θ_i were generated by adding N(0, 1) noise to each point of this piecewise function. Two covariates were also generated: Y_i1 was constructed to be informative for θ_i, while Y_i2, drawn from N(0, 4), was not. Finally, structural information was generated by taking advantage of the piecewise-constant-like nature of the θ_i, by letting N_i = {i − 2, i − 1, i + 1, i + 2}, where the indices −1, 0, n + 1, and n + 2 are defined to be n − 1, n, 1, and 2, respectively. All quantities except the Y_i were fixed across simulation replications. Figure 1 illustrates n = 100 data points generated under this setting.
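This generation scheme can be sketched as follows. The informative-covariate construction is a guess, since only its informativeness is stated in the text: here Y_i1 adds N(0, 4) noise to θ_i while Y_i2 is pure N(0, 4) noise:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
# Piecewise constant function: Poisson(25) piece lengths, N(0, 1) levels
lengths, levels = [], []
while sum(lengths) < n:
    lengths.append(max(1, rng.poisson(25)))
    levels.append(rng.normal())
pw = np.repeat(levels, lengths)[:n]
theta = pw + rng.normal(size=n)       # theta_i: piecewise function plus N(0, 1) noise

# Covariates: Y_i1 informative for theta_i, Y_i2 pure noise (hypothetical forms)
y1 = theta + rng.normal(0, 2, size=n)
y2 = rng.normal(0, 2, size=n)

# Circular neighborhoods N_i = {i-2, i-1, i+1, i+2}
nbrs = [[(i + d) % n for d in (-2, -1, 1, 2)] for i in range(n)]
```

The modular arithmetic implements the circular boundary convention, so that index −1 maps to n − 1 and index n + 1 maps to 1 (in zero-based terms).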
All shrinkage estimators were first implemented incorporating only the σ_i^2. For the methods of Kou and Yang (2017) and Greenshtein, Mantzura, and Ritov (2019), this was done by including an intercept and σ_i^2 as covariates. The proposed methods were implemented with C_Ci = (1, σ_i^2)' and C_Yi = (1, 1/σ_i^2)', so that the effect of Y_i in regression model (3) would be stronger for smaller σ_i^2. Next, all methods except those of Weinstein et al. (2018) and Xie, Kou, and Brown (2012, 2016) were implemented again after adding Y_i1 and Y_i2 to the covariates C_Ci. Finally, the methods of Kou and Yang (2017) and Greenshtein, Mantzura, and Ritov (2019) and the proposed methods were implemented a third time after adding structural information. For the first two methods, the Y_{i_k} for i_k ∈ N_i were added as covariates, and for the proposed methods each C_ki was set to 1 for k = 1, . . . , 4.

Setting II
Figure 1. Example data generated under simulation Setting I. The θ_i were generated by adding noise to the piecewise constant function, the σ_i^2 were generated as a noisy function of the θ_i, and the Y_i were generated from N(θ_i, σ_i^2).

The Y_i were generated from negative binomial distributions, roughly following the simulation procedure of Xie, Kou, and Brown (2016). Mean parameters μ_i were generated to be equal to a piecewise constant function of the index i constructed as in Setting I, except that the constant values were randomly drawn from |N(0, 400)|. Size parameters m_i were drawn from 4 + 4 Poisson(3). These were used to generate observed data Y_i = Ỹ_i/m_i, where Ỹ_i was drawn from a negative binomial distribution with mean μ_i and size m_i. The parameter of interest was EY_i = θ_i, and the variance of Y_i was σ_i^2 = (μ_i + μ_i^2/m_i)/m_i^2. Covariates Y_i1 and Y_i2 and structural information N_i were generated as in Setting I, and all quantities except the Ỹ_i were fixed across simulation replications.
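This generation step can be sketched as below, with toy stand-ins for the piecewise means. Note that numpy parameterizes the negative binomial by (size, p), so matching a mean of μ_i at size m_i requires p_i = m_i/(m_i + μ_i):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100
mu = np.abs(rng.normal(0, 20, size=n))   # stand-in for the piecewise means
m = 4 + rng.poisson(3, size=n)           # size parameters

# numpy's negative_binomial(size, p) has mean size * (1 - p) / p; setting
# p = m / (m + mu) gives mean mu and variance mu + mu^2 / m for Y_tilde
p = m / (m + mu)
y_tilde = rng.negative_binomial(m, p)
y = y_tilde / m                          # observed data, E[Y_i] = mu_i / m_i
var_y = (mu + mu**2 / m) / m**2          # matches the stated variance formula
```

Dividing the counts Ỹ_i by m_i rescales both the mean and the variance, yielding the σ_i² formula stated above.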
The shrinkage estimation methods were implemented as above, but the method of Xie, Kou, and Brown (2012) was replaced with that of Xie, Kou, and Brown (2016), which is specifically designed for negative binomial Y_i. The methods of Weinstein et al. (2018), Kou and Yang (2017), and Greenshtein, Mantzura, and Ritov (2019) were designed for Gaussian Y_i but were applied directly here.

Setting III
This setting was motivated by recent work on denoising single-cell RNA sequencing data (Li and Li 2018), which modeled normalized gene expression levels using Gamma-Normal mixture distributions. The θ_i were generated by first constructing a piecewise constant function as in Setting I, but with constant values drawn from |N(0, 1)|, and then adding |N(0, 1)|/2 noise to this function. Covariate and structural information were generated as in Settings I and II, and all quantities except the Y_i were fixed across simulation replications.
In this setting, the σ_i^2 were not assumed to be known. Instead, for each i, m additional observations were generated following the same distribution as Y_i, and variance estimates σ̂_i^2 were calculated from these observations. These estimates were used when applying each of the shrinkage methods, which were implemented as in Setting I.

Results

Figure 2 reports the average mean squared errors over 200 replications of each of the simulation settings for different numbers of observations n. In every setting, using structural information improved estimation accuracy compared to using only the σ_i^2 and the covariates. This information can only be leveraged by the proposed methods and those of Kou and Yang (2017) and Greenshtein, Mantzura, and Ritov (2019). Among these, the proposed unconstrained estimator appeared to be either the best or among the best performers for n ≥ 500. The method of Greenshtein, Mantzura, and Ritov (2019) was only meant for Gaussian Y_i, though in principle it could be extended to other distributions of Y_i. The proposed unconstrained method and that of Kou and Yang (2017) performed similarly, though the former was better in Setting I and also has asymptotic risk optimality guarantees when using structural information.
The different values of m used in Setting III illustrate the impact of using estimates σ̂_i^2 instead of the true σ_i^2. Estimation errors were lower when m = 100 observations were used, though results were generally not much worse when m = 10, except for the proposed constrained estimator. When m = 100, the method of Kou and Yang (2017) performed better than the proposed methods, especially for small n. This is likely because in Setting III, 80% of the Y_i are Gamma-distributed, so they have means θ_i and variances θ_i/β_i. The variances therefore provide a great deal of information about the means, and the method of Kou and Yang (2017) appeared to make better use of the variances. In general, the proposed methods seemed to outperform the method of Kou and Yang (2017) when σ_i^2 was informative for θ_i but not so strongly related to it, as in Setting I.
To validate the asymptotic normality result of Theorem 5 and the covariance matrix estimate of Proposition 2, Figure 3 plots estimated densities of every component of n^{1/2} Σ̂_n^{-1/2} (β̂_n − β̃_n) from every regression model fit in Settings I and II using the proposed unconstrained method for n ≥ 600. Theorem 5 requires the third and fourth central moments of Y_i to be known, and in Settings I and II these could be calculated from the known distributions of the Y_i. Densities were not plotted for Setting III because there the moments were estimated rather than known. The results matched the standard normal density fairly well.
Theorem 5 can thus be used to conduct inference. Table 1 reports the average oracle and estimated coefficients and their average standard errors for each of the regression models fit under Setting I with n = 1000. The oracle coefficients were all within two estimated standard errors of the estimated coefficients. The results also reflected the fact that the covariate Y_i1 was informative for the θ_i while Y_i2 was not. However, they also illustrate an interesting nuance: the average oracle estimate for the coefficient of Y_i2 was 0.013, not zero. This is because the Y_i2 and the θ_i were only generated once and then fixed across the replications, so a nonzero correlation exists between any fixed realization of the Y_i2 and the θ_i. This implies that Theorem 5 should generally only be used to construct confidence intervals for each component β̃_nj of the oracle estimator (10), not to conduct hypothesis tests, as the value of β̃_nj under the hypothesis of no association is not necessarily zero.

Data Analysis
Single-cell RNA-sequencing allows quantification of gene expression in individual cells and is quickly becoming a dominant technology in genomics research. A major issue is that the observations can be extremely noisy, which has inspired a great deal of recent work on statistical methods to denoise the observed gene expression levels. Denoising can be formulated as a simultaneous estimation problem. In a given cell, let Y_i represent the observed expression level of gene i and θ_i its true expression. The goal of denoising is to find estimates θ̂_i of θ_i with low aggregate risk n^{−1} Σ_i E{(θ̂_i − θ_i)^2}. Many authors have recognized that denoising can be improved by leveraging external information. For example, sequencing experiments can involve thousands to millions of cells, and expression levels in these other cells can be used as covariates to help denoise a particular cell. Specifically, let C_ij be the expression level of gene i in another cell j from the same experiment; these should be informative for θ_i (Li and Li 2018; Van Dijk et al. 2018; Wang et al. 2019). Recently, other authors (Elyanow et al. 2019) have also incorporated structural information in the form of gene–gene relationships: if genes i and i′ are known to be closely related, θ_i and θ_{i′} are likely to be similar.

This section illustrates these ideas using single-cell RNA-sequencing data from Zeisel et al. (2015), available from http://linnarssonlab.org/cortex/. Expression levels were obtained from 3005 cells from the mouse somatosensory cortex, after adding 57 synthetic RNA molecules at known concentrations (Jiang et al. 2011). The data were normalized and cells were clustered following standard procedures using Seurat v3 (Stuart et al. 2019). In each cell, the Y_i for i = 1, . . . , 57 were taken to be the observed expression values of the synthetic RNA molecules.
The various shrinkage estimators described in Section 5.1 were applied to each cell to denoise the Y_i. In each cell, σ_i^2 was estimated using the sample variance of the ith molecule across all cells in the cluster to which the cell belonged, and the third and fourth central moments κ_3i and κ_4i were estimated similarly.
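The plug-in moment estimates can be sketched as follows; `cluster_values` stands for the expression of one molecule across all cells in a cluster, and the names are illustrative rather than taken from the paper's implementation:

```python
def sample_variance(xs):
    """Unbiased sample variance, used as the estimate of sigma_i^2."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def central_moment(xs, k):
    """kth sample central moment, used to estimate kappa_3i and kappa_4i."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

cluster_values = [1.0, 2.0, 3.0, 4.0]  # toy data for one molecule
var_hat = sample_variance(cluster_values)
kappa3_hat = central_moment(cluster_values, 3)
kappa4_hat = central_moment(cluster_values, 4)
```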
To illustrate the impact of structural information, N_i contained the indices of the two molecules with true concentrations closest to that of molecule i. Finally, expression values from all other cells were included as covariate information; that is, C_ij equaled the expression of molecule i in cell j, j = 1, . . . , 3004. The number of covariates exceeded the number of observations n, so the proposed constrained denoising method was the only one that could be applied. The estimators were compared in terms of the correlations between their denoised Y_i and the true concentrations; this validation procedure follows Li and Li (2018). Spearman correlations were calculated in each cell, and Figure 4 reports boxplots of the 3005 correlations produced by each denoising procedure. Denoising was generally beneficial, although some methods performed poorly in some cells. This may be because the methods were designed to have low risk, which does not necessarily imply high correlations. Incorporating structural information improved over methods that used only Y_i and σ_i^2, though again the proposed unconstrained method gave lower correlations in some cells. Adding covariate information using the proposed constrained method gave the best results.
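The validation metric can be computed without any statistical libraries; the minimal sketch below assumes no ties among the values (in practice, a routine such as scipy.stats.spearmanr handles ties):

```python
import math

def ranks(xs):
    """Ranks starting at 1; assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y))
    return num / den

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Because the Spearman correlation is invariant to monotone transformations of the denoised values, it is a natural metric when only the relative ordering of the concentrations matters.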
The distributional results in Theorem 5 and Proposition 2 were also employed to determine whether the two components of the oracle β̄_n corresponding to the structural information were significantly different from zero. The proposed unconstrained estimator β̂_n was applied to each of the 3005 cells, and p-values for the structural information were calculated using a two degree-of-freedom χ^2-test. More than 94% of the cells had p-values less than 0.05, confirming that the structural information was indeed useful for denoising.
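A two degree-of-freedom Wald test of this form can be sketched directly: with the 2×2 estimated asymptotic covariance of the two structural coefficients, the statistic is W = n b'Σ̂^{−1}b, and for two degrees of freedom the χ² survival function is simply exp(−W/2). The numbers below are illustrative, not taken from the data analysis.

```python
import math

def wald_chi2_2df(b, cov, n):
    """Wald test that the two coefficients in b are jointly zero.
    cov is their 2x2 estimated asymptotic covariance (of the
    sqrt(n)-scaled estimates); returns the statistic W and the
    chi-squared (2 df) p-value exp(-W/2)."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    qb = [inv[0][0] * b[0] + inv[0][1] * b[1],
          inv[1][0] * b[0] + inv[1][1] * b[1]]
    w = n * (b[0] * qb[0] + b[1] * qb[1])
    return w, math.exp(-w / 2)

# Illustrative: identity covariance, n = 1, b = (1, 1) gives W = 2.
w, p = wald_chi2_2df([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], 1)
```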

Discussion
There are close connections between the regression and empirical Bayes paradigms. Empirical Bayes methods arrive at decision rules by first imagining a prior on the θ_i and then deriving the corresponding Bayes rule. Parametric or nonparametric modeling occurs at the level of the prior, and these models are translated into a decision rule via the usual Bayesian machinery. The regression approach, on the other hand, can be thought of as directly modeling the Bayes rule without first appealing to a prior or even a likelihood for Y_i. In some cases, the regression model for the decision rule may translate into a model for a prior. For example, models that are linear in Y_i are implicitly imposing a conjugate prior on the θ_i (Diaconis and Ylvisaker 1979).
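This correspondence can be made concrete in the Gaussian case, via a standard conjugacy calculation not specific to this article: with Y_i | θ_i ∼ N(θ_i, σ²) and prior θ_i ∼ N(μ, τ²), the Bayes rule is

```latex
E(\theta_i \mid Y_i)
  = \frac{\tau^2}{\tau^2 + \sigma^2}\, Y_i
  + \frac{\sigma^2}{\tau^2 + \sigma^2}\, \mu ,
```

which is linear in Y_i. Conversely, positing a linear decision rule a + bY_i with b ∈ (0, 1) implicitly determines a normal prior through μ = a/(1 − b) and τ² = bσ²/(1 − b).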
The regression paradigm is especially well-suited to incorporating structural information into shrinkage estimators. Including Y_{i′}, i′ ≠ i, in a regression model for θ_i poses neither practical nor conceptual difficulties. Indeed, such models are common in standard regression contexts; for example, including Y_{i−1} in a model for E(Y_i) is exactly an autoregressive model. The regression approach is also well-suited to settings with a large number of covariates, as in the single-cell sequencing application in Section 6. Sophisticated methods such as the lasso have already been developed in standard contexts and can be directly applied here.
One issue with the approaches proposed in this article is that the simple regression model (3) may be far from optimal. On the other hand, it is difficult to define an oracle decision rule in the presence of structural information. If the form of the oracle were known, it could be used to specify a corresponding regression model. Otherwise, it may be possible to adapt diagnostic methods from the classical regression setting to the simultaneous estimation context. These may help identify situations where model (3) is not appropriate, and may suggest ways to improve the form of the model.

Supplementary Materials
The supplementary materials contain illustrations of the regression approach to shrinkage estimation, an estimator for the asymptotic variance of the proposed unconstrained estimator, and proofs of all results.