Inference and Estimation for Random Effects in High-Dimensional Linear Mixed Models

Abstract We consider three problems in high-dimensional linear mixed models. Without any assumptions on the design for the fixed effects, we construct asymptotic statistics for testing whether a collection of random effects is zero, derive an asymptotic confidence interval for a single random effect at the parametric √n rate, and propose an empirical Bayes estimator for a part of the mean vector in ANOVA-type models that performs asymptotically as well as the oracle Bayes estimator. We support our theoretical results with numerical simulations and provide comparisons with oracle estimators. The procedures developed are applied to the Trends in International Mathematics and Science Study (TIMSS) data. Supplementary materials for this article are available online.


Introduction
In the past two decades, there has been a lot of progress in the theory of high-dimensional linear models. However, its close cousin, the high-dimensional linear mixed model, has received significantly less attention; it was not until the past decade that procedures for estimation appeared. Consider a linear mixed model given by

Y = μ + Zν + Wγ + ε, (1)

with Z ∈ R^{n×q}, W ∈ R^{n×d}, and Y, μ, ε ∈ R^n; the vector μ and the pair ν and γ are the fixed effects and the random effects, respectively. In addition, we observe covariates X ∈ R^{n×p} such that μ ≈ Xβ for some sparse vector β ∈ R^p (see Section 1.2 for a rigorous definition). Here, X is the component of the design corresponding to the fixed effects and (Z, W) the component corresponding to the random effects. We consider the setting where the random effects are low-dimensional, q + d < n, but the fixed effects are high-dimensional, p > n. We have separated the random effects in two to emphasize that we are later interested in ν and view γ as nuisance parameters. Various authors have considered different aspects of this problem.
The earliest work, by Schelldorfer, Bühlmann, and van de Geer (2011), proposed an estimator for both β and the variance components using a lasso-type approach. These approaches were later extended by several authors, who considered estimation with both convex penalties, such as Groll and Tutz (2014), and nonconvex penalties, such as Wang, Zhou, and Qu (2012). There is also a growing literature on model selection in high-dimensional linear mixed models (see, e.g., the review article by Müller, Scealy, and Welsh 2013).
The problem of inference is somewhat less well studied. To the best of our knowledge, hypothesis testing problems were first considered by Chen et al. (2015) for random effects and by Bradic, Claeskens, and Gueuning (2017) for fixed effects. However, the work of Chen et al. (2015) only considers the special case of ANOVA designs for random effects. During the preparation of this manuscript, we became aware of the independent work of Li, Cai, and Li (2021), who consider the problem of inference in high-dimensional linear mixed models. In particular, they discuss inference for fixed effects and estimation of variance components. A more detailed comparison of our methodology with Li, Cai, and Li (2021) is deferred to Section 2.4. We also note that there is a parallel notion of high-dimensional mixed models, where the fixed effects are low-dimensional while the random effects are high-dimensional. Under this setting, Jiang et al. (2016) established asymptotic results for the restricted maximum likelihood estimator of the variance components.
The goal of the present article is to contribute to this growing literature on high-dimensional linear mixed models where the fixed effects are high-dimensional, both in terms of estimation and inference. In particular, we consider three related problems:
1. Testing whether a collection of random effects is zero.
2. Constructing confidence intervals for the variance of a single random effect.
3. Estimating part of the mean vector using empirical Bayes in Gaussian ANOVA-type models.
Our methodology is inspired by both low-dimensional linear mixed models and high-dimensional linear models. Specifically, our approach to all three problems starts by considering a procedure for the corresponding low-dimensional problem and retrofitting it with tools and techniques from high-dimensional linear models to produce a procedure for high-dimensional linear mixed models. Throughout the article, while we consider general linear mixed effects models, we use the balanced one-way ANOVA model to simplify the discussion of our estimators and assumptions.

Organization of the Paper
We end the current section with a description of the notation that we adopt throughout the article. Sections 2, 3, and 4 consider, in succession, the three problems outlined in the Introduction. Each section starts with a description of the problem setup, a brief motivation from the low-dimensional problem, and a description of the estimator that is considered, and ends with some theoretical results. In Sections 5 and 6, we provide the results of our simulations and a real data application, respectively. For ease of presentation, we defer all proofs and additional simulation results to the supplementary material.

Notation
Throughout, all of our variables depend on n, but we suppress this dependence when it does not cause confusion. For a general vector a and matrix A, let ‖a‖_2 denote the standard Euclidean norm, with the dimension of the space implicit from the vector, ‖A‖_2 the operator norm, and ‖A‖_HS the Hilbert–Schmidt norm. Furthermore, if A is square, then λ_max(A) and λ_min(A) denote the maximal and minimal eigenvalues of A, respectively. For any k ∈ N, we let λ_max,k(A) denote the kth largest eigenvalue of A if A is square. Moreover, we write 1_k ∈ R^k and I_k ∈ R^{k×k} to denote the k-dimensional vector of all ones and the k-dimensional identity matrix, respectively. For two matrices A and B, the notation A ⊖ B denotes the intersection of the column space of A and the orthogonal complement of the column space of B. Then, for a matrix A, we write P_A to denote the projection onto the column space of A and P⊥_A the projection onto its orthogonal complement. Moreover, we write r_A to denote the rank of A.
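As a concrete illustration of this notation, the projections P_A, P⊥_A, and the projection associated with A ⊖ B can be computed numerically. The following minimal NumPy sketch (the matrices A and B are made-up examples, not from the article) builds the projection for A ⊖ B by first projecting the columns of A off the column space of B:

```python
import numpy as np

def proj(A):
    """Orthogonal projection onto col(A), via a reduced QR decomposition."""
    Q, _ = np.linalg.qr(A)
    return Q @ Q.T

rng = np.random.default_rng(0)
n = 10
A = rng.standard_normal((n, 3))
B = rng.standard_normal((n, 2))

P_B = proj(B)
P_B_perp = np.eye(n) - P_B  # P⊥_B

# Projection associated with A ⊖ B: project the columns of A off col(B)
# first, then project onto the resulting subspace, which lies in col(B)^⊥.
P_A_minus_B = proj(P_B_perp @ A)

# Sanity checks: projections are symmetric, idempotent, trace = rank,
# and P_{A ⊖ B} annihilates col(B).
assert np.allclose(P_A_minus_B, P_A_minus_B.T)
assert np.allclose(P_A_minus_B @ P_A_minus_B, P_A_minus_B)
assert np.isclose(np.trace(P_A_minus_B), 3.0)
assert np.allclose(P_A_minus_B @ B, np.zeros((n, 2)))
```

For generic matrices, the rank of this projection is rank([A, B]) − rank(B), matching the rank identities used in the F-statistics below.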
Consistent with other high-dimensional works, we assume that β is a sparse vector. There are various notions of sparsity, but we assume the slightly more general setting of weak sparsity from Law and Ritov (2021). Before providing the definition, we introduce some more notation. For u ∈ N, we let M_u ≜ {m ⊆ {1, . . ., p} : |m| = u} denote the collection of all models with the dimension of the fixed effects design equal to u. For a model m ∈ M_u, X_m denotes the n × u sub-matrix of X corresponding to the columns indexed by m.

Definition 1. The vector μ is said to satisfy the weak sparsity property relative to X with sparsity s at rate k as n → ∞ if the set S_μ is nonempty.
Then, we let S ∈ S_μ denote any weakly sparse set for μ. We note that the usual high-dimensional setting of strong sparsity, where μ = X_S β_S for |S| = s, implies that μ is weakly sparse relative to X with sparsity s. Similar to other works on high-dimensional linear models and high-dimensional linear mixed models, we consider errors and random effects that are sub-Gaussian. Note that if ξ is sub-Gaussian with parameter K and A ∈ R^{a×n} is any deterministic matrix, then Aξ is also sub-Gaussian with parameter Kλ_max(A^T A). Finally, the asymptotic distributions of some of our estimators depend on the fourth moments of the underlying distributions. We write κ_ε ≜ var(ε_1²), ω_ε ≜ E(ε_1⁴), κ_ν ≜ var(ν_1²), and ω_ν ≜ E(ν_1⁴) when ν corresponds to a single random effect.

Hypotheses Testing for Random Effects
In this section, we consider the problem of inference for a collection of random effects. Consider the high-dimensional linear mixed model (1) and let Σ ≜ var(ν). We are interested in the hypothesis testing problem

H_0 : Σ = 0 versus H_1 : Σ ≠ 0. (2)

We propose two procedures in this section, depending on whether ε is Gaussian.

Model and Motivation
Suppose temporarily that we are in the low-dimensional Gaussian setting with s = p, p + q + d < n, μ = X_S β_S, ε ∼ N_n(0_n, σ²_ε I_n) for some positive constant σ²_ε > 0, and ν ∼ N_q(0_q, Σ) for some symmetric positive semi-definite matrix Σ. In this problem, the standard procedure for testing ν is the Wald F-test. Writing r_{Z⊖(X_S,W)} ≜ rank(P_{Z⊖(X_S,W)}) and r_{(X_S,Z,W)⊥} ≜ rank(P⊥_{(X_S,Z,W)}), the Wald F-test is defined as

F_ld ≜ [Y^T P_{Z⊖(X_S,W)} Y / r_{Z⊖(X_S,W)}] / [Y^T P⊥_{(X_S,Z,W)} Y / r_{(X_S,Z,W)⊥}]. (3)

Under the null hypothesis, the above statistic has an F_{r_{Z⊖(X_S,W)}, r_{(X_S,Z,W)⊥}} distribution. The main obstacle to directly using the Wald F-test in the high-dimensional setting is removing the contribution of the fixed effects. One possibility is to perform model selection, choosing the relevant covariates from X, and then use the Wald F-test. Chen et al. (2015) consider a similar problem in the growing-dimensional setting and use a SCAD-based approach for variable selection. As a consequence, they require p = o(√n). Instead, we leverage the fact that a projection onto a particular space is a regression onto a design whose columns span the same space.
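To make the low-dimensional construction concrete, the following NumPy sketch computes a Wald-type F-ratio of the form in (3) for a simulated balanced one-way design with a few fixed covariates and no nuisance effects W; the simulated design, variable names, and dimensions are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
q, m = 8, 5                 # hypothetical: 8 subjects, 5 observations each
n = q * m
Z = np.kron(np.eye(q), np.ones((m, 1)))   # balanced one-way random-effects design
X = rng.standard_normal((n, 3))           # low-dimensional fixed-effects design

def proj(A):
    Q, _ = np.linalg.qr(A)
    return Q @ Q.T

I = np.eye(n)
# Numerator projection P_{Z ⊖ X_S} (W is empty here) and denominator P⊥_{(X_S, Z)}.
P_num = proj((I - proj(X)) @ Z)
P_den = I - proj(np.hstack([X, Z]))
r_num = round(np.trace(P_num))
r_den = round(np.trace(P_den))

# Simulate under H0 (Sigma = 0): Y = X beta + eps with Gaussian errors,
# in which case F_ld ~ F_{r_num, r_den}.
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)
F_ld = (Y @ P_num @ Y / r_num) / (Y @ P_den @ Y / r_den)

assert (r_num, r_den) == (q, n - q - 3)   # degrees of freedom as in the text
assert F_ld > 0
```

The ranks r_num and r_den recover the degrees of freedom of the reference F distribution; the quantile comparison itself would use, e.g., scipy.stats.f.ppf.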
When the numerator and the denominator of the Wald F-statistic are expanded, both projection matrices involve projecting onto the orthogonal complement of W, which may still be achieved in the high-dimensional problem since W is a low-dimensional matrix.
Thus, we may find two projection matrices, P_{Z⊖W} and P⊥_{(Z,W)}, on which the numerator and denominator are based. If P_{Z⊖W} X were low-dimensional, obtaining the projection of P_{Z⊖W} Y onto the orthogonal complement of P_{Z⊖W} X would be equivalent to finding the residuals of P_{Z⊖W} Y using the covariates P_{Z⊖W} X; this yields P_{Z⊖W} Y − P_{Z⊖W} X β̂, where β̂ is the least-squares estimator for β. The same holds for P⊥_{(Z,W)} X and P⊥_{(Z,W)} Y. Hence, this recasts the problem as one of high-dimensional prediction, for which many procedures have been suggested to estimate P_{Z⊖W} Xβ and P⊥_{(Z,W)} Xβ, such as the lasso and exponential weighting (see Tibshirani 1996 and Leung and Barron 2006, respectively). Therefore, we propose using a plug-in estimator for P_{Z⊖W} Xβ and P⊥_{(Z,W)} Xβ based on exponential weighting of all models of a particular size, and then considering the resulting residuals. Since we view the fixed effects as nuisance parameters, we use exponential weighting instead of the lasso because exponential weighting does not require any assumptions on the design matrix X. However, most of the theory developed also applies to other plug-in estimators, albeit with simple modifications and much stronger conditions. This idea, under some mild assumptions, provides an asymptotic F-test.
However, there are two asymptotic regimes for the random effects: (i) the number of random effects increases to infinity and (ii) the number of random effects stays bounded.These two settings require slightly different analyses, so we consider separate exponential weighting estimators for the two cases.
Besides providing an asymptotic F distribution when ε is Gaussian, the F-ratio in Equation (3) simultaneously removes the scaling effect of σ²_ε. When ε is known only to be sub-Gaussian, the ratio no longer follows an F-distribution. However, after appropriate rescaling, we may still achieve the ancillarity property relative to σ²_ε by looking at the difference instead of the ratio. This approach, under slightly stronger sparsity assumptions, leads to an asymptotic z-test with only a sub-Gaussian assumption on the error distribution.

Estimator
In the setting where the number of random effects increases to infinity, instead of estimating P_{Z⊖W} Xβ and P⊥_{(Z,W)} Xβ separately, we estimate P⊥_W Xβ and then project the resulting vector using P_{Z⊖W} and P⊥_{(Z,W)}, respectively. In addition to saving computational time by using exponential weighting only once, this allows us to leverage a larger sample size when estimating the mean vector. To apply exponential weighting, we fix a sequence of sparsities u = u_n. Let β̂_m denote the least-squares estimator of β using the model m ∈ M_u with covariates P⊥_W X_m. Let K_{Zν+ε} denote the sub-Gaussian parameter of Zν + ε. We define the exponential weights w_m, where α > 4K_{Zν+ε}. Then, the estimator of β is given by β̂_EW ≜ Σ_{m∈M_u} w_m β̂_m.
Note that the bound on α ensures that P⊥_W X β̂_EW is a consistent estimator of P⊥_W Xβ. In the case where both ν and ε are Gaussian, the above bound on α becomes α > 4(σ²_ε + λ_max(ZΣZ^T)). Then, we estimate P_{Z⊖W} Xβ and P⊥_{(Z,W)} Xβ by P_{Z⊖W} X β̂_EW and P⊥_{(Z,W)} X β̂_EW, respectively, and form the corresponding F-statistic F_EW from the resulting residuals. Similar to the Wald F-statistic, we reject the null hypothesis for large values of F_EW. In particular, for a value δ ∈ (0, 1), let F_{a,b,δ} denote the upper δ quantile of the F_{a,b} distribution; we then consider tests that reject when F_EW exceeds this quantile. For the second setting, where the number of random effects stays bounded, we estimate the numerator differently. Let U_{(Z,W)⊥} ∈ R^{n×r_{(Z,W)⊥}} be any orthogonal matrix whose columns span the orthogonal complement of the column space of (Z, W); for example, the matrix U_{(Z,W)⊥} may be computed from the spectral decomposition of P⊥_{(Z,W)}. Let Ỹ ≜ U^T_{(Z,W)⊥} Y, and let Ỹ^(1) and Ỹ^(2) be a partition of Ỹ into two halves. We similarly define X̃^(1) and X̃^(2). Then, letting β̃^(1)_m (respectively, β̃^(2)_m) denote the least-squares estimator of β using the model m ∈ M_u with covariates X̃^(1)_m (respectively, X̃^(2)_m), the exponential weights w̃^(1)_m are defined analogously, and similarly for w̃^(2)_m, where α is delineated in Theorem 4 below. The estimator of β and the corresponding F-statistic are then constructed in the same manner. At first sight, computation of these estimators may seem prohibitive since we need to aggregate over (p choose u) models. However, they may be well approximated by an MCMC algorithm given by Law and Ritov (2021), to which we refer the interested reader.
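The exponential weighting scheme can be sketched as follows. This toy implementation enumerates all models of size u exactly, which is feasible only for small p (the text notes that in practice the sum over (p choose u) models is approximated by MCMC). The weight formula w_m ∝ exp(−RSS_m/α) follows the Leung–Barron style of aggregation invoked above; the data, the value α = 4.0, and all dimensions are made up for illustration:

```python
import numpy as np
from itertools import combinations

def beta_ew(X, Y, u, alpha):
    """Exponential-weighting estimator aggregating least squares over all
    models of size u, with weights w_m proportional to exp(-RSS_m / alpha)."""
    n, p = X.shape
    betas, log_w = [], []
    for m in combinations(range(p), u):
        Xm = X[:, m]
        bm, *_ = np.linalg.lstsq(Xm, Y, rcond=None)
        rss = np.sum((Y - Xm @ bm) ** 2)
        full = np.zeros(p)
        full[list(m)] = bm          # embed the sub-model estimate in R^p
        betas.append(full)
        log_w.append(-rss / alpha)
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())  # stabilize before normalizing
    w /= w.sum()
    return np.asarray(betas).T @ w

rng = np.random.default_rng(2)
n, p, u = 50, 8, 2
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:2] = [3.0, -2.0]               # strongly sparse truth
Y = X @ beta + 0.5 * rng.standard_normal(n)

b_hat = beta_ew(X, Y, u, alpha=4.0)
assert np.sum((b_hat - beta) ** 2) < 0.5   # weights concentrate on the true model
```

With a strong signal, the weights concentrate almost entirely on the correct size-u model, so the aggregate is close to the oracle least-squares fit.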
In the setting where ε is not Gaussian, we consider a z-statistic, z_EW, based on the difference rather than the ratio. Under proper scaling, the statistic z_EW has an asymptotic Gaussian distribution under the null hypothesis, with σ̂²_{ς,z} a consistent estimator of σ²_{ς,z}. The quantity σ²_{ς,z} is the scaling factor that ensures a central limit theorem for z_EW. Then, letting z_δ denote the upper δ quantile of the standard Gaussian distribution, we consider tests that reject when the standardized statistic exceeds z_δ. A general discussion regarding σ̂²_{ς,z} is deferred to Section 3.4. When ε is not Gaussian, we only consider the setting where the number of random effects increases to infinity, since the analysis of z_EW relies on a central limit theorem for quadratic forms.
As mentioned in Section 2.1, under appropriate conditions, we may also use the lasso instead of exponential weighting. For a suitable choice of λ > 0, we define the lasso estimator of β and form the corresponding F-statistic analogously.

Assumptions
In this section, we make the following assumptions.
(A5) The mean vector μ = μ_n is weakly sparse relative to X with sparsity s_n, with weak sparsity defined in Definition 1. Furthermore, the statistician chooses a sequence of sparsities u_n such that u_n ≥ s_n for n sufficiently large and u_n = o(n^τ / log(p)) for some τ ∈ [1/2, 1]. Moreover, the numbers of observations in the reduced models, r_{Z⊖W} and r_{(Z,W)⊥}, satisfy r_{Z⊖W} ≍ r_{(Z,W)⊥} ≍ n.

(A6) The mean vector μ = μ_n satisfies μ = Xβ with ‖β‖_0 = s_n. Furthermore, the statistician chooses a sequence of sparsities u_n such that u_n ≥ s_n for n sufficiently large and u_n = o(n^τ / log(p)) for some τ ∈ [1/2, 1]. Moreover, the rows of X are independent and identically distributed.

Remark 1. Assumptions (A1) and (A2) are standard assumptions in high-dimensional linear models. Calling ε the noise and Zν + ε the random component, assumption (A1) is a scaling assumption ensuring that the ratio of the fixed components to the random components remains bounded asymptotically; it is analogous to the assumptions of Bradic, Claeskens, and Gueuning (2017), who assume that the population covariance matrix of X has bounded maximal eigenvalue. Next, (A2) is used for consistency of the prediction procedure under the null hypothesis, and assumption (A3) allows for concentration of the prediction procedure under the alternative hypothesis. Both assumptions are used by various authors, such as Bradic, Claeskens, and Gueuning (2017) and Cai and Guo (2017). We note that (A2) and (A3) are implied by (A2*) and (A3*), respectively, but the additional Gaussian distribution assumption allows us to relate our methodology to the vast literature on low-dimensional Gaussian mixed models.
Next, the first part of assumption (A4), like (A1), ensures that the ratio of the fixed components to the random components remains bounded under the alternative hypothesis. To elucidate this point, consider the Gaussian setting with ε ∼ N_n(0_n, σ²_ε I_n) and ν ∼ N_q(0_q, σ²_ν I_q). Then, assumption (A4) bounds the variance of the noise. Moreover, (A3) and (A4) together imply that Zν is sub-Gaussian with parameter K_ν λ_max(ZZ^T). This requirement is similar to Condition 1 of Bradic, Claeskens, and Gueuning (2017) and Condition 3.1 of Cai and Guo (2017). The assumption that (Z, W) is independent of X is common in the literature (see the discussion before Condition 3.2 of Cai and Guo 2017).
The following two assumptions, (A5) and (A6), concern the sparsity of the fixed effects. The two assumptions consider different asymptotic regimes for the random effects; (A5) assumes that the number of random effects increases to infinity, while (A6) allows the number of random effects to stay bounded. The first part of both (A5) and (A6) is a sparsity assumption commonly found in the high-dimensional linear models literature, which is discussed further in Remark 2 below. Note that since the selected sequence of sparsities u_n satisfies u_n = o(n^τ / log(p)), the true sequence of sparsities s_n also satisfies the same requirement.
The second half of (A5) is an assumption on the component of the design for the random effects, requiring the number of realizations of the random effects to increase to infinity. The requirement that r_{Z⊖W} ≍ r_{(Z,W)⊥} ≍ n is for convenience and can be weakened to min(r_{Z⊖W}, r_{(Z,W)⊥}) → ∞ if the sparsity requirement is accordingly relaxed to u_n = o(min(r_{Z⊖W}, r_{(Z,W)⊥})^τ / log(p)). The second half of (A6) is a technical requirement to ensure consistency of exponential aggregation for out-of-sample predictions. Since the number of random effects remains bounded, the regression of P_{Z⊖W} Y on P_{Z⊖W} X does not necessarily yield a consistent estimator of P_{Z⊖W} μ for arbitrary designs. With the Gaussian assumption, we may estimate β by regressing P⊥_{(Z,W)} Y on P⊥_{(Z,W)} X and obtain a consistent estimator of P_{Z⊖W} μ. Again, the requirement that r_{(Z,W)⊥} ≍ n can be weakened to r_{(Z,W)⊥} → ∞ by adjusting the sparsity requirement to u_n = o(r_{(Z,W)⊥}^τ / log(p)).
Assumption (A7) is a mild assumption on the distribution of ε to ensure a central limit theorem. For example, (A7) is satisfied by the Gaussian distribution. Note that no assumption is necessary on γ, as the nuisance parameters are projected out in the first stage.
Example 1 (Balanced one-way ANOVA). As an example of a design satisfying the above assumptions on Z, consider a balanced one-way ANOVA design with q subjects, m observations per subject, and n = mq total observations. In this setting, there are no nuisance random effects, so d = 0. Assume further that the number of observations per subject remains bounded (i.e., m = O(1)), which is commonly satisfied in practice. Then, the matrix Z may be represented as Z = I_q ⊗ 1_m. It is immediate that r_{Z⊖W} = q and r_{(Z,W)⊥} = (m − 1)q, implying that the second half of (A5) is satisfied. Finally, assumption (A4) is satisfied since λ_max(ZZ^T) = λ_max(mI_q) = m.
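The claims in this example are easy to verify numerically. The following sketch constructs Z = I_q ⊗ 1_m and checks the stated ranks and eigenvalue, using illustrative values of q and m:

```python
import numpy as np

q, m = 6, 4
n = q * m
Z = np.kron(np.eye(q), np.ones((m, 1)))   # Z = I_q ⊗ 1_m, shape (n, q)

# With d = 0 (no nuisance effects), P_{Z ⊖ W} = P_Z and P⊥_{(Z,W)} = I − P_Z.
Q_, _ = np.linalg.qr(Z)
P_Z = Q_ @ Q_.T

r_Z = round(np.trace(P_Z))                 # trace of a projection = its rank
r_perp = round(np.trace(np.eye(n) - P_Z))

assert Z.shape == (n, q)
assert r_Z == q                            # r_{Z ⊖ W} = q
assert r_perp == (m - 1) * q               # r_{(Z,W)⊥} = (m − 1) q
assert np.isclose(np.linalg.eigvalsh(Z @ Z.T).max(), m)  # λ_max(ZZᵀ) = m
```

Here ZZ^T is block diagonal with q blocks of m×m all-ones matrices, whose nonzero eigenvalue is m, confirming the bound used in (A4).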

Main Results
Since F_EW is motivated by the classical F-statistic F_ld, the following theorem shows that, up to a small bias term depending on the sparsity, the two statistics are asymptotically equivalent.
Theorem 1. Consider the model given in Equation (1) and the hypothesis testing problem from Equation (2). Assume (A1), (A2*), (A3*), (A4), and (A5). If α ≥ 4(σ²_ε + λ_max(ZΣZ^T)), then F_EW = F_ld + o_P(n^{τ−1}).

As mentioned in Section 2.1, under the null hypothesis, the statistic F_ld ∼ F_{r_{Z⊖(X_S,W)}, r_{(X_S,Z,W)⊥}}. However, since the weakly sparse set S is unknown, the values of r_{Z⊖(X_S,W)} and r_{(X_S,Z,W)⊥} cannot be determined in practice. From assumption (A5), since s = o(n^τ / log(p)), we have F_{r_{Z⊖(X_S,W)}, r_{(X_S,Z,W)⊥}} = F_{r_{Z⊖W}, r_{(Z,W)⊥}} + o_P(1). Thus, the statistic F_EW can also be compared to the reference distribution F_{r_{Z⊖W}, r_{(Z,W)⊥}}.
Despite being asymptotically equivalent to the Wald F-test, F_EW has an additional bias term of o_P(n^{τ−1}), which impacts the power of the testing procedure. This leads us to consider the following hypothesis testing problem: for any τ ∈ [1/2, 1], we consider a sequence of contiguous hypotheses in which the alternative shrinks toward the null at the rate n^{τ−1}. (4)

Example 2. Consider the setting of Example 1 with ν corresponding to a single random effect and ν ∼ N_q(0_q, σ²_ν I_q). Then, with τ = 1/2, the above hypotheses become a standard hypothesis testing problem, such as in the balanced one-way random effects model. In this model, in the low-dimensional setting, the rate of √n is optimal.
Remark 2. The above theorem implies that F_EW can distinguish the hypotheses at the classical parametric √n rate if the model is in the ultra-sparse regime, s = o(√n / log(p)). This sparsity rate is common in high-dimensional inference problems for low-dimensional parameters at the parametric rate; in particular, for high-dimensional linear models, a version of this rate is required (cf. Cai and Guo 2017; Javanmard and Montanari 2018). When τ ∈ (1/2, 1], we are limited by the ability to remove the bias from the mean vector; in the setting where τ = 1/2, we are limited by the noise level. This suggests a trade-off between the sparsity and the achievable rate of separation. This comparison with the linear models literature, namely that the inferential procedure requires a sparsity assumption stronger by a factor of √n, appears to be consistent with the recent results of Li, Cai, and Li (2021). In particular, their proposed estimator for the variance components requires a consistent estimator of β. They show in Theorem 3.1 that the minimax rate for estimating β is s log(p/s²)/tr(Σ_a^{−1}), where Σ_a ∈ R^{n×n} is a proxy for the true covariance matrix of Y. Thus, this suggests that tr(Σ_a^{−1}) ≍ n, and they require s log(p)/n → 0 to consistently estimate the variance components.
Remark 3. Compared to the recent work of Li, Cai, and Li (2021), who only provide an asymptotic distribution for their variance components estimators, Theorem 1 also demonstrates that F_EW enjoys certain optimality properties. In addition to providing a distribution under the null hypothesis, Theorem 1 demonstrates that, under a sparsity assumption, F_EW is asymptotically equivalent to the classical Wald F-test, which is known to enjoy certain optimality properties, such as being uniformly most powerful unbiased and uniformly most powerful invariant unbiased in certain ANOVA models (cf. Mathew and Sinha 1988). In addition, Lu and Zhang (2010) showed that the Wald F-test and likelihood ratio tests are equivalent for balanced one-way ANOVA models, while Qeadan and Christensen (2020) showed that the Wald F-test renders the likelihood ratio test inadmissible in generalized split-plot designs. Moreover, unlike Li, Cai, and Li (2021), who assume a compatibility condition, our procedure imposes no such requirement on the design matrix X.
We now turn our attention to the setting of sub-Gaussian errors. When τ > 1/2, z_EW no longer has an asymptotic Gaussian distribution at the √n rate since the variance dominates the signal. Therefore, in this setting, we only consider hypothesis testing problems as given in Equation (4) with τ = 1/2.

Theorem 3. Consider the model given by Equation (1) and the hypothesis testing problem from Equation (4). Assume further (A1), (A2), (A3), (A5) for τ ≤ 1/2, and (A7). Under the null hypothesis, if α ≥ 4K_ε, then z_EW, suitably standardized, is asymptotically standard Gaussian.

Remark 4. Compared to Theorem 1, Theorem 3 trades the Gaussian assumption for a sub-Gaussian assumption, under a slightly stronger sparsity assumption, in order to obtain an asymptotic distribution. From Theorem 2, F_EW exhibits a continuous trade-off between sparsity and power, which does not hold for z_EW. This is a consequence of using a central limit theorem for z_EW, which requires scaling by √n. This implies that the bias should be o(√n) and the signal from the alternative should be of order at least n^{−1/2}.
Finally, we end this section by considering the setting where the number of random effects remains bounded.

Model and Motivation
In the previous section, we considered the problem of testing a collection of random effects. However, it is often of interest to construct confidence intervals for the variance of a particular random effect. Suppose that Σ = σ²_ν I_v. In the low-dimensional setting, many procedures have been suggested to construct confidence intervals, from likelihood-based approaches to F-test inversions (see, e.g., Jiang (2007) for a nonexhaustive list). In this section, we deal with a confidence interval for a single variance component, which can easily be extended using a Bonferroni correction or similar procedures for simultaneous confidence intervals. Alternatively, we may also invert the F-statistic from Section 2 to obtain confidence intervals for parameters of the form σ²_ν / σ²_ε, with such ratios first studied by Hartley and Rao (1967).
Our high-dimensional approach is inspired by F-test inversion. However, instead of using the ratio, we again use the difference. Define the matrix Q as the corresponding difference of scaled projection matrices. Then, expanding the statistic z_EW from Section 2, we obtain a quadratic form in ξ, where the second equality follows from Lemma S1 in the supplementary material. A direct calculation shows that E[ξ^T Q ξ] = σ²_ν tr(Z^T P_{Z⊖W} Z). Then, with proper centering and scaling, we may apply a central limit theorem for quadratic forms under a mild condition on the matrix Q.

Estimator
To estimate σ²_ν, we consider the estimator σ̂²_ν. By a direct calculation, it can be shown that the asymptotic distribution of σ̂²_ν depends on the second and fourth moments of ν and ε. To estimate the second moment of ε, we consider a residual-based estimator σ̂²_ε. The problem of estimating fourth moments requires some technical assumptions on the design, even in the low-dimensional setting. For simplicity, we only consider the setting of Gaussian mixed models and the balanced one-way ANOVA design, but we note that the arguments may be extended under suitable regularity conditions on the design matrices Z and W. In the setting of Gaussian mixed models, the fourth moment is entirely determined by the second moment. For the setting of the balanced one-way ANOVA design with m observations per subject, we consider a separate estimator. In both settings, we obtain a plug-in estimator σ̂²_ς of σ²_ς. By setting κ̂_ν = 0 and σ̂²_ν = 0, we obtain an estimator σ̂²_{ς,z} of σ²_{ς,z} for Section 2.
Remark 5. The statistic σ̂²_ν is related to the classical analysis of variance method for estimating random effects. Consider the setting of a balanced one-way ANOVA model from Example 1 with μ = 0_n. The analysis of variance estimate is the usual method-of-moments estimator based on the between-group and within-group mean squares. Now, note that r_{Z⊖W} = q, r_{(Z,W)⊥} = (m − 1)q, n = mq, and tr(Z^T P_Z Z) = tr(Z^T Z) = mq. From the calculations in Section 3.1, the two statistics are asymptotically equivalent in the balanced one-way ANOVA setting.
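The classical analysis of variance estimator referenced here is the method-of-moments estimator σ̂²_ν = (MSB − MSW)/m. With μ = 0_n, its unbiasedness can be checked deterministically via the quadratic-form identity E[Y^T A Y] = tr(A Cov(Y)) for mean-zero Y; the following sketch does exactly that (the variance values are illustrative):

```python
import numpy as np

q, m = 5, 3
n = q * m
sigma_nu2, sigma_eps2 = 2.0, 1.5   # made-up variance components

Z = np.kron(np.eye(q), np.ones((m, 1)))
Qz, _ = np.linalg.qr(Z)
P_Z = Qz @ Qz.T
P_perp = np.eye(n) - P_Z

# Covariance of Y = Z nu + eps when mu = 0_n:
cov_Y = sigma_nu2 * Z @ Z.T + sigma_eps2 * np.eye(n)

# E[Y' A Y] = tr(A cov_Y) for mean-zero Y; MSB uses P_Z / q and
# MSW uses P_perp / ((m - 1) q).
E_MSB = np.trace(P_Z @ cov_Y) / q
E_MSW = np.trace(P_perp @ cov_Y) / ((m - 1) * q)

# E[MSB] = sigma_eps^2 + m sigma_nu^2 and E[MSW] = sigma_eps^2,
# so (MSB - MSW) / m is unbiased for sigma_nu^2.
assert np.isclose(E_MSB, sigma_eps2 + m * sigma_nu2)
assert np.isclose(E_MSW, sigma_eps2)
assert np.isclose((E_MSB - E_MSW) / m, sigma_nu2)
```

The same identity underlies the moment calculation E[ξ^T Q ξ] = σ²_ν tr(Z^T P_{Z⊖W} Z) used in Section 3.1.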

Assumptions
In addition to the assumptions from Section 2, we need additional assumptions on the matrix Q and on the distribution of the random effects ν.
Remark 6. Assumptions (B1) and (B2), along with (A7), are used for a central limit theorem for quadratic forms. For a thorough discussion of these assumptions, we refer the interested reader to Section 5 of Jiang (1996). As a consequence of using a central limit theorem, we require that the number of random effects increases to infinity. Thus, we only consider the sparsity assumption (A5) in this section.

Main Results
We start by stating the asymptotic distribution of σ̂²_ν.

Empirical Bayes in ANOVA Type Models
The motivating example for this problem framework is the Rasch model, originally proposed by Rasch (1960). The model that we consider differs from the classical Rasch model in that we have Gaussian responses as opposed to binary responses. Our interest in this section is not in testing whether the variance of the random effect is different from zero but, assuming that it is, in estimating the individual components of the random effect. We use the term empirical Bayes, or compound decision, in the sense of Efron (2019) and the references therein (specifically Greenshtein and Ritov 2019).
As an example of this model, the data that we consider in Section 6 come from the Trends in International Mathematics and Science Study (TIMSS), an international study conducted every four years to measure fourth and eighth grade student achievement in mathematics and science. We only consider data from the year 2015. Polities randomly sample a collection of nationally representative schools to take standardized examinations in both mathematics and science, with questions being either multiple choice or constructed response. Each student within a school takes only a subset of the questions on the exams, but all questions are answered by some students in each school. In addition to recording student responses, the data also contain background covariates for schools. Martin, Mullis, and Hooper (2016) provide a more detailed description of the methods and procedures employed by TIMSS, and more general information about TIMSS is available in Mullis, Martin, and Loveless (2016).
For our analysis, we only consider multiple choice questions and analyze at the level of schools rather than students. To construct a response variable for each school, we compute the proportion of questions answered correctly by students in that school. Note that, unlike the classical Rasch model, we assume a linear model and, for all schools, we have answers for all questions. Thus, by a central limit theorem, our response Y is approximately Gaussian. The fixed effects design X includes the background covariates for the school, and the random effects design Z is an indicator for the polity, with ν corresponding to the unobserved variability of the polities. In this example, since we have averaged over questions, we do not have any nuisance random effects. The problem that we consider in this section is ranking the polities based on mathematical ability and estimating the average number of questions that any particular polity will answer correctly. That is, we would like to estimate μ + Zν for all polities in our dataset.

Model and Motivation
The general problem framework that we consider is that of K-factor ANOVA models. However, we derive the results in the setting K = 2; that is, we consider the model with two random factors, as in Equation (1). We do not assume that the design is fully crossed in the random effects. The goal is to estimate a subset of the mean vector, η ≜ μ + Zν, since we view the random effects γ as nuisance. However, as the sample size increases, the number of observations per group stays bounded. In the context of the motivating data example, each school still only answers a finite number of questions as we increase the sample size. A standard approach in the low-dimensional setting would be to use an empirical Bayes estimator by placing a Gaussian prior on both ν and γ (see, e.g., Brown, Mukherjee, and Weinstein 2018), which transforms the problem into a standard high-dimensional linear mixed model. Therefore, we use N_v(0_v, σ²_ν I_v) and N_r(0_r, σ²_γ I_r) priors on ν and γ, respectively. Since we need to estimate both σ²_ν and σ²_γ for the prior, our estimator for σ²_γ is analogous to σ̂²_ν from Section 3. To this end, we need an additional projection matrix P_{W⊖Z}.

Estimator
Since we are also interested in estimating P_{W⊖Z} Xβ, we define β̃_EW to be the exponentially weighted estimator using the covariates X, as opposed to the covariates P⊥_W X used for β̂_EW. Then, analogous to Section 2.2, let β̃_m denote the least-squares estimator of β using the model m ∈ M_u with covariates X_m, and let K_{Zν+Wγ+ε} be the sub-Gaussian parameter of Zν + Wγ + ε. For α > 4K_{Zν+Wγ+ε}, defining the exponential weights w̃_m analogously, we have β̃_EW ≜ Σ_{m∈M_u} w̃_m β̃_m.
For convenience, we write μ̂_EW := X β̂_EW. Now, the estimators for the variance are given by As we do not require an asymptotic distribution for σ̂²_ν and σ̂²_γ, under weaker assumptions than Theorem 5, we have that σ̂²_ν and σ̂²_γ are consistent estimators of σ²_ν and σ²_γ, respectively. This suggests the following empirical Bayes estimator for η, To compare our estimator, we consider an oracle that has access to μ, σ²_ν, σ²_γ, and σ²_ε. This oracle uses the Bayes estimator for η (see Lemma 6), given by
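The oracle Bayes estimator is standard Gaussian conditioning: with η = μ + Zν, Cov(η, Y) = σ²_ν ZZᵀ and Var(Y) = σ²_ν ZZᵀ + σ²_γ WWᵀ + σ²_ε I, so E[η | Y] = μ + σ²_ν ZZᵀ Var(Y)⁻¹ (Y − μ). A minimal sketch of this computation (function and variable names are ours, and the toy data are placeholders):

```python
import numpy as np

def oracle_bayes(Y, mu, Z, W, s2_nu, s2_gamma, s2_eps):
    """Posterior mean of eta = mu + Z nu under the Gaussian priors
    nu ~ N(0, s2_nu I) and gamma ~ N(0, s2_gamma I)."""
    n = len(Y)
    Sigma = s2_nu * Z @ Z.T + s2_gamma * W @ W.T + s2_eps * np.eye(n)
    # E[eta | Y] = mu + Cov(eta, Y) Sigma^{-1} (Y - mu)
    return mu + s2_nu * Z @ Z.T @ np.linalg.solve(Sigma, Y - mu)

rng = np.random.default_rng(2)
n, q, d = 30, 4, 3
Z = rng.standard_normal((n, q))
W = rng.standard_normal((n, d))
mu = rng.standard_normal(n)
nu = np.sqrt(2.0) * rng.standard_normal(q)
gamma = rng.standard_normal(d)
Y = mu + Z @ nu + W @ gamma + 0.5 * rng.standard_normal(n)

eta_hat = oracle_bayes(Y, mu, Z, W, s2_nu=2.0, s2_gamma=1.0, s2_eps=0.25)
```

The empirical Bayes estimator replaces (μ, σ²_ν, σ²_γ) with their estimates in this same formula.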

Assumptions
As previously mentioned, we do not need to establish the asymptotic distribution of σ̂²_ν; rather, we only need the estimator to be consistent. Accordingly, we may weaken our assumptions to the following: (C1) The designs Z and W satisfy tr(Zᵀ P_{ZW} Z) ≍ tr(Wᵀ P_{WZ} W) ≍ n. (C2) The matrix W is such that λ_max(WWᵀ) is bounded.
Remark 7. Assumption (C1) ensures that the component of the design for the random effects is sufficiently well balanced. This assumption, in the presence of (A4), implies the second half of (A5). Note that tr(Zᵀ P_{ZW} Z) ≤ λ_max(ZZᵀ) tr(P_{ZW}) = λ_max(ZZᵀ) r_{ZW}. Since r_{ZW} ≤ n, tr(Zᵀ P_{ZW} Z) ≍ n and λ_max(ZZᵀ) being bounded imply that r_{ZW} ≍ n.
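The trace bound in Remark 7 holds for any orthogonal projection P, since tr(Zᵀ P Z) = tr(P ZZᵀ P) ≤ λ_max(ZZᵀ) tr(P). A quick numerical check of this inequality, using a stand-in projection onto the column space of W rather than the paper's exact P_{ZW}:

```python
import numpy as np

rng = np.random.default_rng(3)
n, q, d = 20, 5, 3
Z = rng.standard_normal((n, q))
W = rng.standard_normal((n, d))

# Orthogonal projection onto col(W); any orthogonal projection works here.
P = W @ np.linalg.pinv(W)

lam_max = np.linalg.eigvalsh(Z @ Z.T).max()
lhs = np.trace(Z.T @ P @ Z)
rank_P = round(np.trace(P))  # for a projection, trace = rank

assert lhs <= lam_max * rank_P + 1e-8
```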

Main Results
We start this section by noting that σ̂²_ν, σ̂²_γ, and σ̂²_ε are all consistent estimators under a weaker sparsity assumption than in Section 3. Since we no longer require an asymptotic distribution for the variance estimates, we only need the prediction rate to ensure consistency, which is the content of the ensuing proposition. The following is a standard lemma regarding the empirical Bayes estimators in this problem setup, which we prove for the sake of completeness. Lemma 6. For a fixed vector μ ∈ Rⁿ and fixed values σ²_ν > 0, σ²_γ > 0, and σ²_ε > 0, the Bayes estimator of η is given by We conclude this section with the main result regarding η̂_EW: the empirical Bayes estimator asymptotically performs nearly as well as the oracle Bayes estimator η̂_oracle.
since it is harder to remove the contribution of the fixed effects. Finally, for empirical Bayes estimation, our methods are competitive with the oracle. However, we notice that exponential weighting outperforms the scaled lasso when s = 15, particularly when ρ = 0.8. Since larger values of ρ imply that the columns of X are more correlated, this highlights a salient feature of exponential weighting.

Real Data Application
Following the motivating example of Section 4, we consider the TIMSS dataset, which is freely available at https://timssandpirls.bc.edu/. To simplify our analysis, we only consider the mathematics questions. After filtering for complete cases on the background covariates, we are left with 146 questions, q = 43 unique polities, p = 106 covariates, and 6808 schools. Therefore, we have a total of n = 6808 responses after averaging over the students and questions within each school. Here, there are no nuisance random effects, so d = 0. Due to averaging over students within schools, we expect the distributions to be approximately Gaussian by a central limit theorem.
To demonstrate our methodology, we use both exponential weighting and the scaled lasso as our estimation procedures. When applying exponential weighting, we jointly tune the values of u and α using fourfold cross-validation. The high-dimensional F-test rejects the null hypothesis that σ²_ν = 0, and a 95% confidence interval for σ²_ν is (0.0021, 0.0056), which suggests that, even after controlling for school background characteristics, the polity of a school impacts mathematical ability. For the last part, we define a polity's background characteristics X to be the arithmetic average of the background characteristics of all schools within that polity. Then, applying the empirical Bayes procedure, we rank the polities based on the predicted number of questions they would answer correctly. The top five polities, in order, are South Korea, Singapore, Hong Kong, Chinese Taipei, and Japan. Up to some reordering, our results are mostly consistent with the report of Mullis et al. (2016) based on individual student data, which had the same top five polities. The scaled lasso produced the same ranking as exponential weighting and similar conclusions regarding σ²_ν.
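The polity-level aggregation and ranking step described above can be sketched as follows; the fitted quantities below are random placeholders standing in for the actual estimated coefficients and predicted random effects:

```python
import numpy as np

rng = np.random.default_rng(4)
n_schools, n_polities, p = 200, 6, 4
polity = np.arange(n_schools) % n_polities       # polity index of each school
X = rng.standard_normal((n_schools, p))          # school background covariates

# A polity's background characteristics: the average of its schools' covariates.
X_polity = np.vstack([X[polity == g].mean(axis=0) for g in range(n_polities)])

# Rank polities by a predicted score of the form mu_hat + Z nu_hat; here
# beta_hat and nu_hat are placeholders, not fitted values.
beta_hat = rng.standard_normal(p)
nu_hat = rng.standard_normal(n_polities)
scores = X_polity @ beta_hat + nu_hat
ranking = np.argsort(-scores)                    # highest predicted score first
```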

Proposition 1. Consider the balanced one-way ANOVA from Example 1. Under the assumptions of Theorem 5,