Homogeneity and Sparsity Analysis for High-Dimensional Panel Data Models

Abstract In this article, we are interested in detecting latent group structures and significant covariates in a high-dimensional panel data model with both individual and time fixed effects. The slope coefficients of the model are assumed to be subject-dependent, and there exist group structures where the slope coefficients are homogeneous within groups and heterogeneous between groups. We develop a penalized estimator that recovers the group structures and the sparsity patterns simultaneously, and we propose a new algorithm to optimize its objective function. Furthermore, we propose a strategy that reduces the computational complexity by pruning the penalty terms in the objective function, which also improves the accuracy of group structure detection. The proposed estimator recovers the latent group structures and the sparsity patterns consistently in large samples. Its finite sample performance is evaluated through Monte Carlo studies and illustrated with a real dataset.


Introduction
Panel data comprise repeated observations over time on subjects such as individuals, households, or firms. The structure of panel data enables researchers to control for unobserved heterogeneity arising from individual characteristics of the underlying subjects. Homogeneity versus heterogeneity of parameters is a recurrent and fundamental model specification issue in panel data analysis. In empirical studies, the homogeneity assumption on the slope parameters is frequently rejected; see Hsiao and Tahmiscioglu (1997) and Su, Shi, and Phillips (2016). Specifying and estimating models with slope heterogeneity is therefore crucial in panel data modeling.
The latent group structure model is a compromise between complete slope heterogeneity and complete homogeneity. In this model, subjects are classified into groups, and the slope parameters are homogeneous within groups and heterogeneous across groups. A popular approach employs the finite mixture model to capture the heterogeneity among subjects. See Shen and He (2015) for mixture models based on logistic regression, and Kasahara and Shimotsu (2009) for discrete choice models based on nonparametric discrete mixture distributions. The finite mixture approach depends on the correct specification of the distribution of the observed data and the number of groups; the results may be misleading if these assumptions are violated.
Penalization is another technique for group structure detection. Su, Shi, and Phillips (2016) and Huang, Jin, and Su (2020) proposed the classifier-Lasso for recovering unknown group structures in linear and nonlinear panel data models. Wang, Phillips, and Su (2018) extended the CARDS procedure of Ke, Fan, and Wu (2015) to group detection in panel data analysis. Ma and Huang (2017) and Lu et al. (2021) proposed a pairwise fusion approach to detect group structures in cross-sectional data. Besides penalization, Bonhomme and Manresa (2015) proposed methods based on clustering techniques rooted in the K-means algorithm. This method has been extended to quantile regression models (Zhang, Wang, and Zhu 2019; Leng, Chen, and Wang 2021), general nonlinear models (Wang and Su 2021), and models with structural breaks (Lumsdaine, Okui, and Wang 2022). Last but not least, Ke, Li, and Zhang (2016) proposed an interesting two-step method, where parameters are estimated subject-wise in the first step and clustered via binary segmentation in the second step. Extensions include Guo and Li (2020), who considered factor models, and Yu, Gu, and Volgushev (2022), who considered uncertainty quantification in the first step.
In this article, we study a latent group structure model with high-dimensional covariates. High-dimensional covariates pose challenges for both theoretical analysis and real data implementation. On the theoretical side, we face the problem of recovering low-dimensional structures in a high-dimensional space. On the computational side, we face the issue of developing efficient algorithms for optimization problems involving high-dimensional sparse vectors.
We develop a doubly penalized least squares estimator to detect the latent group structure and the significant covariates simultaneously. The number of groups and the group membership are recovered by penalties on the pairwise distances among the slope parameters, and significant covariates are detected by penalties on the regression coefficients. We prove that the proposed estimator consistently recovers the latent group structure and the significant covariates in large samples. We implement the proposed estimator by an alternating direction method of multipliers (ADMM) algorithm (Boyd et al. 2011) coupled with the coordinate descent (CD) algorithm (Friedman, Hastie, and Tibshirani 2007; Breheny and Huang 2011). The CD algorithm exploits the sparsity structure of the regression coefficients and is efficient in solving penalized regression problems. We combine the CD algorithm with the ADMM algorithm to speed up computation for large datasets.
To save computational time and increase the accuracy of group detection, we propose a strategy to prune the pairwise penalties in the spirit of Ke, Fan, and Wu (2015). The pairwise penalty shrinks the pairwise distances among slope coefficients so that subjects can be classified into groups; pairs of coefficients from different groups are redundant in the penalty. The pruning strategy relies on a preliminary estimator for the slope coefficients, namely simple marginal estimators, and we do not penalize pairs of slope coefficients whose preliminary estimates are farther apart than a threshold. We analyze the preliminary estimator in large samples and propose a rule for selecting the threshold in practice. The simulation study in Section 5 demonstrates that this strategy saves computational time and improves the accuracy of group detection.
The proposed estimator is related to CARDS (Ke, Fan, and Wu 2015) and its variants (Wang, Phillips, and Su 2018). In essence, Ke, Fan, and Wu (2015) and Wang, Phillips, and Su (2018) prune the pairwise penalties based on a preliminary estimate. We prune the pairwise penalties based on thresholding, which is conceptually simple and straightforward. Compared to Su, Shi, and Phillips (2016), we do not need prior knowledge of the number of groups. The nonconcave penalties we use yield nearly unbiased estimates, while the LASSO introduces bias into the estimator (Zhang and Huang 2008). Theoretically, we adopt the functional dependence framework of Wu (2005), which is flexible enough to include a large class of stationary processes, for example, ARMA, ARCH, and nonlinear time series. The conditions on the dependence structure are verifiable, in contrast with the mixing framework of Su, Shi, and Phillips (2016) and Wang, Phillips, and Su (2018). Furthermore, our estimator can eliminate covariates that are irrelevant to group structure detection and thus improve detection accuracy, as demonstrated in the simulation study. Lastly, the estimators of Su, Shi, and Phillips (2016) and Wang, Phillips, and Su (2018) are not available for high-dimensional models.
The rest of the article is organized as follows. In Section 2, we describe the model and the proposed estimator. We propose an algorithm to implement the estimator in Section 3. We investigate large sample properties in Section 4. Sections 5 and 6 illustrate the finite sample performance using simulation studies and a real dataset, respectively. Finally, we summarize and conclude in Section 7.
We introduce some notation here. For a $q \times p$ matrix $A$ and a column index set $J \subset \{1, \ldots, p\}$, we denote $A_{[:,J]}$ as the sub-matrix of $A$ containing all the columns in $J$; this definition applies to rows of a matrix and to vectors as well. We denote $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ as the minimum and maximum eigenvalues of a matrix $A$, respectively. For a vector $\beta = (\beta_1, \ldots, \beta_p)^\top$, we write $\|\beta\|_q = (\sum_{i=1}^p |\beta_i|^q)^{1/q}$ for some $q \geq 1$. In particular, we denote the Euclidean norm as $\|\beta\| = \|\beta\|_2$ and the infinity norm as $\|\beta\|_\infty = \max_{i=1,\ldots,p} |\beta_i|$. For two sequences $a_n$ and $b_n$, we write $a_n \ll b_n$ if $\lim_{n\to\infty} a_n/b_n = 0$.

The Group Structure Model and Estimation Method
Let $\{\tilde{y}_{it}, \tilde{x}_{it},\ i = 1, \ldots, N;\ t = 1, \ldots, T\}$ be the observations, where $\tilde{y}_{it} \in \mathbb{R}$ is a one-dimensional response and $\tilde{x}_{it} \in \mathbb{R}^p$ is a $p$-dimensional covariate. The dimension $p$ of the predictor $\tilde{x}_{it}$ might be large. We consider the following panel linear regression model with individual and time fixed effects,
$$\tilde{y}_{it} = \tilde{x}_{it}^\top \beta_i^* + \mu_i + \vartheta_t + \tilde{\varepsilon}_{it}, \quad (1)$$
where $\beta_i^*$ is the true subject-specific coefficient vector, and $\beta^*_{i,j}$ denotes the $j$th element of $\beta_i^*$. There are two commonly adopted assumptions on the regression coefficients $\beta^*$. The pooled model assumes $\beta_1^* = \cdots = \beta_N^*$, which ignores possible heterogeneity of the subjects. The completely heterogeneous model assumes that the $\beta_i^*$ differ across subjects, which is inefficient and impractical for short panels. We consider the latent group structure model; that is, we assume that there exists a partition of $\{1, \ldots, N\}$ into $K$ subsets $\mathcal{G} = \{G_1, \ldots, G_K\}$ such that the regression coefficients are the same within each group. That is,
$$\beta_i^* = \alpha_k^* \quad \text{for all } i \in G_k,\ k = 1, \ldots, K,$$
where $\alpha_k^*$ is the true coefficient vector for the $k$th group. Both the number of groups and the group membership are unknown, and we are interested in recovering the group structure.
Since the fixed effects $\mu_i$ and $\vartheta_t$ are not of primary interest, we apply the within transformation to remove them. Some algebra leads to
$$y_{it} = x_{it}^\top \beta_i^* - N^{-1} \sum_{j=1}^N x_{jt}^\top \beta_j^* + \varepsilon_{it},$$
where $y_{it} = \tilde{y}_{it} - \tilde{y}_{i.} - \tilde{y}_{.t} + \tilde{y}_{..}$ with $\tilde{y}_{i.} = T^{-1}\sum_{t=1}^T \tilde{y}_{it}$, $\tilde{y}_{.t} = N^{-1}\sum_{i=1}^N \tilde{y}_{it}$, and $\tilde{y}_{..} = (NT)^{-1}\sum_{i=1}^N \sum_{t=1}^T \tilde{y}_{it}$; the notations $\tilde{\varepsilon}_{i.}$, $\tilde{\varepsilon}_{.t}$, $\tilde{\varepsilon}_{..}$, and $\tilde{x}_{i.}$ are defined accordingly. The new covariates are $x_{it} = \tilde{x}_{it} - \tilde{x}_{i.}$, and the new error is $\varepsilon_{it} = \tilde{\varepsilon}_{it} - \tilde{\varepsilon}_{i.} - \tilde{\varepsilon}_{.t} + \tilde{\varepsilon}_{..}$. Let $\beta = (\beta_1^\top, \ldots, \beta_N^\top)^\top$. To recover the sparsity pattern and the group structure, we propose the doubly penalized least squares (DPLS) estimator as the minimizer of
$$Q(\beta, A) = L(\beta) + \sum_{(i,j) \in A} p_{\lambda_1}(\|\beta_i - \beta_j\|) + \sum_{i=1}^N \sum_{j=1}^p p_{\lambda_2}(|\beta_{i,j}|), \quad (2)$$
where $L(\beta) = (NT)^{-1}\sum_{i=1}^N \sum_{t=1}^T (y_{it} - x_{it}^\top \beta_i + N^{-1}\sum_{j=1}^N x_{jt}^\top \beta_j)^2$ is the least squares loss function, $p_\lambda(\cdot)$ is a penalty function such as the SCAD penalty (Fan and Li 2001), $\lambda_1, \lambda_2$ are tuning parameters, and $A$ is a subset of $A_{\mathrm{full}} = \{(i,j) : 1 \leq i < j \leq N\}$. The additional term $N^{-1}\sum_{j=1}^N x_{jt}^\top \beta_j$ in the least squares loss $L(\beta)$ results from the within transformation. There are two penalties in the objective function: the pairwise fusion penalty $p_{\lambda_1}(\|\beta_i - \beta_j\|)$ shrinks the pairwise differences among the regression coefficients to detect latent group structures, and the second penalty $\sum_{i=1}^N \sum_{j=1}^p p_{\lambda_2}(|\beta_{i,j}|)$ regularizes the regression coefficients to recover the sparsity pattern of the covariates.
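For concreteness, here is a minimal Python sketch of the within transformation (our own illustration; the array layout and function names are assumptions, not the authors' code):

```python
import numpy as np

def within_transform(y_raw, x_raw):
    """Two-way within transformation of the response and one-way
    (subject-mean) transformation of the covariates, as in Section 2.
    y_raw: (N, T) array; x_raw: (N, T, p) array."""
    y = (y_raw
         - y_raw.mean(axis=1, keepdims=True)   # subtract subject means y_i.
         - y_raw.mean(axis=0, keepdims=True)   # subtract time means    y_.t
         + y_raw.mean())                       # add back grand mean    y_..
    x = x_raw - x_raw.mean(axis=1, keepdims=True)  # x_it = x~_it - x~_i.
    return y, x
```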
In a related model, Ma and Huang (2017) set $A = A_{\mathrm{full}}$ and penalized the differences of all $N(N-1)/2$ pairs of coefficients. This setting results in a great computational burden when $N$ is large. Ideally, we only have to penalize the pairs of coefficients in the set $A^* = \{(i,j) : i < j,\ i, j \in G_k,\ k = 1, \ldots, K\}$. To reduce the computational burden, we propose to find the smallest possible set $A$ such that $A$ includes $A^*$ with probability approaching 1 (w.p.a. 1).
Define the marginal estimator $\check{\beta}_{i,l} = (\sum_{t=1}^T x_{it,l}^2)^{-1} \sum_{t=1}^T x_{it,l}\, y_{it}$; it is a least squares estimator using only the observations of the $l$th covariate while ignoring all other covariates. The preliminary estimator is defined as $\check{\beta}_i = (\check{\beta}_{i,1}, \ldots, \check{\beta}_{i,p})^\top$, and the pruned index set is $A = \{(i,j) : i < j,\ \|\check{\beta}_i - \check{\beta}_j\|_\infty \leq \delta_{NT}\}$, where $\delta_{NT}$ is a proper threshold. With the above defined set $A$, we call $\hat{\beta} = \arg\min Q(\beta, A)$ the thresholded DPLS estimator (tDPLS). We call $\hat{\beta} = \arg\min Q(\beta, A_{\mathrm{full}})$ the fused DPLS estimator (fDPLS).
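The pruning step itself fits in a few lines of Python (a toy illustration; `beta_pre` stacks the preliminary estimates $\check{\beta}_i$ row-wise, and the names are ours):

```python
import numpy as np
from itertools import combinations

def prune_pairs(beta_pre, delta_nt):
    """Return A = {(i, j): i < j, ||beta_pre[i] - beta_pre[j]||_inf <= delta_nt}."""
    n = beta_pre.shape[0]
    return [(i, j) for i, j in combinations(range(n), 2)
            if np.abs(beta_pre[i] - beta_pre[j]).max() <= delta_nt]
```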
The marginal estimator has been extensively studied in the variable screening literature (Fan and Lv 2008; Guo et al. 2022). Usually, $\check{\beta}_{i,l}$ does not converge to $\beta^*_{i,l}$. However, $\check{\beta}_{i,l}$ and $\check{\beta}_{j,l}$ do converge to the same population value if subjects $i$ and $j$ are in the same group, and to different values otherwise; these results are established rigorously in Proposition 1. Therefore, the maximum distance $\|\check{\beta}_i - \check{\beta}_j\|_\infty$ is large when $i$ and $j$ are from different groups, and small otherwise. This property is the key to the success of the preliminary screening, as demonstrated in Proposition 1.
The tDPLS estimator is closely related to CARDS (Ke, Fan, and Wu 2015) and Panel-CARDS (Wang, Phillips, and Su 2018). CARDS and Panel-CARDS use the idea of segmentation to prune the pairwise penalty; we use a thresholding method, which is conceptually simple and straightforward. We use a toy example to show how the computational complexity is reduced. Suppose the $K$ groups are of equal size $N/K$. Then $A^*$ contains $K \cdot (N/K)(N/K - 1)/2 \approx N^2/(2K)$ pairs, whereas $A_{\mathrm{full}}$ contains $N(N-1)/2 \approx N^2/2$ pairs, so an ideal pruning reduces the number of penalty terms by roughly a factor of $K$.

The Algorithm
With the penalties on pairwise differences of coefficients, the CD algorithm cannot be applied directly to solve the optimization problem (2). In this section, we propose to minimize the objective function (2) by combining the ADMM algorithm with the CD algorithm. We first introduce some auxiliary parameters to circumvent the daunting task of penalizing the pairwise differences of coefficients directly. Letting $\eta_{ij} = \beta_i - \beta_j$, the minimization problem can be rewritten as
$$\min_{\beta, \eta}\ L(\beta) + \sum_{(i,j) \in A} p_{\lambda_1}(\|\eta_{ij}\|) + \sum_{i=1}^N \sum_{j=1}^p p_{\lambda_2}(|\beta_{i,j}|) \quad \text{subject to } \beta_i - \beta_j - \eta_{ij} = 0,\ (i,j) \in A, \quad (3)$$
where $\eta = \{\eta_{ij}, (i,j) \in A\}$. We use the ADMM algorithm (Boyd et al. 2011) to solve this constrained minimization problem.
• Step 1. Initialize $\beta^{(1)}$, $\eta^{(1)}$, and the Lagrangian multipliers $v^{(1)}$.
• Step 2. At the $k$th iteration, update $\beta^{(k+1)}$ by solving the penalized least squares problem (4), update $\eta^{(k+1)}$ by the thresholding rule (5), and update the multipliers $v^{(k+1)}_{ij} = v^{(k)}_{ij} + \rho(\beta^{(k+1)}_i - \beta^{(k+1)}_j - \eta^{(k+1)}_{ij})$, where $\rho$ is the penalty factor of the augmented Lagrangian.
• Step 3. If the stopping rule is satisfied, terminate the algorithm; if not, go to Step 2.
We use the preliminary estimator $\check{\beta}$ introduced in Section 2 as the initial value of $\beta$. The initial values for the augmented variables are $\eta^{(1)}_{ij} = \check{\beta}_i - \check{\beta}_j$. The initial Lagrangian multipliers $v$ are set to 0 in our algorithm.
Let $D = \{e_i - e_j, (i,j) \in A\}$, where $e_i$ is a unit vector of length $N$ whose $i$th element is 1 and the rest are 0. Let $\Delta = D \otimes I_p$, where $I_p$ is the identity matrix of size $p \times p$. To update $\beta$, we need to solve the following penalized least squares problem,
$$\beta^{(k+1)} = \arg\min_\beta\ L(\beta) + \frac{\rho}{2}\big\|\Delta\beta - \eta^{(k)} + \rho^{-1}v^{(k)}\big\|^2 + \sum_{i=1}^N \sum_{j=1}^p p_{\lambda_2}(|\beta_{i,j}|). \quad (4)$$
The least squares part of (4) can be written in matrix form with a block design matrix $A_X$, whose $(i,j)$th block ($i, j = 1, \ldots, N$) collects the covariate terms of subjects $i$ and $j$ arising from $L(\beta)$, and response vectors $Y_i = (y_{i1}, \ldots, y_{iT})^\top$. The minimization problem (4) can be efficiently solved by the CD algorithm; we refer to Breheny and Huang (2011) for the details. The CD algorithm exploits the sparse structure of $\beta$ when solving (4), and it is fast and efficient for sparse high-dimensional optimization problems, which enables us to handle large datasets.
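The following sketch shows the core of a coordinate descent pass with the SCAD penalty for a generic penalized least squares problem of the same form as (4). It assumes standardized columns and omits the augmented Lagrangian terms, so it illustrates the CD building block under our simplifications rather than reproducing the authors' implementation:

```python
import numpy as np

def scad_threshold(z, lam, a=3.7):
    """Univariate SCAD solution for a standardized design
    (Fan and Li 2001; Breheny and Huang 2011)."""
    az = abs(z)
    if az <= 2 * lam:                       # soft-thresholding region
        return np.sign(z) * max(az - lam, 0.0)
    if az <= a * lam:                       # linearly interpolated region
        return np.sign(z) * max(az - a * lam / (a - 1), 0.0) / (1 - 1 / (a - 1))
    return z                                # no-shrinkage region

def cd_scad(X, y, lam, n_sweeps=200):
    """Cyclic CD for (2n)^{-1} ||y - X b||^2 + sum_j p_lam(|b_j|),
    assuming the columns of X have mean 0 and variance 1."""
    n, p = X.shape
    b = np.zeros(p)
    r = y.copy()                            # residual y - X b
    for _ in range(n_sweeps):
        for j in range(p):
            z = X[:, j] @ r / n + b[j]      # input of the univariate problem
            b_new = scad_threshold(z, lam)
            r -= X[:, j] * (b_new - b[j])   # keep residual in sync
            b[j] = b_new
    return b
```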
To update $\eta$, we need to solve the following minimization problem,
$$\eta^{(k+1)}_{ij} = \arg\min_{\eta_{ij}}\ \frac{\rho}{2}\big\|\zeta^{(k)}_{ij} - \eta_{ij}\big\|^2 + p_{\lambda_1}(\|\eta_{ij}\|), \quad \text{where } \zeta^{(k)}_{ij} = \beta^{(k+1)}_i - \beta^{(k+1)}_j + \rho^{-1}v^{(k)}_{ij}.$$
There exist closed-form solutions for the SCAD and MCP penalties.
The closed-form solution for the SCAD penalty is
$$\eta^{(k+1)}_{ij} = \begin{cases} S(\zeta^{(k)}_{ij}, \lambda_1/\rho) & \text{if } \|\zeta^{(k)}_{ij}\| \leq \lambda_1 + \lambda_1/\rho, \\[2pt] \dfrac{S\big(\zeta^{(k)}_{ij}, a\lambda_1/((a-1)\rho)\big)}{1 - 1/((a-1)\rho)} & \text{if } \lambda_1 + \lambda_1/\rho < \|\zeta^{(k)}_{ij}\| \leq a\lambda_1, \\[2pt] \zeta^{(k)}_{ij} & \text{if } \|\zeta^{(k)}_{ij}\| > a\lambda_1, \end{cases} \quad (5)$$
where $S(z, t) = (1 - t/\|z\|)_+ z$ is the groupwise soft-thresholding operator and $a$ is a parameter in the SCAD penalty function.
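In code, the $\eta$-update is a groupwise SCAD thresholding applied pair by pair; the sketch below follows the closed form (5) and assumes $(a-1)\rho > 1$ (names are ours):

```python
import numpy as np

def scad_group_threshold(zeta, lam, rho, a=3.7):
    """Groupwise SCAD thresholding for one pair: solves
    min_eta (rho/2)||zeta - eta||^2 + p_lam(||eta||), assuming (a-1)*rho > 1."""
    def soft(z, t):                          # S(z, t) = (1 - t/||z||)_+ z
        nz = np.linalg.norm(z)
        return np.zeros_like(z) if nz <= t else (1 - t / nz) * z

    nz = np.linalg.norm(zeta)
    if nz <= lam + lam / rho:
        return soft(zeta, lam / rho)
    if nz <= a * lam:
        return soft(zeta, a * lam / ((a - 1) * rho)) / (1 - 1 / ((a - 1) * rho))
    return zeta
```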
The primal and dual residuals at the $k$th step are defined as $r^{(k)} = \Delta\beta^{(k)} - \eta^{(k)}$ and $s^{(k)} = \rho\,\Delta^\top(\eta^{(k)} - \eta^{(k-1)})$. The algorithm is terminated when the norms of the primal and dual residuals are less than a threshold, say $10^{-6}$. The resulting estimator is denoted as $\hat{\beta}$. When the algorithm terminates, we put subjects $i$ and $j$ in the same group if $\hat{\eta}_{ij} = 0$ for $(i,j) \in A$. The estimated number of groups is denoted $\hat{K}$.
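Putting the pieces together, the ADMM iteration has the familiar three-step structure; the skeleton below (our sketch, with the subproblem solvers passed in as callbacks) makes the update order and the stopping rule explicit:

```python
import numpy as np

def admm_dpls(update_beta, update_eta, Delta, beta0, eta0,
              rho=1.0, tol=1e-6, max_iter=500):
    """ADMM skeleton for min L(b) + penalties s.t. Delta b = eta.
    update_beta solves (4) (e.g., by CD); update_eta applies rule (5)."""
    beta, eta = beta0.copy(), eta0.copy()
    v = np.zeros_like(eta)                        # Lagrange multipliers
    for _ in range(max_iter):
        beta = update_beta(eta, v)                # beta-update, problem (4)
        eta_new = update_eta(Delta @ beta + v / rho)  # eta-update, rule (5)
        v += rho * (Delta @ beta - eta_new)       # dual ascent
        r_primal = np.linalg.norm(Delta @ beta - eta_new)
        r_dual = rho * np.linalg.norm(Delta.T @ (eta_new - eta))
        eta = eta_new
        if max(r_primal, r_dual) < tol:           # stopping rule
            break
    return beta, eta, v
```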
Wang, Yin, and Zeng (2019) established the convergence theory of the ADMM algorithm for minimizing a nonconvex and possibly nonsmooth objective function with linear constraints. They proved that the iteration sequence produced by the ADMM algorithm has limit points, and all the limit points are stationary points of the augmented Lagrangian. Our algorithm is covered by the general framework of Wang, Yin, and Zeng (2019), and we conclude that the proposed algorithm is guaranteed to converge to a local minimum of the objective function.

Asymptotic Properties
In this section, we study the large sample properties of the proposed estimator. We derive a lower bound on the threshold $\delta_{NT}$ such that the estimated index set $A \supset A^*$ w.p.a. 1. We show that under some regularity conditions, the proposed estimator recovers the group structure and the sparsity pattern simultaneously w.p.a. 1. The asymptotic distribution of the estimator is also derived, which can be used for inference on the regression coefficients.
We assume that the true parameters $\beta_i^*$, $i = 1, \ldots, N$, are sparse. Let the sparsity pattern be $J_i = \{1 \leq j \leq p : \beta^*_{i,j} \neq 0\}$. Denote $J_k$ as the sparsity pattern of the $k$th group, that is, $J_k = J_i$ for $i \in G_k$, and let $q_k = |J_k|$. Define the block matrix $H$ whose $(i, k)$th block is the identity matrix $I_{q_k}$ if $i \in G_k$ and a $q_i \times q_k$ zero matrix otherwise. By the definition of $H$, we have $\beta^*_J = H\alpha^*_J$, where $\beta^*_J$ and $\alpha^*_J$ are the true values of $\beta_J$ and $\alpha_J$, respectively. The oracle estimator $\hat{\beta}^{or}$ is defined as $\hat{\beta}^{or}_J = H\hat{\alpha}^{or}_J$, where
$$\hat{\alpha}^{or}_J = \arg\min_{\alpha_J} L(H\alpha_J), \quad (6)$$
and $\hat{\beta}^{or}_{J^c}$ is a vector of 0. The oracle estimator is not available in data analysis, but it is very handy in theoretical analysis.
We use the framework of Wu (2005) for the dependence measure and asymptotic analysis. In this framework, the errors and the covariates are stationary processes modeled as
$$\tilde{\varepsilon}_{i,t} = g_i(\xi_{i,t}), \qquad \tilde{x}_{i,t} = h(\xi_{i,t}), \quad (7)$$
where $e_{i,t}$, $t \in \mathbb{Z}$, are iid random vectors, $\xi_{i,t} = (\ldots, e_{i,t-1}, e_{i,t})$, and $g_i(\cdot)$ and $h(\cdot) = (h_1(\cdot), \ldots, h_p(\cdot))^\top$ are measurable functions. We use the notations $\tilde{\varepsilon}_{i,t}$ and $\tilde{\varepsilon}_{it}$ interchangeably as long as no confusion arises. The process (7) represents a large class of stationary processes. For example, linear processes $\tilde{\varepsilon}_{it} = \sum_{j=0}^\infty a_{ij} e_{i,t-j}$ with $\sum_{j=0}^\infty a^2_{ij} < \infty$, and thus ARMA processes, are included; see more examples in Wu (2005). Let $(e'_{i,t}, t \in \mathbb{Z})$ be an iid copy of $(e_{i,t}, t \in \mathbb{Z})$, let $\tilde{\varepsilon}'_{i,t} = g_i(\xi'_{i,t})$ with $\xi'_{i,t} = (\ldots, e_{i,-1}, e'_{i,0}, e_{i,1}, \ldots, e_{i,t})$, and denote the $r$th moment norm of $\tilde{\varepsilon}_{it}$ by $\|\tilde{\varepsilon}_{it}\|_r = (E|\tilde{\varepsilon}_{i,t}|^r)^{1/r}$. The functional dependence measure (FDM), $\gamma^\varepsilon_{i,t,r} = \|\tilde{\varepsilon}_{it} - \tilde{\varepsilon}'_{it}\|_r$, quantifies the dependence of $\tilde{\varepsilon}_{i,t}$ on $e_{i,0}$ using the idea of coupling. For example, if $\tilde{\varepsilon}_{it} = \sum_{j=0}^\infty a_{ij} e_{i,t-j}$, then $\gamma^\varepsilon_{i,t,r} = |a_{it}| \|e_{i,0} - e'_{i,0}\|_r$. To measure the strength of dependence in the time series, we define the dependence adjusted norm $\|\tilde{\varepsilon}\|_{\psi_{\nu_\varepsilon}} = \sup_{r \geq 2} r^{-\nu_\varepsilon} \sum_{t=0}^\infty \sup_i \gamma^\varepsilon_{i,t,r}$, introduced by Wu and Wu (2016), where $\nu_\varepsilon > 0$ is a constant. If $\tilde{\varepsilon}_{it}$ is a linear process, then $\nu_\varepsilon = 1$ if the innovations $e_{i,t-j}$ are sub-exponential, and $\nu_\varepsilon = 1/2$ if they are sub-Gaussian. The dependence measure for the covariates, $\|x\|_{\psi_{\nu_x}}$, is defined analogously for some constant $\nu_x > 0$. Before stating the assumptions, we introduce some further notation. Denote $\Sigma_i = \mathrm{var}(\tilde{x}_{it})$ and $\sigma_{i,lk} = \mathrm{cov}(\tilde{x}_{it,l}, \tilde{x}_{it,k})$. Let $b_{1,NT} = \min_{1 \leq k < l \leq K} \|\alpha_k^* - \alpha_l^*\|$ be the minimum distance between group coefficients, and let $b_{2,NT} = \min_{k=1,\ldots,K} \min_{j \in J_k} |\alpha^*_{kj}|$ be the minimum coefficient magnitude; $b_{1,NT}$ and $b_{2,NT}$ act as signal strengths in the asymptotic analysis. Let $\kappa_{NT}$ be the rate sequence defined in Theorem 1; it is the convergence rate of the oracle estimator derived there.
Assumption 1. (a) The covariates $\tilde{x}_{it}$ are generated from the causal process (7). (b) The dependence measure satisfies $\|x\|_{\psi_{\nu_x}} < \infty$ for some $\nu_x > 0$.
Assumption 1(b) requires that the dependence measure of $\tilde{x}_{it}$ is finite. Along with Assumption 2, it allows us to establish the exponential tail probabilities that are essential for proving the properties of penalized estimators (Bickel, Ritov, and Tsybakov 2009). Assumption 1(c) ensures that the covariates neither degenerate nor diverge. Assumptions 1(d) and 1(f) are used to establish the consistency of the oracle estimator; in a low-dimensional setting, these two assumptions imply that the sample covariance matrix of the covariates is consistent and that its population version is of full rank. These assumptions are commonly adopted in panel data analysis. Assumption 2(c) restricts the magnitude of the true regression coefficients. Assumption 3 requires that the signal strengths $b_{1,NT}$ and $b_{2,NT}$ be larger than the estimation error of the oracle estimator, and that $\lambda_1$ and $\lambda_2$ be less than the signal strengths $b_{1,NT}$ and $b_{2,NT}$, respectively, to avoid bias.
We illustrate the covariates with a vector linear process. Let $e_{i,t} = (e_{i,t,1}, \ldots, e_{i,t,p})^\top$ be iid random vectors whose elements $e_{i,t,1}, \ldots, e_{i,t,p}$ are iid with mean 0, variance 1, and finite moments of all orders. Let $B_0, B_1, \ldots$ be $p \times p$ matrices such that $\sum_{t=0}^\infty \|B_t\| < \infty$, where $\|B_t\|$ is the spectral norm of $B_t$. By Kolmogorov's three-series theorem, the process $\tilde{x}_{it} = \sum_{l=0}^\infty B_l e_{i,t-l}$ is well defined. If $\sup_{r \geq 2} r^{-\nu} \|e_{0,0,1}\|_r < \infty$ for some $\nu > 0$, it is easy to check that $\|x\|_{\psi_{\nu_x}} \leq 2\sum_{t=0}^\infty \|B_t\| \sup_{r \geq 2} r^{-\nu} \|e_{0,0,1}\|_r$ with $\nu_x = \nu + 1/2$. Stationary vector autoregressive models are special cases of this example.
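The vector linear process is straightforward to simulate by truncating the lag expansion; the snippet below (our illustration, with the truncation length and the VAR(1)-style example as assumptions) generates one subject's covariate path:

```python
import numpy as np

def vector_linear_process(T, B_list, rng, burn=100):
    """Simulate x_t = sum_l B_l e_{t-l}, truncated at len(B_list) lags.
    B_list: list of (p, p) coefficient matrices; e_t are iid N(0, I_p)."""
    p = B_list[0].shape[0]
    n = T + burn
    e = rng.standard_normal((n, p))
    x = np.zeros((n, p))
    for l, B in enumerate(B_list):
        x[l:] += e[:n - l] @ B.T   # adds the term B_l e_{t-l} for each t >= l
    return x[burn:]                # drop the burn-in period

# Example: a VAR(1)-type process via its moving-average representation.
rng = np.random.default_rng(0)
A = 0.5 * np.eye(3)
x = vector_linear_process(T=40, B_list=[np.linalg.matrix_power(A, l) for l in range(20)], rng=rng)
```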
Recall that $A$ is defined by $A = \{(i,j) : i < j,\ \|\check{\beta}_i - \check{\beta}_j\|_\infty \leq \delta_{NT}\}$. We first derive a lower bound on the threshold $\delta_{NT}$ ensuring that $A$ includes $A^*$ w.p.a. 1. These results are built on the convergence rate of the marginal estimator. Properties of marginal estimators have been established by Fan and Lv (2008) and Guo et al. (2022); we generalize these results to a panel data model with correlated covariates and errors generated from causal sequences.
Proposition 1. Under Assumptions 1 and 2, if $\log(Np)^{1+4\nu}/T \to 0$ and $\sup_i |J_i|$ is bounded, then $\sup_{i,l} |\check{\beta}_{i,l} - b_{i,l}| = O_P\big(\sqrt{\log(Np)^{1+4\nu}/T}\big)$.
Proposition 1 establishes the uniform convergence of the marginal estimator $\check{\beta}_{i,l}$ to its population value $b_{i,l}$ at a uniform $T^{-1/2}$ rate with an additional $\log(Np)$ factor. We require $\sup_i |J_i|$ to be bounded to ensure that the $b_{i,l}$ do not diverge asymptotically. Although $b_{i,l}$ does not equal $\beta^*_{i,l}$ unless all the covariates are orthogonal to each other, $b_{i,l}$ does equal $b_{j,l}$ whenever subjects $i$ and $j$ are in the same group by Assumption 1(d), and they differ otherwise. With the uniform convergence of $\check{\beta}_{i,l}$, the maximum distance between $\check{\beta}_i$ and $\check{\beta}_j$, that is, $\|\check{\beta}_i - \check{\beta}_j\|_\infty$, converges to zero under the condition $\log(Np)^{1+4\nu}/T \to 0$ when subjects $i$ and $j$ are in the same group, and $\|\check{\beta}_i - \check{\beta}_j\|_\infty$ is bounded away from zero otherwise. Therefore, with a proper threshold $\delta_{NT}$, we can use the maximum distance between the preliminary estimators to filter out pairs of subjects that are not in the same group and ensure that $A \supset A^*$ w.p.a. 1. The number of covariates $p$ can be as large as $\exp\{T^{1/(1+4\nu)}\}$.
The assumptions of Theorem 1, together with Assumption 1(e), ensure that the objective function (6) defining the oracle estimator has a positive definite Hessian matrix asymptotically. Theorem 1 establishes the convergence rate of the oracle estimator. When the number of groups $K$ is constant and the group sizes are balanced, $\hat{\alpha}^{or}_J$ converges at the $\sqrt{NT}$ rate up to a logarithmic factor, which is comparable to the classical results on convergence rates of estimators of increasing dimension (He and Shao 2000). The logarithmic term is due to the dependence in the covariates and the errors.
In Theorem 2, we establish that the oracle estimator is a local minimizer of the objective function w.p.a. 1; thus, with suitable algorithms, the group structure and the sparsity pattern can be recovered asymptotically. By the assumption in Theorem 2 and the restriction that $|G_{\min}| \leq N/K$, we need $K^{3/2} Q_m \ll NT/(\log NT)^{1+2\nu}$; thus, the number of groups must satisfy $K \ll (NT/(\log NT)^{1+2\nu})^{1/3}$. From Assumption 3, the signal strength $b_{1,NT}$ must be greater than $\kappa_{NT}$, and the smallest coefficient $b_{2,NT}$ must be greater than $\log(Np)^{1/\chi_1}/\sqrt{T}$. When $Q_m$ is fixed, the number of covariates $p$ can be as large as $\exp\{T^{1/(2+4\nu)}\}$ by the assumption in Theorem 2. In a related work, Wang and Su (2021) restrict $p \ll T^{2/3}$ to ensure consistency of group detection. We greatly relax this restriction and allow $p$ to grow exponentially as a function of $T$, although we need a more stringent condition on the moments of the covariates and the errors.

Simulation Study
We investigate the finite sample performance of the proposed estimator by a simulation study in this section. To determine the tuning parameters in the objective function, we minimize the modified Bayesian information criterion (mBIC) of Wang, Li, and Tsai (2007),
$$\mathrm{mBIC}(\lambda_1, \lambda_2) = \log\big\{L(\hat{\beta}(\lambda_1, \lambda_2))\big\} + C_{NT}\, \frac{\log(NT)}{NT} \sum_{k=1}^{\hat{K}(\lambda_1, \lambda_2)} \hat{q}_k(\lambda_1, \lambda_2), \quad (9)$$
where $\hat{q}_k(\lambda_1, \lambda_2)$, $k = 1, \ldots, \hat{K}$, are the estimated numbers of nonzero coefficients in the $k$th group, and $C_{NT}$ is a diverging sequence that depends on the sample sizes $N$ and $T$. Note that we have made the dependence of $\hat{\beta}(\lambda_1, \lambda_2)$, $\hat{q}_k(\lambda_1, \lambda_2)$, and $\hat{K}(\lambda_1, \lambda_2)$ on $\lambda_1$ and $\lambda_2$ explicit in (9). In this article, we adopt the strategy of Wang, Li, and Tsai (2007) and set $C_{NT} = \log(\log(NT + p))$. Minimizing (9) simultaneously in both $\lambda_1$ and $\lambda_2$ is very time-consuming, so we select $\lambda_1$ and $\lambda_2$ sequentially: we first fix $\lambda_2$ at a small value in $\mathrm{mBIC}(\lambda_1, \lambda_2)$ and minimize (9) to select $\lambda_1$; once $\lambda_1$ is determined, we fix it in $\mathrm{mBIC}(\lambda_1, \lambda_2)$ and select $\lambda_2$ by minimizing (9).

In tDPLS, the threshold parameter $\delta_{NT}$ controls how the index set $A$ is constructed. Proposition 1 calibrates the lower bound on $\delta_{NT}$ such that $A \supset A^*$ w.p.a. 1. In practice, we propose to choose $\delta_{NT}$ by clustering the preliminary estimates $\check{\beta}$ with agglomerative hierarchical clustering, using the maximum distance between points and complete linkage between clusters. The number of clusters is selected by the gap statistic with the one-standard-error rule; we refer to Tibshirani, Walther, and Hastie (2001) for the details of the gap statistic. Let $\check{K}$ be the selected number of clusters for the preliminary estimators; $\delta_{NT}$ is then the maximum within-cluster distance, $\delta_{NT} = \max_{k=1,\ldots,\check{K}} \max_{i,j \in \check{G}_k} \|\check{\beta}_i - \check{\beta}_j\|_\infty$. The grouping generated by this naive hierarchical clustering (Hclust) method is also reported in the simulation study. Throughout the simulation study, the penalty function in the objective function (2) is SCAD.
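The threshold rule can be sketched with standard SciPy tools; here the number of clusters $\check{K}$ is taken as given (in the article it comes from the gap statistic, which we do not reproduce), and all names are ours:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def select_delta(beta_pre, n_clusters):
    """delta_NT = max within-cluster sup-norm distance after
    complete-linkage clustering of the preliminary estimates."""
    d = pdist(beta_pre, metric="chebyshev")            # pairwise sup-norm distances
    labels = fcluster(linkage(d, method="complete"),
                      n_clusters, criterion="maxclust")
    delta = 0.0
    for k in np.unique(labels):
        members = beta_pre[labels == k]
        if len(members) > 1:
            delta = max(delta, pdist(members, metric="chebyshev").max())
    return delta
```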
We use the Rand index (RI) of Rand (1971) to measure the accuracy of the estimated grouping. The Rand index takes values in [0, 1], with values close to 1 indicating that the true and estimated grouping structures are similar. The estimation error (EE) is defined as the average of $\|\hat{\beta} - \beta^*\|/\sqrt{Np}$ over the simulation runs, which measures the accuracy of parameter estimation.
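The Rand index counts the pairs of subjects on which two groupings agree; a direct implementation (our sketch) is:

```python
from itertools import combinations

def rand_index(labels_true, labels_est):
    """Fraction of subject pairs classified consistently by both groupings."""
    n = len(labels_true)
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_est[i] == labels_est[j])
        for i, j in combinations(range(n), 2)
    )
    return agree / (n * (n - 1) // 2)
```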
Example 1. The setting is adapted from Su, Shi, and Phillips (2016). The observations $(\tilde{y}_{it}, \tilde{x}_{it})$ are generated from model (1). The covariates are $\tilde{x}_{it} = 0.1\mu_i + e_{it}$, where $e_{it}$ is a random vector with independent standard normal components and is independent of $\mu_i$. The fixed effects $\mu_i$ and the errors $\tilde{\varepsilon}_{it}$ are generated independently from the standard normal distribution. There are no time effects in this example, because Su, Shi, and Phillips's (2016) method does not cover models with time effects. The number of groups is $K = 3$, and the true coefficients are $(0.4, 1.6, \mathbf{0}_{p-2}^\top)^\top$, $(1.0, 1.0, \mathbf{0}_{p-2}^\top)^\top$, and $(1.6, 0.4, \mathbf{0}_{p-2}^\top)^\top$ for the three groups, respectively, where $\mathbf{0}_{p-2}$ is a zero vector of length $p-2$. The subjects are drawn from these three groups with proportions 0.3 : 0.3 : 0.4. We compare our methods with the Hclust estimator and the classifier-Lasso (CLASSO) of Su, Shi, and Phillips (2016). CLASSO is a variant of the LASSO penalty that takes a mixed additive-multiplicative form; it is designed for identifying and estimating latent group structures in panel data, but it cannot recover group structures for high-dimensional data. The oracle estimator is used as a reference level in the simulation.
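The data generating process of Example 1 is simple to reproduce; the following sketch (our code, with shapes and names assumed) generates one Monte Carlo dataset:

```python
import numpy as np

def simulate_example1(N=100, T=20, p=5, seed=0):
    """Example 1: y_it = x_it' beta_i + mu_i + eps_it, no time effects."""
    rng = np.random.default_rng(seed)
    alphas = np.array([[0.4, 1.6], [1.0, 1.0], [1.6, 0.4]])  # group coefficients
    groups = rng.choice(3, size=N, p=[0.3, 0.3, 0.4])        # group memberships
    beta = np.zeros((N, p))
    beta[:, :2] = alphas[groups]                             # first two coords active
    mu = rng.standard_normal(N)                              # individual effects
    x = 0.1 * mu[:, None, None] + rng.standard_normal((N, T, p))
    y = np.einsum("itp,ip->it", x, beta) + mu[:, None] + rng.standard_normal((N, T))
    return y, x, groups, beta
```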
In the simulation, the sample sizes are N = 50, 100, 200 and T = 20, 40, and the number of covariates is p = 2, 5, 10. The boxplots of the Rand index (RI) and the estimation error (EE) for N = 100 over 100 simulation runs are depicted in Figures 1 and 2, respectively; the results for N = 50 and N = 200 are shown in the supplementary materials. We observe that when there are no redundant covariates, for example, p = 2, all the methods perform equally well. When there are redundant covariates, for example, p = 5 and p = 10, both fDPLS and tDPLS outperform CLASSO: they attain higher RI and lower estimation error. The proposed estimator can eliminate covariates that are irrelevant to the model, which significantly improves the accuracy of group structure detection. Comparing fDPLS and tDPLS, we observe that tDPLS performs better in recovering group structures and in estimation accuracy in almost all cases (Figures 1 and 2). As the sample size T increases, the estimation errors of both fDPLS and tDPLS approach that of the oracle estimator, as the theory predicts (Figure 2).
Example 2. The covariates are generated by $\tilde{x}_{it} = 0.1\mu_i + 0.1\vartheta_t + e_{it}$. The term $e_{it}$ is a normal random vector with mean zero, and the covariance between its $i$th and $j$th components is $0.5^{|i-j|}$. The fixed effects $\mu_i$ and $\vartheta_t$ are generated independently from the standard normal distribution, and the model errors $\tilde{\varepsilon}_{it}$ are generated from $N(0, 1)$. The subjects are sampled from K = 3 groups with proportions 1 : 1 : 1. There are three significant covariates in each group, with sparsity pattern $J_1 = J_2 = J_3 = \{1, 2, 3\}$. For each group, the regression coefficients are generated from the uniform distribution $U(0.8, 2.0)$, subject to the restriction that the Euclidean distances between the group coefficient vectors exceed 1 to ensure separability between groups. The sample sizes are N = 50, 100, 200 and T = 20, 40, and the dimension of the covariates is p = 100.
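The separability restriction on the group coefficients can be enforced by simple rejection sampling; a sketch under the stated design (our names):

```python
import numpy as np

def draw_group_coefficients(K=3, s=3, min_gap=1.0, seed=0):
    """Draw K coefficient vectors with nonzero entries ~ U(0.8, 2.0) on the
    first s coordinates, redrawing until every between-group Euclidean
    distance exceeds min_gap."""
    rng = np.random.default_rng(seed)
    while True:
        alpha = rng.uniform(0.8, 2.0, size=(K, s))
        gaps = [np.linalg.norm(alpha[k] - alpha[l])
                for k in range(K) for l in range(k + 1, K)]
        if min(gaps) > min_gap:
            return alpha
```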
We compare the fDPLS, tDPLS, and Hclust estimators, with the performance of the oracle estimator shown as a reference. Results for CLASSO are not available because it cannot handle high-dimensional data. The boxplots of the Rand index and the estimation error are presented in Figures 3 and 4; the computing time and the selected number of groups are reported in the supplement. As predicted by the theory in Section 4, the performance of both fDPLS and tDPLS improves as T increases: the Rand index increases and the estimation error decreases. Generally speaking, tDPLS outperforms fDPLS in terms of RI and estimation error (Figures 3 and 4). From Figure A.5, supplementary materials, we observe that the computational time of tDPLS is much less than that of fDPLS; in most cases, tDPLS needs only half the computing time of fDPLS. We conclude that pruning the pairwise penalty reduces computational time and improves grouping and estimation accuracy.

Economic Growth Data Analysis
In this section, we analyze an economic growth dataset to illustrate applications of the proposed methods. The relationship between economic growth and demographic variables has been investigated by many authors. Demographic factors offer potential opportunities for economic growth through several channels, the so-called first and second demographic dividends. For example, a large working-age population relative to the non-working-age population supplies more labor to the market; as a result, more savings and more resources are accumulated for economic development (Prskawetz et al. 2007). Lindh and Malmberg (1999) found that in the OECD countries the 50-64 age group has the largest positive effect on the growth rate of GDP per worker. Prskawetz et al. (2007) summarized that the working-age population is positively and significantly related to growth, the youth population is negatively related to growth, while the effect of the elderly population is ambiguous.
To obtain a deeper understanding of economic growth and demographic factors in a large panel of developed and developing countries, we study the relationship between age structure and economic growth using the proposed model. The demographic data are extracted from the World Bank Development Indicators, and the economic data are from the Penn World Table (Feenstra, Inklaar, and Timmer 2015). After removing missing values and variables that do not show enough within-subject variation, a panel of 84 countries over the period 1975 to 2017 is included in the analysis. The response is the growth rate of GDP per capita measured in 2011 U.S. dollars. We consider a rich set of control variables, such as human capital, capital services, and price levels. The first lag of log GDP per capita is also included to account for the convergence effect. The included demographic variables are the log population size, the population growth rate, the urban population growth rate, and age structure variables that span every ten years, that is, the log share of the population aged between 0 and 14, 15 and 24, and so on. The detailed list of control variables is shown in Table A.1, supplementary materials. Besides the basic control variables, we allow for general trends through the interaction of a linear trend with the control variables and their second-order terms. The total number of control variables is 75. We use the marginal estimator as the preliminary estimator and the SCAD penalty for the tDPLS estimator; the penalty factor in the ADMM algorithm is set to $\rho = 1$, and the tuning parameters in the objective function (2) are selected by the mBIC criterion (9).
After estimation, we detect four groups consisting of 18, 39, 15, and 9 countries, respectively. The group members are shown in Figure 5, and a detailed list of the countries in each group is given in Table A.2, supplementary materials. The first and second groups consist mostly of developed countries, although some developing countries are also included; for example, Switzerland and Denmark are in the first group, while Canada, Germany, and the United States are in the second group. The third and fourth groups consist mostly of developing countries; for example, China and Indonesia are in the third group. The third group has the lowest average GDP per capita, and most countries in the fourth group experienced two significant recessions, in 1980-1986 and 2014-2016. The average GDP per capita for the detected groups is shown in Figure A.10, supplementary materials. The selected variables and the estimated coefficients are shown in Table 1. Conditional on the selected group structures and covariates, the standard errors of the estimated coefficients are calculated with the clustered covariance matrix (CCM) estimator of Arellano (1987) and Hansen (2007). The CCM estimator allows essentially arbitrary correlation within each subject, and it does not require a kernel function or a bandwidth. In total, 13 covariates were selected by our method, with 3, 8, 3, and 7 covariates selected for groups 1 to 4, respectively. The first lag of GDP per capita is significant in groups 1, 2, and 4, which signifies convergence effects. We focus on the effects of the demographic covariates in the following. The age structure covariates log age 25 to 34 and log age 65 plus have significant positive effects in the second group, which comprises mostly developed countries such as the United States. The positive age effects indicate that the economic growth of the countries in group 2 has benefited over the past half-century both from the working-age population, which represents the labor force (Prskawetz et al. 2007), and from the elderly population, a finding supported by Acemoglu and Restrepo (2017), who explain this effect as the endogenous response of technology. In group 3, the youth working-age population share (log age 15 to 24) has a negative but insignificant effect. Lastly, group 1 shows a positive effect of the interaction between the total population growth rate and time.

Conclusion
In this article, we explored the heterogeneity of slope coefficients across subjects in a high-dimensional panel linear regression model. We proposed the doubly penalized least squares (DPLS) estimator to detect latent groups and select significant covariates simultaneously. A thresholded version of DPLS (tDPLS) was also proposed to alleviate the computational complexity of DPLS; it exploits a preliminary estimate to prune the pairwise penalty in DPLS. The tDPLS estimator not only saves computational time but also improves estimation and clustering accuracy, as demonstrated in the simulation study. Through simulation studies, we found that DPLS can eliminate covariates that are irrelevant to the model, which significantly improves the accuracy of group structure detection compared to Su, Shi, and Phillips (2016). For future work, high-dimensional heterogeneous nonlinear models and quantile regression models are interesting extensions for exploring nonlinear heterogeneity effects in large panel data.

Figure 1. The boxplot of the Rand index (RI) for Example 1 with N = 100 over 100 simulation runs. The dimensions p are 2, 5, 10 from left to right; T ranges from 20 to 40 from top to bottom.

Figure 2. The boxplot of the estimation error (EE) for Example 1 with N = 100 over 100 simulation runs. The dimensions p are 2, 5, 10 from left to right; T ranges from 20 to 40 from top to bottom.

Figure 3. The boxplot of the Rand index (RI) for Example 2 over 100 simulation runs.

Figure 4. The boxplot of the estimation error (EE) for Example 2 over 100 simulation runs.

Figure 5. The estimated groups for the economic growth data.

Table 1. The estimated coefficients of the selected covariates for the economic growth data; standard errors are listed in parentheses. NOTE: The covariate log age 15 to 24 is the log share of the population aged between 15 and 24 years. HC is the human capital index. RKNA is capital services. RWTFPNA is the welfare-relevant total factor productivity. IRR is the real internal rate of return. POP is the total population growth rate. PLN, PLK, and PLGDPO are the price levels of the capital stock, capital services, and current-price GDP, respectively.