Detecting Unobserved Heterogeneity in Efficient Prices via Classifier-Lasso

Abstract This article proposes a new measure of efficient price as a weighted average of bid and ask prices, where the weights are constructed from the bid-ask long-run relationships in a panel error-correction model (ECM). To allow for heterogeneity in the long-run relationships, we consider a panel ECM with latent group structures, so that stocks within a group share the same long-run relationship while stocks in different groups do not. We extend the Classifier-Lasso method to the ECM to simultaneously identify each individual's group membership and estimate the group-specific long-run relationship. We establish the uniform classification consistency and good asymptotic properties of the post-Lasso estimators under some regularity conditions. Empirically, we find that more than 30% of the Standard & Poor's (S&P) 1500 stocks have estimated efficient prices significantly deviating from the midpoint—a conventional measure of efficient price. Such deviations, uncovered by our data-driven method, can provide dynamic information on the extent and direction of informed trading activities.


Introduction
Large dimensional financial and macroeconomic panel data typically exhibit two stylized facts: comovements and unobserved heterogeneity. The comovements are important both for accounting for the long-run relationships among economic variables and for capturing short-run dynamics driven by common shocks. For example, market microstructure models are built on the principal belief that the bid and ask prices of an individual security share the same underlying true value. This assumption implies the existence of a long-run equilibrium between the bid and ask prices of each stock. Moreover, the prices of different stocks are also influenced by common factors, such as systematic risks and policy shocks. Another challenge in panel data is to control for the unobserved heterogeneity, which arises in both incidental and structural parameters. Although classical panels employ fixed effects to model individual-specific heterogeneity in the intercept, there is a lack of consensus on modeling unobserved heterogeneity in the slope parameters, despite its prevalence in economic and financial applications. This article aims to fill these voids by estimating panel error-correction models with both comovements and unobserved parameter heterogeneity. In the application, we study how bid and ask prices contribute to the underlying efficient price through the unobserved heterogeneity in the bid-ask long-run equilibrium.
In this article, we propose a class of panel ECMs in which we impose latent group structures to capture unobserved heterogeneity in the long-run cointegration matrices/vectors and use stationary common factors to capture the unobserved short-run comovements. Early work on panel ECMs assumes either completely homogeneous or completely heterogeneous long-run cointegrating relationships (see, e.g., Groen and Kleibergen 2003; Larsson and Lyhagen 2007). The homogeneous panel ECM assumes a common long-run relationship to facilitate estimation and inference, but tends to be rejected in empirical research because of misspecification. The heterogeneous panel ECM relies mainly on time-series cointegrating regressions and yields estimates with slower convergence rates than the panel analog. In contrast, we follow the lead of Su, Shi, and Phillips (2016, SSP hereafter) and assume that the cointegrating matrices exhibit certain latent group structures, which allow for flexibility in modeling the long-run relationships while preserving asymptotic efficiency in the panel data analysis. Recently, various articles have studied latent group structures in different settings; see, for example, grouped fixed effects (Bonhomme and Manresa 2015), structural breaks (Qian and Su 2016), panels with interactive fixed effects (Su and Ju 2018), nonparametric regressions (Chen 2019; Vogt and Linton 2020), nonlinear panels (Wang and Su 2021), and high-dimensional forecast combinations (Shi, Su, and Xie 2021). All of these works focus on stationary panel models. Huang, Jin, and Su (2020) and Huang et al. (2021) explore latent group patterns in single-equation cointegrated panel models without and with cross-sectional dependence, respectively. In contrast, this article extends the Classifier-Lasso (C-Lasso) method to system-equation panel ECMs, where we allow for short-run comovement, and formalizes the estimation procedure and theoretical properties.
Compared to existing work using the C-Lasso method, our primary interest is the consistent and efficient estimation of the long-run cointegration matrices, which is essential to the construction of our efficient price measure. The estimation proceeds in multiple steps. First, we note that the stationary common factors do not affect the consistency of the long-run cointegration matrix estimates but do introduce a nonnegligible bias. We thus ignore the short-run component at the first stage of estimation and directly optimize the C-Lasso objective function over the long-run cointegration matrices to obtain consistent estimators and simultaneously recover the unobserved group structures. Second, we employ the principal component (PC) method to estimate the unobserved stationary common factors and use them to construct bias-corrected post-Lasso estimators of the long-run cointegration matrices. Lastly, we employ an iterative procedure to update the estimators of both the long-run cointegration matrices and the short-run common factors until numerical convergence.
Our theoretical results mainly concern the asymptotic properties of the C-Lasso estimators of the long-run cointegration matrices, allowing for general weak serial dependence in the error terms. We establish pointwise consistency and classification consistency for our penalized GLS (PGLS)-based estimation procedure. The pointwise consistency indicates that the C-Lasso estimators of the long-run parameters preserve the usual superconsistency of nonstationary time series even in the presence of unobserved stationary common factors and weakly dependent error processes. The classification consistency indicates that all individuals are classified into the correct groups with probability approaching one (w.p.a.1). Therefore, the group-specific Lasso-type estimators can benefit from the cross-sectional information and achieve a faster convergence rate than the pure time-series estimators. We further establish the asymptotic (mixed) normal distributions of the C-Lasso estimators. These asymptotic properties are established in Huang, Jin, and Su (2020) and Huang et al. (2021) for single-equation cointegrated panel models but are not available for panel ECMs with short-run comovements. This article modifies the penalized objective function to partial out the short-run parameters in the estimation and then shows the theoretical irrelevance of these short-run parameters to the classification problem.
Based on the theoretical analysis, we propose a new data-driven measure of efficient price, which exploits the unobserved asymmetric information in the bid and ask prices. We first link the Granger partial-sum representation for nonstationary vectors in an ECM to the economic definition of efficient price in the theoretical market microstructure model of Glosten and Milgrom (1985), who suggest that the bid and ask prices share a common efficient price with the martingale property. We then propose a cointegration-based permanent-transitory decomposition to separate out this common random-walk component. With this method, we derive a new expression for the efficient price: the efficient price is a weighted average of the bid and ask prices, where the weights are constructed from the bid-ask long-run cointegration relationship (characterized by b_i^0) and estimated by the C-Lasso method. Moreover, due to the heterogeneity in the bid-ask cointegration relationships, our new measure varies across stocks via b_i^0, and the bid-ask midpoint is the special case in which the bid-ask relationship is one-to-one, viz., b_i^0 = −1. Our weighted-average measure addresses an important problem raised by Easley and O'Hara (2003): the bid-ask midpoint need not be a good proxy for the underlying efficient price. Addressing this issue requires an accurate decomposition of the observed price series into the unobserved efficient price and microstructure noise. In the presence of informed trading, this decomposition is nontrivial for two main reasons. First, few market participants can directly observe informed trading in financial markets. Second, even with access to the probability of informed trading (PIN) measure (see Easley et al. 1996), the knowledge needed to identify informed buying (selling) based on good (bad) news is limited.
Our C-Lasso method addresses the first issue by providing an effective way to detect the unobserved information-based heterogeneity in the bid-ask long-run relationship. For the second issue, the bid and ask weights of our efficient price measure intuitively reveal the extent and direction of informed trading.
Using the one-minute bid and ask quotes of S&P 1500 stocks from the New York Stock Exchange's (NYSE) Trade and Quote (TAQ) database from 2004 to 2018, we find that more than 30% of the stocks in our sample have estimated efficient prices deviating significantly from the bid-ask midpoint. Theoretically, the deviation is caused by unbalanced order flows, which reflect unobserved and valuable private information from informed traders. Furthermore, by recovering the latent group structures, we can automatically classify all the stocks in our sample into "midpoint" and "nonmidpoint" groups and examine whether the efficient prices of stocks with greater information asymmetry are more likely to deviate from the bid-ask midpoint. Consistent with this hypothesis, we find that nonmidpoint-group stocks have smaller market capitalization, higher book-to-market ratios, worse past performance, lower institutional ownership, higher return volatility, and lower liquidity. Thus, although the C-Lasso method uses a statistical rule rather than economic intuition to identify the "midpoint" and "nonmidpoint" group stocks, the classification is consistent with economically meaningful drivers.
In sum, our panel ECM with latent group structures allows us to study how unobserved heterogeneity in the bid-ask long-run equilibrium contributes to the formation of efficient prices and further identifies unobserved informed trading activities. Our article is closely related to three recent studies on the measurement of efficient prices, namely Clinet and Potiron (2019, 2021) and Hagströmer (2021). All three works demonstrate that a significant bias exists in the midpoint proxy and document it empirically for S&P 500 stocks. Our article provides an alternative solution with more flexibility: the midpoint is a special case when information asymmetry is not severe, and the deviation from the midpoint becomes significant when informed trading exists. Furthermore, both Clinet and Potiron (2021) and Hagströmer (2021) require additional intraday transactions data, which are not publicly available and suffer from missing-data issues (see O'Hara, Yao, and Ye 2014). Thus, the real-time availability and accurate interpretation of various trading-data-based measures should be a major concern for researchers using these two methods. In contrast, our study uses only the best-quoted prices to construct the efficient price measure. This allows all market participants to rely on real-time public information and minimizes potential errors in matching trades and quotes, which greatly simplifies data processing and captures latent informed trading activities in a timely manner.
The rest of the article is organized as follows. Section 2 describes the key features of our panel error-correction model and the C-Lasso estimation procedure. Section 3 presents the main assumptions and theoretical properties of our method. Section 4 constructs the efficient price measure, estimates it empirically, and provides several tests to validate the economic origins of our data-driven measure. Section 5 concludes. The supplementary materials provide the proofs of the main theoretical results, some technical lemmas used in the proofs, some additional empirical results, and some simulation results.
Notation. For any m × n real matrix A, we write its Frobenius norm and transpose as ‖A‖ and A′, respectively. Let P_A = A(A′A)^{−1}A′ and M_A = I − P_A, where A′A is of full rank and I is an identity matrix. The operator →_p denotes convergence in probability, and ⇒ denotes weak convergence. Unless indicated otherwise, (N, T) → ∞ signifies that N and T pass to infinity jointly.

Model and Estimation
In this section, we first describe our panel ECM with latent group structures and unobserved stationary common factors. Then we extend the C-Lasso estimation procedure to our framework to consistently estimate the long-run cointegration parameters.

A Heterogeneous Panel Error-correction Model with Latent Group Structures
We consider a panel dataset consisting of N cross-section units (individuals) over T time periods. For each i = 1, . . . , N, the J-dimensional vector y_it is generated by

y_it = s_it + ψ_i^{0′} f_t^0,  (2.1)

where s_it is a J × 1 vector of unobserved idiosyncratic nonstationary I(1) components, f_t^0 is an m × 1 vector of unobserved stationary I(0) common factors, and ψ_i^0 is an m × J matrix of factor loadings. For each i = 1, . . . , N, we assume that s_it satisfies an ECM:

Δs_it = α_i^0 β_i^{0′} s_{i,t−1} + Σ_{l=1}^{p−1} Γ_il^0 Δs_{i,t−l} + ε_it,  (2.2)

where β_i^0 is a J × r matrix of long-run cointegration vectors, α_i^0 is a J × r matrix of short-run adjustment parameters that describe how s_it adjusts to the long-run equilibrium relationship β_i^{0′} s_{i,t−1}, r is the cointegration rank, Γ_il^0 is a J × J full-rank matrix of short-run dynamics parameters, and ε_it is an idiosyncratic error term with zero mean and finite variance. If r = J, s_it is itself stationary, so no cointegration arises among y_it; if r = 0, model (2.2) reduces to a stationary VAR(p − 1) model in the first-differenced data Δs_it. Since our focus is on the reduced-rank case with long-run cointegration relationships, it is appropriate to assume 1 ≤ r < J.
By (2.1) and (2.2), the observed process y_it follows a heterogeneous panel ECM with unobserved dynamic common factors for all i = 1, . . . , N. Collecting f_t^0 and its lags into a static factor, model (2.3) can be rewritten as

Δy_it = α_i^0 β_i^{0′} y_{i,t−1} + Σ_{l=1}^{p−1} Γ_il^0 Δy_{i,t−l} + Λ_i^{0′} F_t^0 + ε_it,  (2.4)

where the static factor F_t stacks f_t^0 and its first p lags, so that its dimension is M × 1 with M = m(p + 1), and Λ_i^0 collects the implied loadings. Model (2.4) is the key model to be studied and can be regarded as the data generating process (DGP). Note that α_i^0 β_i^{0′} = (α_i^0 H)(β_i^0 H^{−1′})′ for any nonsingular r × r matrix H, so α_i^0 and β_i^0 are not separately identified without imposing r² identification conditions. A conventional way is to assume that the leading r × r submatrix of the cointegrating matrix β_i^0 is an identity matrix, such that β_i^0 = (I_r, b_i^{0′})′, where I_r is an r-dimensional identity matrix and b_i^0 is left unspecified. Following the lead of SSP (2016) and to keep the panel ECM parsimonious, we assume that the b_i^0's exhibit the following latent group structures:

b_i^0 = Σ_{k=1}^K B_k^0 · 1{i ∈ G_k^0},  (2.5)

where {G_1^0, . . . , G_K^0} forms a partition of {1, . . . , N} and N_k denotes the cardinality of the set G_k^0. When K is fixed and N_k/N → c_k ∈ (0, 1) for each k, we will show that the long-run parameters B_k^0 can be estimated at the √(NT)-rate, which is faster than the usual T-rate for time-series cointegrating regression estimates. Note that we do not impose any structure on the short-run parameters α_i^0.

Remark.
We focus on the above data generating process (2.4) to develop the estimation procedure and asymptotic properties. This panel ECM has the following key features: (a) β_i^0 is a long-run cointegration matrix summarizing the long-run comovement among the J nonstationary variables in y_it, and (b) F_t^0 is a vector of unobserved stationary common factors, which represents the sources of short-run comovement across individuals due to common shocks. The stationarity of F_t^0 is essential to ensure that the two sides of equation (2.4) are both stationary when β_i^0 is a cointegrating matrix. To maintain parameter parsimony while allowing for a certain degree of parameter heterogeneity, we assume that the long-run cointegration matrices β_i^0 exhibit the unobserved group patterns in (2.5): they are heterogeneous across groups and homogeneous within a group. Our interest is to infer the latent group identities and to obtain efficient estimators of the long-run cointegration matrices, which are the key to our efficient price construction.
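To fix ideas, the grouped DGP in (2.1)-(2.5) can be sketched in a small simulation. Everything below (dimensions, the adjustment vector, noise scales, and group values) is an illustrative choice, not the paper's design:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, J, r = 6, 200, 2, 1           # small panel: J=2 series per unit, rank-1 cointegration
K = 2                               # two latent groups
B0 = {0: -1.0, 1: -0.6}             # group-specific b_i^0 (scalar since J - r = 1)
membership = rng.integers(0, K, N)  # latent group labels

def simulate_unit(b, T, rng):
    """Simulate s_it from a minimal ECM: ds_t = alpha * (beta' s_{t-1}) + eps_t, beta = (1, b)'."""
    beta = np.array([1.0, b])
    alpha = np.array([-0.2, 0.2])   # chosen so that 1 + beta'alpha lies in (0, 1): stable adjustment
    s = np.zeros((T, 2))
    for t in range(1, T):
        ds = alpha * (beta @ s[t - 1]) + rng.normal(0, 0.1, 2)
        s[t] = s[t - 1] + ds
    return s

f = rng.normal(0, 1, T)             # stationary common factor
Y = np.empty((N, T, J))
for i in range(N):
    s = simulate_unit(B0[membership[i]], T, rng)
    psi = rng.normal(0, 0.5, J)     # factor loadings
    Y[i] = s + np.outer(f, psi)     # y_it = s_it + psi_i' f_t, as in (2.1)

# the error-correction term beta_i^0' y_it is stationary, while y_it itself drifts
z = Y[0] @ np.array([1.0, B0[membership[0]]])
print(z.std() < Y[0, :, 0].std())
```

With the stable adjustment above, β_i^{0′} s_it follows a stationary AR(1), so the error-correction term has a much tighter spread than the drifting level series.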

The C-Lasso Estimation Method
In this section, we propose two Lasso-type estimators of the long-run cointegration matrices b_i^0 and B_k^0, namely the C-Lasso and post-Lasso estimators. We impose the latent group structures on the long-run cointegration matrices. Throughout the article, we assume that the number of groups, K, is known and fixed, but each individual's group membership is unknown. Empirically, we can either rely on prior economic beliefs to set the number of groups or employ an information criterion to determine it consistently, as in Huang et al. (2021).
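When no economic prior pins down K, a BIC-type information criterion can be computed over a grid of candidate values; the sketch below is generic, with a penalty rate borrowed from the factor-model literature purely for illustration (it is not the tuning used in the article) and hypothetical SSE values:

```python
import numpy as np

def info_criterion(sse, K, N, T, n_params):
    """BIC-type criterion for the number of groups: log fit plus a penalty
    increasing in K. The penalty rate ((N+T)/(NT)) * ln(NT) is an illustrative
    choice in the spirit of Bai and Ng (2002), not the paper's tuning."""
    rho = ((N + T) / (N * T)) * np.log(N * T)
    return np.log(sse / (N * T)) + rho * K * n_params

# hypothetical SSE path: a large drop up to the true K, then roughly flat
N, T, n_params = 100, 500, 1
sse_by_K = {1: 80.0, 2: 41.0, 3: 40.5, 4: 40.2}
ic = {K: info_criterion(s, K, N, T, n_params) for K, s in sse_by_K.items()}
K_hat = min(ic, key=ic.get)
print(K_hat)   # -> 2: the fit stops improving after two groups
```

The criterion trades off fit against the number of group-specific parameter blocks, so it selects the point where adding groups stops reducing the residual sum of squares appreciably.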
In what follows, we describe the estimation procedure in four steps.
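Before detailing the steps, it may help to see the penalty that drives the classification in Step 2. The sketch below evaluates only the additive-multiplicative penalty term in the form of SSP (2016), (λ/N) Σ_i Π_k ‖b_i − B_k‖, with made-up coefficients and group centers; the GLS fit term is omitted:

```python
import numpy as np

def classo_penalty(b, B, lam):
    """Additive-multiplicative C-Lasso penalty (SSP 2016 form):
    (lam / N) * sum_i prod_k || b_i - B_k ||.
    The product over k is zero iff b_i coincides with some group center B_k,
    so minimization shrinks each b_i toward one of the K group values."""
    N = b.shape[0]
    dists = np.linalg.norm(b[:, None, :] - B[None, :, :], axis=2)  # N x K distances
    return (lam / N) * np.prod(dists, axis=1).sum()

b = np.array([[-1.0], [-0.6], [-0.95]])   # individual long-run coefficients (illustrative)
B = np.array([[-1.0], [-0.6]])            # candidate group centers
print(classo_penalty(b, B, lam=1.0))      # only unit 3 contributes: |b_3 - B_1| * |b_3 - B_2|
```

Units whose coefficient already sits on a group center contribute nothing, which is what produces the classification effect of the penalty.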
Step 1: Initial GLS estimation. For notational simplicity, we assume that the number of lags p is 1; the general case with p > 1 involves no fundamentally new ideas, only more complicated notation. When p = 1, model (2.4) becomes

Δy_it = α_i^0 β_i^{0′} y_{i,t−1} + Λ_i^{0′} F_t^0 + ε_it.  (2.6)

In what follows we obtain prior estimates of the β_i^0's via generalized least squares (GLS). To introduce the estimates, we consider the triangular-system restrictions as in Phillips (1991), using the identification condition β_i^0 = (I_r, b_i^{0′})′ and assuming that the cointegrating rank r is known. Let y_it = (y_it^{(1)′}, y_it^{(2)′})′, where y_it^{(1)} and y_it^{(2)} are the r × 1 and (J − r) × 1 subvectors of y_it, respectively. Due to a nonnegligible bias in the estimate of α_i^0, we need a reparameterization. After the reparameterization, model (2.6) can be rewritten in a form whose regressors and errors are weakly stationary processes, using the fact that vec(ABC) = (C′ ⊗ A)vec(B) for any conformable matrices A, B, and C, with vec denoting the usual vectorization operator.
Due to the nonnegligible bias in the short-run parameter estimates, we premultiply both sides of (2.7) by γ̂_i to obtain (2.8). Equations (D.2) and (D.3) in the supplementary materials give the infeasible least squares (LS) and GLS estimators of b_i^0 for this model. The feasible GLS estimator of b_i is obtained by replacing the unknown quantities with Johansen's (1991) estimates of α_i^0 and β_i^0 based on individual time-series regressions. Writing (2.8) in matrix form with the estimated values yields the feasible GLS estimator β̂_i directly. The initial estimate b̂_i^GLS is a purely time-series estimator, which is consistent but inefficient and suffers from a nonnegligible bias in the presence of weakly dependent error processes. In the following step, we employ the cross-sectional information to improve estimation efficiency and allow for unobserved parameter heterogeneity via the latent group structures.
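As a bare-bones illustration of the Step-1 idea—recovering b_i^0 from the triangular system by regressing one price component on the other—the following sketch deliberately ignores the GLS weighting, the common factors, and all serial-correlation corrections (the data-generating numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
T, b_true = 2000, -0.8            # beta_i^0 = (1, b_i^0)', so y1 tracks -b_i^0 * y2

# a cointegrated pair: y2 is a random walk, y1 deviates from -b*y2 only transiently
y2 = np.cumsum(rng.normal(0, 1, T))
y1 = -b_true * y2 + rng.normal(0, 0.5, T)   # stationary deviation

# triangular-system LS (Phillips 1991): the slope of y1 on y2 estimates -b_i^0
slope = (y2 @ y1) / (y2 @ y2)
b_hat = -slope
print(abs(b_hat - b_true))        # tiny: the error shrinks at rate T (superconsistency)
```

Because the regressor is I(1), the estimation error vanishes at rate T rather than √T, which is the superconsistency property the article relies on throughout.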
Step 2: Penalized GLS estimation. In this step, we propose a penalized GLS (PGLS, hereafter) method to estimate the group-specific cointegration matrices and identify the group memberships. Since the b_i^0's exhibit latent group structures, we propose the following PGLS criterion function:

Q_{NT}^λ(b, B) = Q_{NT}(b) + (λ/N) Σ_{i=1}^N Π_{k=1}^K ‖b_i − B_k‖,  (2.11)

where Q_{NT}(b) is the GLS objective function and λ ≡ λ_{N,T} is a tuning parameter. Minimizing the PGLS criterion function in (2.11) produces the Classifier-Lasso estimators b̂_i and B̂_k of b_i^0 and B_k^0, respectively. The PGLS criterion comprises two components: the GLS objective function and the additive-multiplicative penalty term on the long-run cointegration parameters. Using the GLS objective function has three advantages in terms of both finite-sample performance and asymptotic properties. First, Johansen's maximum likelihood estimator (MLE) tends to perform poorly and can even produce implausible estimates in small samples, whereas the GLS estimator is much more reliable (see Phillips 1994). Second, the GLS objective function employs the triangular-system restriction to separate the matrix of long-run cointegration vectors from the short-run parameters, which simplifies the asymptotic analysis of the long-run estimators. Third, the PGLS-based C-Lasso estimator achieves oracle efficiency from the latent group structures and, compared to the time-series MLEs, reduces the number of long-run parameter matrices to be estimated from N to K.
Step 3: Principal component analysis estimator of the unobserved factors.
The residual from the long-run regression has a pure factor structure. Following Bai and Ng (2002), we obtain consistent estimators of F^0 by solving an eigenvalue decomposition problem: F̂ collects √T times the eigenvectors corresponding to the Mr largest eigenvalues of the sample moment matrix in (2.12), and V_{NT} is a diagonal matrix consisting of these eigenvalues arranged in decreasing order. We impose the normalization conditions that F̂′F̂/T be an identity matrix and that vec(Λ̃^0)′vec(Λ̃^0) be diagonal with diagonal elements arranged in descending order. These two conditions in (2.13) uniquely determine Λ̃^0 and F^0.
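A minimal version of the Bai-Ng principal-component step can be written down directly; here it is applied to a generic T × N panel U rather than the paper's particular defactored residuals, and the simulated one-factor design is purely illustrative:

```python
import numpy as np

def pc_factors(U, M):
    """Principal-component estimator of M stationary factors (Bai and Ng 2002):
    F_hat = sqrt(T) times the top-M eigenvectors of U U' / (N T),
    normalized so that F_hat' F_hat / T = I_M."""
    T, N = U.shape
    vals, vecs = np.linalg.eigh(U @ U.T / (N * T))
    order = np.argsort(vals)[::-1][:M]        # eigh returns ascending eigenvalues
    F_hat = np.sqrt(T) * vecs[:, order]
    Lam_hat = U.T @ F_hat / T                 # loadings via cross-sectional regression
    return F_hat, Lam_hat

# usage on a simulated one-factor panel
rng = np.random.default_rng(2)
T, N = 300, 50
f = rng.normal(0, 1, (T, 1))
lam = rng.normal(0, 1, (N, 1))
U = f @ lam.T + 0.3 * rng.normal(0, 1, (T, N))
F_hat, _ = pc_factors(U, M=1)
corr = np.corrcoef(F_hat[:, 0], f[:, 0])[0, 1]
print(abs(corr))   # the estimated factor spans the true one, up to sign
```

The factor is identified only up to a rotation (here, a sign), which is why agreement is checked through the absolute correlation rather than the raw series.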
Step 4: The (bias-corrected) post-Lasso estimators.
Case 1: Independent error terms. In the simplest case, the error process ε_it and the common factors F_t^0 are independent across t, so there is no bias in the long-run cointegration estimators b̂_i and B̂_k. Given the estimated group identities {Ĝ_k, k = 1, . . . , K}, we can directly pool the observations within each estimated group to obtain the post-Lasso estimator in (2.14).
Case 2: Weakly dependent error terms. In this case, the error process ε_it and the stationary common factors F_t are weakly dependent across t, which is modeled by a linear process (see Assumption 3.1). We follow Huang, Jin, and Su (2020) and employ dynamic OLS for bias correction. Let Ẑ_it denote a collection of the lags of y_{i,t−1}^{(2)} and the estimated stationary common factor F̂_t, where the number of lags is l = T^{1/4}. We then update the post-Lasso estimators to obtain the bias-corrected version in (2.15).
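A sketch of the Case-1 pooled post-Lasso step for J = 2 and r = 1, reusing the triangular regression idea: within each estimated group, members' observations are stacked so that the group coefficient is estimated from N_k · T observations. The group labels, sample sizes, and DGP below are all illustrative:

```python
import numpy as np

def post_lasso(Y1, Y2, groups, K):
    """Case-1 pooled post-Lasso sketch (no bias correction): within each
    estimated group, stack all members' series and run one pooled triangular
    regression of y^(1) on y^(2); the negative slope estimates B_k."""
    B_hat = np.empty(K)
    for k in range(K):
        idx = np.where(groups == k)[0]
        y1 = np.concatenate([Y1[i] for i in idx])
        y2 = np.concatenate([Y2[i] for i in idx])
        B_hat[k] = -(y2 @ y1) / (y2 @ y2)
    return B_hat

rng = np.random.default_rng(3)
N, T = 10, 500
groups = np.array([0] * 5 + [1] * 5)          # pretend these are the C-Lasso labels
b_vals = np.array([-1.0, -0.5])               # group-specific b^0 (illustrative)
Y2 = np.cumsum(rng.normal(0, 1, (N, T)), axis=1)           # random-walk components
Y1 = -b_vals[groups][:, None] * Y2 + rng.normal(0, 0.5, (N, T))
print(post_lasso(Y1, Y2, groups, K=2))        # close to (-1.0, -0.5)
```

Pooling within a group is what delivers the √(N_k T)-rate discussed in Section 3: each group coefficient is estimated from all member series at once rather than one series at a time.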

Asymptotic Results
In this section, we study the asymptotic properties of the estimators. Without loss of generality, we assume that F_t^0 has zero mean. Let w_it = (ε_it′, F_t^{0′})′, and let C = σ(F^0, Λ^0) be the sigma-algebra generated by the factors F^0 and the loadings Λ^0.
Assumption 3.1. (i) For each i, w_it = φ_i(L)e_it, where e_it is a (J + M) × 1 vector sequence of iid random variables with zero mean and variance matrix I_{J+M}, and max_{1≤i≤N} max_{1≤t≤T} E(‖e_it‖^{2q+ϵ}) < ∞ for some q > 4 and an arbitrarily small positive constant ϵ. (ii) Decompose e_it = (e_it^{ε′}, e_t^{F′})′; e_it^ε and e_t^F are mutually independent, and the e_it^ε are independent across i conditional on C. (iii) ψ_i^0 is independent of e_jt for all i, j, and t. (iv) The ε_it are cross-sectionally independent conditional on C.
Assumption 3.1(i)-(ii) impose that the innovation process {w_it} is a linear process satisfying certain moment and summability conditions. By Phillips and Solo (1992), the finite 2q + ϵ moments (q > 4) of e_it ensure the validity of the law of large numbers (LLN) and the functional central limit theorem for the partial-sum processes of w_it. In our asymptotic analysis, we frequently call upon the Beveridge-Nelson decomposition, which indicates that w_it has a Wold representation and behaves as a stationary process (Phillips and Solo 1992, p. 973). In our case, we need stronger conditions to ensure uniform behavior across i. For later reference, we partition φ_i(L) conformably with w_it = (ε_it′, F_t^{0′})′. We set φ_i^{εF}(L) = 0 to ensure that ε_it is cross-sectionally independent conditional on C. Assumption 3.1(iii) ensures that the factor loadings are independent of the innovation processes over both the time-series and cross-section dimensions. Assumption 3.1(iv) emphasizes that cross-sectional dependence comes only from the unobserved common stationary factors: the idiosyncratic error terms ε_it are cross-sectionally independent given the information on the factors and factor loadings.
Assumption 3.2 gives conditions that are standard in error-correction models with reduced-rank restrictions. Assumption 3.2(ii) rules out the case of no cointegration among y_it. Assumption 3.2(iii) ensures that the matrix β_i^{0′}α_i^0 has full rank for each individual i and that β_i^{0′}y_it is a stationary process with a Wold representation.
Assumption 3.3(i) implies that each group has an asymptotically nonnegligible number of individuals as N → ∞. Assumption 3.3(ii) requires perfect separability of the group-specific parameters, and similar conditions are assumed in the panel literature with latent group patterns (see Bonhomme and Manresa 2015; SSP 2016). On the one hand, this assumption is essential for establishing the classification consistency results, which provide a good approximation for inference. On the other hand, it precludes a subsequent post-classification inference that is uniformly valid over sequences of models in which the groups are separated, but by a margin small enough that they cannot be perfectly separated relative to the sampling variation. In addition, the approximation may be poor in settings where individuals are not perfectly classified in finite samples. Recently, Bonhomme, Lamadon, and Manresa (2021) and Freeman and Weidner (2021) remove this assumption and allow the unobserved heterogeneity in fixed effects to be not fully discrete but a function of a low-dimensional continuous latent type. In contrast, we focus on discrete unobserved heterogeneity in the slope coefficients and maintain the perfect group structures. Assumption 3.3(iii)-(iv) impose conditions controlling the rates at which N and T pass to infinity, which are important for the proof of uniform classification consistency. In particular, they require that T pass to infinity at a rate faster than N^{1/2} but slower than N^2. The term ι_T arises from the use of the law of the iterated logarithm (LIL). We can show that the set of values of λ satisfying Assumption 3.3(iv) is nonempty. The consistency of the initial estimators b̂_i^GLS and F̂ is ensured by the following theorem.
Theorem 3.1. Suppose that Assumptions 3.1-3.2 hold. Then the initial estimators α̂_i, b̂_i^GLS, and F̂ are consistent.

Theorem 3.1(i)-(ii) establishes the pointwise consistency of the estimators of the short-run adjustment matrix α_i^0 and the long-run cointegration vector b_i^0. We summarize some key findings. First, the estimator of the short-run adjustment matrix is inconsistent for the true value α_i^0 when Λ_{uv,i}(1) ≠ 0; instead, α̂_i is consistent for the pseudo-true value ᾱ_i = α_i^0 + Λ_{uv,i}(1)Σ_{vv,i}^{−1}, where Λ_{uv,i}(1) arises from the serial correlation and endogeneity in the innovation processes of ε_it and F_t^0. Under an iid assumption, the first part of Theorem 3.1(i) reduces to the case in Johansen (1991): α̂_i − α_i^0 = O_p(T^{−1/2}). Second, despite the weak dependence, we still obtain superconsistency for the long-run cointegration matrix estimator. This GLS estimator is similar to the two-step parametric estimator of Breitung (2005), who focuses on the iid case. Based on the convergence rate of b̂_i^GLS, we can show that the spaces spanned by the columns of F̂ and F^0 are asymptotically the same.
Next, we show the preliminary rates of convergence for the PGLS-based estimates b̂_i and B̂_k.

Theorem 3.2. Suppose that Assumptions 3.1-3.3 hold. Then (i) b̂_i is pointwise consistent for b_i^0, (ii) b̂_i is mean-square consistent, and (iii) B̂_k is consistent for B_k^0 up to relabeling of the groups.
Theorem 3.2(i)-(ii) establishes the pointwise and mean-square consistency of the long-run cointegration matrix estimator b̂_i. Theorem 3.2(iii) indicates that vec(B̂_k) consistently estimates the true group-specific coefficient vec(B_k^0) up to relabeling of the groups. We note that the pointwise convergence rate of b̂_i depends on λ, but the mean-square convergence rate of b̂_i and the convergence rate of B̂_k do not. In general, both the pointwise and mean-square consistency rely on the correct choice of the number of groups K. We illustrate the impact of using a wrong value of K in two cases. First, when the chosen K is smaller than the true K_0 (i.e., K < K_0), the model is misspecified, and the C-Lasso estimators b̂_i and B̂_k are inconsistent. Second, when the chosen K is larger than the true K_0 (i.e., K > K_0), the group-specific slope parameters are still the same within each group, so the slope estimators remain consistent but lose some efficiency because fewer cross-sectional units are pooled. In the extreme case where the number of groups equals the cross-section dimension (i.e., K = N), the group-specific estimators reduce to the time-series estimators of the long-run cointegration vectors, which are still T-consistent.
For simplicity, we write B̂_(k) as B̂_k and define the estimated groups Ĝ_k accordingly. For a rigorous statement of classification consistency, we define the following sequences of events: Ê_{kNT,i} = {i ∉ Ĝ_k | i ∈ G_k^0} and F̂_{kNT,i} = {i ∉ G_k^0 | i ∈ Ĝ_k}, where i = 1, . . . , N and k = 1, . . . , K. Let Ê_{kNT} = ∪_{i∈G_k^0} Ê_{kNT,i} and F̂_{kNT} = ∪_{i∈Ĝ_k} F̂_{kNT,i}. The events Ê_{kNT} and F̂_{kNT} mimic Type I and Type II errors in statistical tests: Ê_{kNT} denotes the error event of not classifying an element of G_k^0 into the estimated group Ĝ_k, and F̂_{kNT} denotes the error event of classifying an element that does not belong to G_k^0 into the estimated group Ĝ_k. We adopt the following definition to investigate the asymptotic properties of the classification.
The next theorem establishes the uniform consistency of the C-Lasso classification.
Theorem 3.3. Suppose that Assumptions 3.1-3.3 hold. Then P(∪_{k=1}^K Ê_{kNT}) → 0 and P(∪_{k=1}^K F̂_{kNT}) → 0 as (N, T) → ∞.

Theorem 3.3 states that the probability that at least one of the events Ê_{kNT} or F̂_{kNT} happens approaches zero. Since Ê_{kNT} and F̂_{kNT} mimic the Type I and Type II classification error events in group k, Theorem 3.3 implies uniform classification consistency: all individuals within a certain group, say G_k^0, are simultaneously and correctly classified into the same estimated group (denoted Ĝ_k) w.p.a.1, and all individuals classified into a group, say Ĝ_k, indeed belong to the same group G_k^0 w.p.a.1. In other words, under Assumptions 3.1-3.3, the misclassification errors are asymptotically negligible, which is the key to subsequent post-classification estimation and inference. The uniform classification consistency ensures that the post-classification estimators enjoy the same asymptotic properties as the oracle ones obtained with knowledge of the group identities of all individuals. It does not, however, deliver robustness or uniform validity of the post-classification inference.
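Since the error events are defined only up to a relabeling of the estimated groups, counting finite-sample misclassifications in practice requires matching labels first. A small utility illustrating this (brute force over label permutations, so suitable only for small K; the example labels are made up):

```python
import numpy as np
from itertools import permutations

def classification_errors(true_g, est_g, K):
    """Count the two error events mimicking Theorem 3.3, after the relabeling
    of estimated groups that best matches the truth:
      E: units of some G0_k not classified into the matched G_hat_k,
      F: units classified into G_hat_k that do not belong to G0_k."""
    best = None
    for perm in permutations(range(K)):          # brute force: fine for small K
        relab = np.array([perm[g] for g in est_g])
        E = int(sum(np.sum((true_g == k) & (relab != k)) for k in range(K)))
        F = int(sum(np.sum((relab == k) & (true_g != k)) for k in range(K)))
        if best is None or E + F < best[0] + best[1]:
            best = (E, F)
    return best

true_g = np.array([0, 0, 0, 1, 1, 1])
est_g = np.array([1, 1, 1, 0, 0, 1])   # correct up to relabeling, except the last unit
print(classification_errors(true_g, est_g, K=2))   # -> (1, 1)
```

One misplaced unit shows up on both sides of the count: it is missing from its true group (an E-type error) and present in the wrong estimated group (an F-type error).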
Next, we study the oracle properties of the PGLS-based estimation method. Given the estimated groups {Ĝ_k, k = 1, . . . , K}, we can readily pool the observations within each estimated group to obtain the post-Lasso estimators in (2.14) and (2.15), with or without bias correction. When the group identity of each individual is known, the oracle estimators are obtained by replacing i ∈ Ĝ_k in (2.14) and (2.15) with i ∈ G_k^0. For example, when we do not consider bias correction due to endogeneity or the presence of common factors, we have the oracle estimator B̂_k^oracle. The oracle property refers to the fact that the Lasso-type estimators are asymptotically equivalent to the infeasible estimator vec(B̂_k^oracle), which could be obtained if one knew all individuals' group identities. In the following theorem, we establish the oracle property of the PGLS-based C-Lasso estimators and their post-Lasso versions. Let Q_{k,NT} denote the normalized sample second-moment matrix associated with group k.

Theorem 3.4. Suppose that Assumptions 3.1-3.3 hold. Let B̂_k^post and B̂_k^{post,bc} be as defined in (2.14) and (2.15). Then for k = 1, . . . , K, as (N, T) → ∞, these estimators are asymptotically mixed normal.

Theorem 3.4 establishes the asymptotic mixed normality of the two Lasso-type estimators of the long-run cointegration vectors. The asymptotic variance of either B̂_k or B̂_k^post is the same as that of the oracle estimator B̂_k^oracle. This oracle property is built on Theorem 3.3, which ensures that the misclassification errors have asymptotically negligible impacts on the post-Lasso estimators. After bias correction, the post-Lasso estimator enjoys the √(N_k T)-rate of consistency, which is faster than the usual T-rate of consistency of time-series ECM estimators. Given the limiting distribution in Theorem 3.4, one can make inferences as if the true group identities were known.
Note that there is an asymptotically nonnegligible bias term (Q_k^{-1} B_k,NT) in general for the post-Lasso estimator B̂_k^post as long as we allow for endogeneity or the presence of common factors in the panel ECM. Such a bias term can be corrected via the usual dynamic OLS procedure to obtain the bias-corrected post-Lasso estimator B̂_k^post,bc (see Step 4 in Section 2.2). To make inferences, it suffices to estimate the latter's asymptotic variance via the estimation of the long-run variance matrix and Q_k. The former can be estimated as in Huang, Jin, and Su (2020). Similarly, Q_k can be consistently estimated by Q_k,NT with G_k^0 and N_k replaced by their estimates Ĝ_k and N̂_k.
The above inference results largely hinge on the perfect parameter separability condition in Assumption 3.3(ii). So the parameters are essentially fixed in our framework and we have not addressed the uniform inference issue. For this reason, the above pointwise asymptotic distribution may provide a poor approximation to the finite sample distribution in the presence of misclassification. As is well known in the literature, post-selection or post-classification inferences are usually not uniformly valid; see, for example, Leeb and Pötscher (2005). This is also the case for our post-classification inference. Despite its importance, it is beyond the scope of this article to provide a thorough theory on the uniform inference.

Efficient Price in the Market Microstructure
In this section, we construct a new data-driven measure of efficient price and validate its economic foundations through several tests.

Efficient Prices in Market Microstructure
Market microstructure theory (e.g., Glosten and Milgrom 1985) suggests that the bid and ask prices share a common efficient price with the martingale property. In econometric terms, the bid and ask prices are not only nonstationary but also cointegrated, as they contain a common random-walk component. We therefore analyze the 2 × 1 security price vector p_it using the model in (2.4) with y_it = p_it, where p_it = (p_it^bid, p_it^ask)' stacks the ith stock's bid and ask (log-) prices at time t, β_i^0 is a 2 × 1 long-run cointegration vector, α_i^0 is a 2 × 1 vector of short-run adjustment parameters, Γ_il^0 is a 2 × 2 full-rank matrix of short-run dynamics parameters, Λ_i^0 F_t^0 summarizes an unobserved short-run comovement component among different stocks, and L^0 controls the lag order. This econometric specification allows us to capture the long-run equilibrium between the bid and ask prices and the short-run dynamics around the equilibrium. Instead of the short-run parameters α_i^0 that the literature focuses on, the long-run cointegration vector β_i^0 is the key to our analysis. Heterogeneity in the Bid-ask Long-run Equilibrium. The error-correction term β_i^0' p_it−1 is stationary because β_i^0 gives a linear combination of the bid and ask quotes that cancels the common random-walk component. To identify α_i^0 and β_i^0 in (4.1), we only need to impose one restriction, so we follow the literature and assume β_i^0 = (1, b_i^0)'; that is, the first element of β_i^0 is normalized to be 1. The error-correction term is then β_i^0' p_it = p_it^bid + b_i^0 p_it^ask, and the bid-ask spread is given by p_it^ask − p_it^bid = (1 + b_i^0) p_it^ask − β_i^0' p_it. This equation shows that the bid-ask spread can be decomposed into two components: (a) a permanent component driven by a fraction of the quoted prices, and (b) a transitory component, which is the error-correction term.
This econometric formulation of the bid-ask spread is consistent with theoretical microstructure hypotheses, in which the bid-ask spread incorporates both the microstructure frictions induced by transaction and inventory costs and the information effects arising from the adverse selection problem. The latter arises when informed traders know more about future values than market makers do. Adverse selection risk is one of the most important factors that influence trading: informed traders select the side of the market on which to trade to the disadvantage of market makers, so, when facing traders with private information, less informed market makers enlarge the spread to compensate for this risk. Considering the impacts of the bid-ask spread on asset prices, and the fact that microstructure frictions, the transitory component of the bid-ask spread, have only second-order effects (see Vayanos 1998), the private information in the order flow should be impounded in prices and lead the bid-ask spread to be associated with future asset returns. The adverse selection component in the spread therefore has a permanent effect on asset prices. When the bid-ask relationship is one-to-one, so that b_i^0 = −1, Equation (4.2) becomes an identity and the bid-ask spread includes only a transitory term given by the error-correction component. This error-correction term has already removed the common random-walk component in the bid and ask quotes; the permanent component, the part that compensates market makers for losses when trading with well-informed traders, no longer exists. In this case, the bid-ask spread includes only a transitory component and cannot affect the dynamics of asset prices. However, this contradicts the empirical findings of Amihud and Mendelson (1986), who find that expected asset returns are an increasing function of bid-ask spreads. Therefore, the deviation of b_i^0 from −1 reflects the existence of informed trading.
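A minimal simulation, with made-up parameters and not the paper's estimator, can illustrate this decomposition: when b deviates from −1, the spread inherits part of the random-walk efficient price, while the error-correction term stays stationary.

```python
import numpy as np

def simulate_quotes(b, T=20000, seed=0):
    """Simulate (log) bid/ask quotes sharing one random-walk efficient price.

    With cointegration vector (1, b)', setting p_bid = -b*m + noise and
    p_ask = m + noise makes p_bid + b*p_ask stationary, while the spread
    p_ask - p_bid = (1 + b)*m + noise keeps a permanent part unless b = -1.
    All parameter values are illustrative.
    """
    rng = np.random.default_rng(seed)
    m = np.cumsum(rng.normal(size=T))          # random-walk efficient price
    p_bid = -b * m + rng.normal(scale=0.5, size=T)
    p_ask = m + rng.normal(scale=0.5, size=T)
    return p_bid, p_ask

p_bid, p_ask = simulate_quotes(b=-0.8)         # b != -1: informed trading
ecm = p_bid + (-0.8) * p_ask                   # error-correction term, stationary
spread = p_ask - p_bid                         # carries (1 - 0.8) * m: permanent
```

Comparing the sample variance of `spread` here with the b = −1 case makes the permanent component visible: only in the latter is the spread purely transitory.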
In addition, for firms with more informed trading, b_i^0 deviates further from −1.
To identify the deviations of b_i^0 from −1, we need to obtain consistent estimators of the long-run cointegration vectors β_i^0 from the quoted prices. Because the bid-ask relationships are heterogeneous across stocks, reflecting various unobserved firm-specific informed trading activities, the problem of unobserved parameter heterogeneity has to be addressed. To use the latent information contained in the long-run cointegration vectors β_i^0 to better understand the components of the bid-ask spread, we need an effective way to estimate them.
Why Group Patterns Help. Classical panels either assume fully heterogeneous parameters or a common slope coefficient for all individuals. In the fully heterogeneous case, we can apply purely time-series regressions to each stock's price vector p_it; that is, we estimate the long-run parameters β_i^0 individually but fail to exploit the cross-sectional information, which leads to inefficient estimates. In the common slope case, a pooled panel estimation achieves efficient estimation but imposes a strong prior belief on the parameter space and ignores firm-specific heterogeneity. Therefore, we take an intermediate approach: we assume that the long-run cointegration vectors β_i^0 exhibit certain unobserved group patterns. The key insight is that we can impose a certain type of parameter sparsity to maintain estimation efficiency while allowing a limited degree of parameter heterogeneity.
The group patterns are typically unobserved, and they may contain economic-theory-based information. When estimating the efficient price from bid and ask quotes, the unobserved heterogeneity patterns in the long-run coefficients β_i^0 contain the private information of informed traders, which is not publicly available. Ex-ante subsample analyses cannot work if the patterns in the parameter space are not driven by the designated observed characteristics at all, which calls for new estimation techniques. The difficulty posed by the unobserved group patterns in the long-run cointegration matrices leads us to employ the novel C-Lasso method. The C-Lasso offers an effective way to jointly estimate the group-specific long-run cointegration parameters and identify the unknown group memberships.
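To convey the flavor of recovering latent groups, here is a deliberately simplified stand-in: cluster rough first-stage slope estimates with a plain 1-D k-means. The actual C-Lasso instead classifies and estimates jointly inside a penalized criterion; this toy version only illustrates the idea of latent group recovery, and all names are ours.

```python
import numpy as np

def simple_kmeans_1d(values, K, iters=50, seed=0):
    """Plain 1-D k-means on preliminary slope estimates.

    Unlike the C-Lasso, which shrinks each individual estimate toward one of
    K group-level values inside the estimation criterion, this clusters
    noisy first-stage estimates ex post.
    """
    rng = np.random.default_rng(seed)
    centers = rng.choice(values, size=K, replace=False)
    labels = np.zeros(len(values), dtype=int)
    for _ in range(iters):
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = values[labels == k].mean()
    return labels, centers
```

Even this crude two-step scheme shows why grouping helps: once memberships are (correctly) recovered, the group mean averages away the idiosyncratic noise in the individual estimates.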

Construction of Efficient Price Measure
Based on the PGLS-based method, we can not only consistently and efficiently estimate the long-run cointegration vectors but also detect the unobserved group patterns in the bid-ask long-run cointegration relationships, which are essential for constructing the efficient price. In this part, we demonstrate how to construct our data-driven efficient price measure. We first investigate the Granger partial sum representation of quoted prices, which helps to formalize the definition of "efficient price." Then we propose a cointegration-based (β-based) PT decomposition to derive the efficient price measure, which is a weighted average of bid and ask prices. We also demonstrate the advantages of the cointegration-based PT decomposition.
Granger Partial Sum Representation. The data generating process of the one-minute quoted price vector p_it follows the panel ECM in (4.1) with J = 2 and r = 1. Given the Granger partial sum representation, we interpret the common random-walk component as the efficient price shared by the cointegrated price vector p_it. The key to this decomposition is to use the orthogonal complements of the short-run adjustment matrix α_i^0 and the long-run cointegration matrix β_i^0, denoted α_i⊥^0 and β_i⊥^0. We then obtain the Granger partial sum representation for p_it: p_it equals a common stochastic trend term β_i⊥^0 m_it plus a stationary component and an initial condition (Equation (4.3)), where m_it collects the common random-walk (partial sum) component of the shocks. From this representation, we explicitly see that the bid and ask prices, p_it^bid and p_it^ask, share a common stochastic trend.
There is an additional stationary component related to the error-correction term, since R_i(L) β_i^0' ε_it* = β_i^0' p_it. We further combine the common permanent component with the initial condition to obtain a linear combination of the observed price vector, where the weights are controlled by the orthogonal complements of β_i^0 and α_i^0. In previous studies, Gonzalo and Granger (1995, GG hereafter) employ this idea and propose a corresponding PT decomposition for a generic cointegrated nonstationary J-vector y_it, in which the permanent component is m_it^GG = α_i⊥^0' y_it. Based on GG's PT decomposition, Harris, McInish, and Wood (2002) propose a new measure of the efficient price for multiple exchange price series. This measure is associated with m_it^GG and can be written as a linear function of the observed prices, where the weights are constructed from the short-run adjustment matrix α_i^0. In the case of bid and ask prices, y_it = p_it = (p_it^bid, p_it^ask)', most microstructure models assume β_i^0 = (1, −1)' (see, e.g., Hasbrouck 1995; Hansen and Lunde 2006) and β_i⊥^0 = (1, 1)'/√2. Thus, the bid and ask prices can be summarized as the efficient price plus a transitory component impounding various microstructure effects. However, there are two problems with GG's PT decomposition. First, even though the efficient price has a weighted-average representation, any two linear combinations of the bid and ask prices share the same common permanent component. For example, the usual bid-ask midpoint still serves as a good proxy for the unobserved efficient price.
The bid-ask midpoint = m_it + (1/2) × (transitory component), which is the efficient price plus a transitory component proportional to the bid-ask spread. The theoretical foundation behind this is that the two price series still obey the one-to-one relationship in the long run; GG's weights only serve to cancel the transitory microstructure effects and have nothing to do with the permanent adverse selection risk from information asymmetry. It further implies that the bid-ask spread only contains the transitory frictions and has no predictive power for stock returns. Second, the estimates of the short-run parameters α_i^0 may be biased and inconsistent due to weakly dependent error processes, which are commonly noted in market microstructure (see Hansen and Lunde 2006). Thus, even if the information asymmetry is not serious and the bid-ask relationship is one-to-one, GG's efficient price measure may still be biased.
Therefore, we find that the key to solving the above problems is to use the long-run cointegration relationship β_i^0 and allow the long-run cointegration vector to deviate from the one-to-one (1, −1)'-relationship. Then β_i⊥^0 ≡ (β_i⊥^b, β_i⊥^a)' can differ from (1, 1)'/√2, and we have p_it = β_i⊥^0 m_it + stationary part. In this case, the weights of the bid and ask prices in the efficient price contain information from the long-run parameters β_i^0. In addition, the deviation directly introduces a permanent component into the bid-ask spread, such that p_it^ask − p_it^bid = (β_i⊥^a − β_i⊥^b) m_it + stationary part. This is consistent with the decomposition in our motivation (see Equation (4.2)).

Cointegration-based PT Decomposition.
We propose a new permanent-transitory decomposition method based on the long-run equilibrium vector β_i^0; Johansen (1995, Corollary 4.4 on p. 53) discusses this PT decomposition. Let P_{β_i⊥^0} = β_i⊥^0 (β_i⊥^0' β_i⊥^0)^{-1} β_i⊥^0', and define P_{β_i^0} analogously. Since P_{β_i⊥^0} + P_{β_i^0} = I_J, we can decompose p_it as p_it = P_{β_i⊥^0} p_it + P_{β_i^0} p_it, where the first term gives the permanent component β_i⊥^0 m_it^β with m_it^β = (β_i⊥^0' β_i⊥^0)^{-1} β_i⊥^0' p_it, and the second term is transitory. This decomposition enjoys several good properties. First, it satisfies the definition of a permanent-transitory decomposition proposed by Quah (1992). Second, the efficient price is again a linear combination of observed prices; meanwhile, the construction of m_it^β depends on the long-run information, which reflects the information effects on the bid and ask weights. Third, the estimation of the long-run cointegration vectors is robust to the usual endogeneity, omitted variable bias, and weakly dependent error processes, which are commonly present in market microstructure data. Lastly, comparing the efficient prices in GG's and the β-based PT decompositions, m_it^β = m_it^GG + stationary component, so our efficient price measure m_it^β has the same permanent component as m_it^GG. In sum, our β-based PT decomposition maintains the key features of the Granger partial sum representation. The potential heterogeneity in the long-run cointegration relationships directly determines the asymmetric contributions of the bid and ask quotes to the efficient price.
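The β-based weights can be made concrete. For β_i^0 = (1, b_i^0)', an orthogonal complement is (−b_i^0, 1)'; normalizing its entries to sum to one, an illustrative convention we adopt so the efficient price reads as a weighted average, gives the bid and ask weights below.

```python
import numpy as np

def efficient_price_weights(b):
    """Weights of bid and ask in the beta-based efficient price.

    For beta = (1, b)', an orthogonal complement is beta_perp = (-b, 1)',
    since 1 * (-b) + b * 1 = 0. Normalizing beta_perp so its entries sum to
    one (our illustrative convention) expresses the efficient price as
        m = w_bid * p_bid + w_ask * p_ask.
    """
    beta_perp = np.array([-b, 1.0])
    return beta_perp / beta_perp.sum()

# b = -1 recovers the bid-ask midpoint; b = -0.8 tilts the weight
# toward the ask side (w_bid < 0.5), consistent with informed buying.
```

This makes the special case explicit: the conventional midpoint corresponds to exactly one point, b = −1, in the parameter space of long-run relationships.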

Data and Estimation Results
In this section we empirically estimate the proposed data-driven efficient price measure. Our data are collected from the NYSE Daily Trade and Quote (DTAQ) database. Due to computational limitations, we extract one-minute quotes for S&P 1500 stocks from 2004 to 2018. Following Holden and Jacobsen (2014), we combine both the "National Best Bid and Offer (NBBO)" and "Quote" files to compute the official complete NBBO. For each trading day, we compute the best bid and ask quotes at one-minute frequency during normal market hours between 9:30 a.m. and 4:00 p.m. for all eligible firms. To apply the C-Lasso method, we finally obtain a balanced panel with T = 390 best bid and ask prices and, on average, N ≈ 1000 stocks on each trading day. The C-Lasso estimation procedure performs well under this sample size. For each trading day, we implement the C-Lasso estimation procedure and simultaneously estimate the long-run cointegration parameters and recover the unobserved group structures. Specifically, we first impose latent group patterns on the bid-ask long-run equilibrium β_i^0 = (1, b_i^0)' to control for the unobserved parameter heterogeneity, and set the number of groups K = 5 for all samples. We choose K = 5 to balance group variation and estimation efficiency; the choice of the number of groups does not influence our empirical results. We conduct robustness checks by changing K from 5 to 3 and 7, with the results reported in the supplementary materials. We thus jointly obtain five group-specific bid-ask long-run estimates as well as data-driven stock classifications based on the group-specific estimates. As a further robustness check, we use different tuning parameters λ = c_λ T^{−3/4} with c_λ = {0.1, 0.25, 0.5, 1} in the C-Lasso estimation procedure.
We summarize the patterns of the efficient prices for possible ranges of b_i^0 in Table 1. As we hypothesize in Section 2.2, the efficient price is closer to the side with informed trading. For example, when the efficient price measure is near the ask price for a specific stock, that is, the weight of the bid quote is lower than 0.5 (w^b ∈ (0, 0.5)), the market maker should observe more order flow on the buy side and forecast that a good event is highly likely for this stock, which is eventually reflected in her quotations. In addition, our efficient price measure reduces to the midpoint when the bid-ask long-run equilibrium equals the one-to-one (1, −1)'-relationship.

[Table 1. General patterns of efficient price with different cointegrated relationships. The table presents the range of the efficient price, and the implied weight of the bid quote in the β-based efficient price measure, for different values of b_i in the long-run cointegration vector (1, b_i)' between the bid and ask prices.]

Based on the comparison between the estimates b̂_i of the long-run coefficients and −1, we further classify the five data-driven categories into two groups: a "midpoint" group and a "nonmidpoint" group. The "midpoint" group includes stocks whose bid-ask long-run cointegration estimates b̂_i are closest to −1; the remaining four groups are combined into the "nonmidpoint" group. From Table 2, we can see that more than 30% of the stocks in the S&P 1500 index are classified into the "nonmidpoint" group during the sample period. Since we focus on stocks in the S&P 1500 index, their information environment is better than that of smaller and more illiquid stocks, whose efficient prices are even more likely to deviate from the midpoint. In general, the results indicate that the bid-ask midpoint is not always an ideal measure of the efficient price.

Economic Foundations of the Data-driven Method
In this section, we explore the underlying mechanisms that drive stocks to be classified into the "midpoint" and "nonmidpoint" groups, from which we can examine whether our data-driven classification is reasonable and consistent with economic theories. To explore this issue, we first construct a monthly variable, Efficiency ratio = (#days a stock is classified into the "midpoint" group) / (#trading days), to measure how often a stock is classified into the "midpoint" group. The Efficiency ratio is the fraction of days in a month on which a stock is assigned to the "midpoint" group, and it therefore serves as a proxy for the extent of information efficiency. The higher the Efficiency ratio, the more often the efficient price equals the midpoint of the bid and ask quotes in that month, and the less likely informed trading is for the stock. We then regress the Efficiency ratio on a wide array of stock characteristics that have been shown to be related to informed trading. Table 3 reports the regression results across columns (1) to (4), which use different tuning parameters λ = c_λ T^{−3/4} with c_λ = {0.1, 0.25, 0.5, 1} in minimizing the PGLS criterion function (2.11); *, **, and *** denote statistical significance at the 10%, 5%, and 1% levels, respectively, with associated t-statistics in parentheses. We see that larger firms, firms with lower book-to-market (BM) ratios, and firms with good past performance are more often classified into the "midpoint" group. Institutional ownership is also positively related to the Efficiency ratio, which is consistent with the findings of Boehmer and Kelley (2009), who show that larger institutional holdings are associated with smaller price deviations from a random walk and with greater information symmetry.
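The Efficiency ratio itself is straightforward to compute from the daily classifications; a minimal sketch (label names are illustrative):

```python
def efficiency_ratio(daily_groups):
    """Monthly Efficiency ratio from daily group labels for one stock.

    daily_groups : list of that stock's daily classifications within a month,
                   where the string "midpoint" marks days on which the stock
                   falls in the midpoint group (label names are illustrative).
    Returns the fraction of trading days in the midpoint group.
    """
    midpoint_days = sum(g == "midpoint" for g in daily_groups)
    return midpoint_days / len(daily_groups)
```

A stock classified as "midpoint" on 15 of 20 trading days in a month would thus have an Efficiency ratio of 0.75 for that month.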
Moreover, more volatile and less liquid stocks are associated with a lower Efficiency ratio, consistent with our expectation that stocks whose values change quickly and stocks with low liquidity suffer from more serious information asymmetry and are less likely to be classified into the "midpoint" group.
In sum, the results in Table 3 show that the classification into midpoint and nonmidpoint groups generated by the data-driven C-Lasso method accords with the underlying economic intuition. Stocks with few information asymmetry problems are more often classified into the "midpoint" group. However, the bid-ask midpoint is not an ideal measure of the efficient price for stocks that suffer from serious information asymmetry, and these stocks are automatically classified into the "nonmidpoint" groups by the C-Lasso procedure. Therefore, our measure provides a comprehensive and flexible solution to efficient price measurement under different information environments in financial markets.

Validation Tests
In this section, we use firm information events that actually occurred to further justify the data-driven classification. Specifically, we conduct two tests to validate our measure and demonstrate that the direction of deviation is consistent with that of informed trading that actually occurred.
Our efficient price measure builds on the hypothesis that the efficient price deviates from the midpoint and moves toward the side with informed trading. Specifically, efficient prices deviating from the midpoint toward the ask price indicate the arrival of good information, while efficient prices close to the bid price indicate the arrival of bad information. To validate whether this is the case, we first examine the pattern of efficient price deviation prior to earnings announcements. We hypothesize that efficient prices are more likely to move toward the ask price (bid price) prior to earnings announcements with positive (negative) earnings surprises. Since the midpoint deviation only happens when informed trading occurs, we require that the return run-ups prior to an earnings announcement have the same sign as the upcoming earnings surprise.
We follow Livnat and Mendenhall (2006) to define the earnings surprise, SUE, as the difference between current earnings and last year's earnings, scaled by the stock price; "special items" are excluded from the Compustat data. The return run-up, RetRunup, is the 10-day cumulative abnormal stock return prior to the earnings announcement day, which measures the abnormal price change immediately before the earnings announcement. Informed trading occurs when SUE and RetRunup have the same sign. Table 4 reports the average number of days on which the efficient price deviates toward the ask price and the bid price during the 10 days prior to the earnings announcement in the benchmark, informed-buy, and informed-sell groups, respectively. In general, we find that the estimated efficient prices move toward bid prices in more cases, which is consistent with the average probability of deviation from the midpoint group reported in Table 2. Column (1) shows the pattern of efficient price deviation in the benchmark group, where SUE and RetRunup have opposite signs, indicating little possibility of informed trading. We compare the deviation patterns of the informed-buy group (SUE > 0, RetRunup > 0) in column (2) and the informed-sell group (SUE < 0, RetRunup < 0) in column (3) to those in the benchmark group. From the results in the fourth row, we can see that the number of days deviating toward the bid price, indicating selling pressure, significantly decreases when informed buying occurs and increases when informed selling occurs. However, there is no significant change in the deviation pattern toward the ask price.
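The sign rule that sorts announcements into the benchmark, informed-buy, and informed-sell groups can be written compactly (function and label names are ours):

```python
def informed_trading_sign(sue, ret_runup):
    """Classify a pre-announcement window by the SUE / run-up sign rule.

    Informed trading is flagged only when the earnings surprise (SUE) and
    the pre-announcement return run-up (RetRunup) share the same sign:
    both positive -> informed buying, both negative -> informed selling,
    opposite signs -> benchmark (little possibility of informed trading).
    """
    if sue > 0 and ret_runup > 0:
        return "informed-buy"
    if sue < 0 and ret_runup < 0:
        return "informed-sell"
    return "benchmark"
```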
Another important piece of evidence justifying the information-oriented deviation from the bid-ask midpoint is to link the data-driven group classification results with trading activities by informed traders. We choose insider trading prior to earnings announcements as the informed trading activity and test whether its direction is consistent with the direction of midpoint deviation. Corporate insiders have the most direct access to firm-specific information. Jagolinzer, Larcker, and Taylor (2011) find that about 24% of all insider trades occur within restricted trade windows, such as the period before earnings announcements. Therefore, any evidence that the direction of insider trading prior to earnings announcements is consistent with the deviation in our efficient price measure would support our hypothesis that the efficient price is closer to the side with informed trading. Specifically, if the deviation of the estimated efficient price really implies informed trading, efficient price deviations toward the ask price and the bid price should be associated with insider purchases and insider sales, respectively. Therefore, we compare insiders' average purchases and sales during the 10 days prior to the earnings announcements in two different cases: (a) the efficient price moves toward the ask price on more than three of the 10 days and does not approach the bid price on any day (that is, the stock is more likely to be classified into the buy-power group); and (b) the efficient price moves toward the bid price on more than three of the 10 days and does not approach the ask price on any day (that is, the stock is more likely to be classified into the sell-power group). The threshold of three days reflects the fact that the efficient price deviates from the midpoint with a probability of about 30%. To avoid misclassification, we impose the stricter requirement that one-sided deviation occur on more than three of the 10 days prior to the earnings announcements.
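The buy-power/sell-power screening rule described above can be sketched as follows (labels and function name are illustrative):

```python
def power_group(deviation_days):
    """Assign the buy-/sell-power group from 10 pre-announcement days.

    deviation_days : list of 10 labels in {"ask", "bid", "none"} recording
                     which side the estimated efficient price deviated toward
                     on each day (label names are illustrative).
    Buy-power: more than three "ask" days and zero "bid" days; sell-power
    is symmetric. Otherwise the stock is left unclassified.
    """
    ask_days = deviation_days.count("ask")
    bid_days = deviation_days.count("bid")
    if ask_days > 3 and bid_days == 0:
        return "buy-power"
    if bid_days > 3 and ask_days == 0:
        return "sell-power"
    return "unclassified"
```

The zero-days-on-the-other-side requirement is what makes the screen strict: a single opposite-side deviation disqualifies the stock from either power group.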
We aggregate the open market purchases and sales by all corporate insiders of the same firm on the same trading day. For a given stock, the pre-earnings-announcement insider purchases and sales are calculated as the annualized daily average proportions of shares bought and sold by insiders in the 10-day period prior to the earnings announcements, respectively.

Table 5. Deviations from the bid-ask midpoints and insider trading before earnings announcements.

                                         (1) Stocks more likely    (2) Stocks more likely
                                         in buy-power group        in sell-power group
Corporate insiders' average purchases    9.42%                     3.42%
Corporate insiders' average sales        6.29%                     9.15%
#Obs.                                    394                       1903

NOTE: This table illustrates insiders' average purchases and average sales during the 10 days before earnings announcements in the buy-power group and the sell-power group. The buy-power (sell-power) group is defined by the efficient price moving toward the ask (bid) price on more than three days, and not approaching the bid (ask) price on any day, during the 10 days before earnings announcements.

Column (1) of Table 5 shows that, for stocks more likely to be classified into the buy-power group, insiders' average purchases prior to the earnings announcement are about one and a half times their average sales. This is strong evidence, since insider sales are common in the market and can be driven by diversification or liquidity reasons. In column (2), where stocks are more often classified into the sell-power group, insider sales are more than twice as large as insider purchases. These results strongly support our hypothesis.

Conclusion
In this article, we consider a panel error-correction model with latent group structures to capture heterogeneity in the long-run cointegration vectors. We propose a novel data-driven method, the C-Lasso, to consistently and efficiently estimate the long-run cointegration vectors. Based on the cointegration parameters, we construct a new weighted-average measure of efficient price and automatically classify stocks into "midpoint" and "nonmidpoint" groups; the bid-ask midpoint emerges as a special case when stocks suffer little information asymmetry. Even though the classification is purely data-driven, we find that it is nevertheless associated with economically meaningful drivers and consistent with firm information events that actually occurred. This fact suggests that our measure provides a real-time and effective way to identify the extent and direction of informed trading, which are generally unobserved but valuable to investors and market regulators. Some aspects remain for future research. For example, it would be interesting to analyze the realized variance based on our efficient price measure and to examine the impacts of informed trading on asset price dynamics.

Supplementary Materials
The online supplement presents the proofs of the main theorems in the article, the simulation results and some additional application details.