Large-Scale Generalized Linear Models for Longitudinal Data with Grouped Patterns of Unobserved Heterogeneity

ABSTRACT This article provides methods for flexibly capturing unobservable heterogeneity from longitudinal data in the context of an exponential family of distributions. The group memberships of individual units are left unspecified, and their heterogeneity is influenced by group-specific unobservable factor structures. The model includes, as special cases, probit, logit, and Poisson regressions with interactive fixed effects along with unknown group membership. We discuss a computationally efficient estimation method and derive the corresponding asymptotic theory. Uniform consistency of the estimated group membership is established. To test heterogeneous regression coefficients within groups, we propose a Swamy-type test that allows for unobserved heterogeneity. We apply the proposed method to the study of market structure of the taxi industry in New York City. Our method unveils interesting and important insights from large-scale longitudinal data that consist of over 450 million data points.


Introduction
"Clustering" is one of the most popular methods for grouping data. In a wide range of disciplines, various clustering approaches are employed. Examples include Park and Park (2016) to understand and predict online customers' store visits and purchase behaviors, Kong et al. (2010) to improve the efficiency of a liver allocation system, and France and Ghose (2016) to identify market structures. Additional examples include measuring "varieties of capitalism" by Ahlquist and Breunig (2012), and the analysis of microarray gene expression data by Qin (2006), Heard, Holmes, and Stephens (2006), Chiou and Li (2007), Cai, Ma, and Zhang (2019), among others. James and Sugar (2003) developed a flexible modelbased procedure for clustering functional data. Peng and Muller (2008) applied distance-based clustering to analyze online auction data. Handcock, Raftery, and Jeremy (2007) analyzed social networks based on model-based clustering. Delaigle, Hall, and Bathia (2012) developed approaches for clustering functional data. We refer to Hennig et al. (2015) for an overview of the growing number of interdisciplinary applications of clustering analysis.
This article attempts to identify the latent group structure within longitudinal data in the context of an exponential family of distributions. There is a large volume of literature on clustering individual units of "linear" longitudinal data, for example, Lin and Ng (2012), Bonhomme and Manresa (2015), Su, Shi, and Phillips (2016), Bai (2016, 2017), Vogt and Linton (2017), Wang, Phillips, and Su (2018), Lumsdaine, Okui, and Wang (2020), and so forth. Zhang, Wang, and Zhu (2019)  proposed a quantile regression-based method for panel data to identify subgroups and estimate group-specific parameters, which may not always allow one to analyze data from an exponential family of distributions, for example, binary responses.
In contrast to these previous studies, this article employs generalized linear models to capture grouped patterns of unobserved heterogeneity in longitudinal data. The setup includes probit, logit, and count data models with interactive fixed effects as special cases (see, Bai 2009). Studies on clustering longitudinal data with unobserved heterogeneity in the context of an exponential family of distributions are scant. Many previous studies on binary response longitudinal data with unobserved effects (e.g., Fernández-Val and Weidner 2016; Boneva and Linton 2017;Charbonneau 2017;Moon, Shum, and Weidner 2018;Ando, Bai, and Li 2022;Chen, Fernández-Val, and Weidner 2021) did not accommodate the unobserved group membership structure. Additionally, many studies have considered group-specific regression coefficients, and they all assumed that the regression coefficients are homogeneous across units within groups but can differ across groups. Such models are special cases of our general setup. Wang and Su (2021) considered a procedure for identifying latent group structures in nonlinear longitudinal data models when some regression coefficients are heterogeneous across groups but homogeneous within a group. Although such setups often simplify the estimation and inference procedures, the resultant conclusions can be unreliable when the group-specific homogeneity assumption does not hold. Our empirical analysis indicates that the group-specific homogeneous assumption is not supported by data. We, therefore, allow for heterogeneous regression coefficients even within groups.
We emphasize that the extension from a linear model to the generalized linear model is not a straightforward task. In terms of estimation, we need to develop an iterative algorithm that changes depending on whether it is used for group membership estimation, regression coefficient estimation or estimation of the unobservable interactive effects (factor structure). This extension also involves several theoretical challenges that must be overcome.
First, the group membership structure is assumed to be unknown and estimated from the data. A natural question is whether we can correctly identify the group membership structure. To address this important concern, we establish a "uniform" consistency for the estimated group membership. It should be noted that the number of individuals goes to infinity in our theoretical setting, and thus, the proof of "uniform" consistency is not straightforward. By addressing this challenge, we show that the proposed method can identify true latent group structures with probability approaching one. Second, our asymptotic theory also addresses the consistency of the estimated regression coefficients, as well as that of the estimated unobserved heterogeneity. We note that the regression coefficients vary across individual units, and the number of individual units is allowed to go to infinity. Thus, the study of asymptotic theory for regression coefficients is a difficult task. Third, with regard to checking whether the regression coefficients are heterogeneous within a group, no approach has been proposed for large-scale generalized linear longitudinal models with grouped patterns of unobserved heterogeneity. There are many studies that have tested for the homogeneity of regression coefficients in linear longitudinal data models, including those of Pesaran, Smith, and Im (1996), Phillips and Sul (2003), Pesaran and Yamagata (2008), Blomquist and Westerlund (2013) and Ando and Bai (2015). However, studies testing for the homogeneity of regression coefficients in longitudinal data models under an exponential family of distributions are very limited. We propose a Swamy-type test and investigate the asymptotic distribution of the proposed test statistic. This is the first result obtained by testing for the homogeneity of the regression coefficients in nonlinear longitudinal data models under an exponential family of distributions.
This article applies the proposed method to large-scale longitudinal data from the taxi industry in New York City (NYC). In NYC, the taxi industry is highly regulated by the NYC Taxi & Limousine Commission (TLC). Some of the restrictions on the industry include the use of common pricing schemes and limits to the total number of taxis. Additionally, profitability may vary across firms, and this is mainly due to the efficiency with which taxi firms use their resources. For example, one of the ways in which firms can improve their financial performance is through the improvement of their capacity utilization rate. This rate, for example, can be measured by the fraction of time that a driver has a fare-paying passenger. From a managerial perspective, the performance evaluation of each individual taxi can prominently benefit a firm. It will allow the firm to compare its performance with that of its competitors, and it will yield information for making managerial decisions to improve performance. We empirically examine the efficiency of yellow cab medallion taxis in NYC by applying the proposed clustering approach.
In summary, our paper makes the following contributions. First, we introduce new nonlinear longitudinal data models with grouped patterns of unobserved heterogeneity. Second, a new model estimation and selection procedure is developed for the introduced model. Third, we show that the true group membership and the proposed estimator are asymptotically equivalent. This result considers the uniform consistency of the estimated group membership and thus identifies an attractive property of the proposed estimator. Fourth, we derive an asymptotic theory for the estimated model parameters. It is shown that the asymptotic distribution of the estimated regression coefficients is multivariate normal. Additionally, the asymptotic distributions of the estimated common factors and factor loadings are also multivariate normal. Fifth, we propose a Swamy-type test to test homogeneity of regression coefficients within a group. Finally, we apply our methods to estimate the capacity utilization rates based on NYC yellow taxi data. Our method reveals interesting and important insights from a large-scale longitudinal dataset that consists of over 450 million data points.
The remainder of this article is organized as follows. Section 2 introduces the nonlinear longitudinal data model with a factor structure, and Section 3 describes the modeling, estimation, and model selection procedures. Section 4 investigates the consistency of the proposed estimator. Its asymptotic behaviors are also investigated, and further theoretical studies are conducted. To save space, all technical proofs are provided in the supplementary materials, which also contains Monte Carlo simulation results. Section 5 applies the procedure to the analysis of the TLC dataset. Concluding remarks are provided in Section 6.

Model
In this section, we introduce a new model: a generalized linear longitudinal data model with grouped patterns of unobserved heterogeneity. Suppose that the response of an individual unit is measured over T time periods together with some observable explanatory variables. For the ith individual unit (i = 1, . . . , N), at time t, its response y it is observed together with a (p + 1)-dimensional vector of explanatory variables x it = (1, x it,1 , . . . , x it,p ) . We let S be the number of groups (which is unknown) and let G = {g 1 , . . . , g N } denote the group membership such that g i ∈ {1, . . . , S}. Additionally, we denote N j as the number of individual units within group j (j = 1, . . . , S) such that N = S j=1 N j . To capture grouped patterns of unobserved heterogeneity in nonlinear longitudinal data, we consider an exponential family of distributions: where a(·), b(·) and c(·) are known functions, and the parameter θ it,g i is expressed as is a (p+1)-dimensional vector of regression coefficients and η it,g i denotes the unobservable effects that also explain the characteristics of the variability in y it . In this article, we assume that the unobserved structure η it,g i , which depends on the group membership g i , is modeled with a factor structure: where f t,g i is an r g i × 1 vector of group-specific unobservable factors and λ i,g i represents the factor loadings. We note that the explanatory variables may be correlated with the group-specific factors, factor loadings, or both. In such cases, ignoring the factor structure leads to inconsistent estimates of the regression coefficients (see, e.g., Bai 2009;Bai and Li 2014;Pesaran 2006). Combining (1), (2), and (3), our nonlinear longitudinal model is formulated as Below are some specific examples of model (4).

Example 1. Consider the standard linear model y it
Assuming that ε it follows a normal distribution with mean 0 and variance (4) is given as Example 2. Let y ij be a binary outcome such that y it takes the value 0 or 1 and let be a cumulative distribution function, and For the probit model, (x) is the cdf of a standard normal. Then Example 3. Let y ij be a count outcome. A Poisson model can be considered with the conditional expectation of y it written as where (y, α) is the probability mass function of a Poisson random variable with parameter α.
Other examples include binomial, inverse Gaussian, and gamma distributions. When y it is a selection out of multiple possible choices, a multinomial choice model can be considered.
The unknown parameters are the regression coefficients B = {b 1 , . . . , b N }, the group membership G, the group-specific factors F j = (f 1,j , . . . , f T,j ) and the corresponding factor loadings j = (λ 1,j , . . . , λ N j ,j ) for j = 1, . . . , S. The number of groups S and the dimensions of the group-specific factors are unspecified, and thus need to estimated. These estimation and model selection problems are investigated in Section 3.

Model Building
In this section, we describe our model-building framework. This involves estimating model parameters as well as identifying the number of groups and the numbers of group-specific factors.
Given the number of groups S and the number of factors in group r j (j = 1, 2, . . . , S), the estimator {B,Ĝ,F 1 , . . . ,F S ,ˆ 1 , . . . ,ˆ S } is defined as the maximizer of the following loglikelihood function: subject to normalization restrictions on F j and j . Following Bai and Ng (2013), we impose the following restriction: F j F j /T = I r j , and j j is the diagonal. Bai and Ng (2013)) demonstrated that this restriction leads to the identification of F j and j . We first note that finding the exact maximizer of the likelihood function (5) subject to normalization restrictions is not an easy task. This is largely because the number of possible combinations of the group membership G can be enormous when the number of individuals N is large. In that case, it is computationally infeasible to explore all possible combinations of group membership. Because the parameters depend on one another, we need to update the set of parameters sequentially.
Given the group-specific factorsF j (j = 1, 2, . . . , S), we can easily update the group membership g i , the regression coefficients b i and the factor loadings λ i,g i as for i = 1, . . . , N. GivenB,¯ 1 , . . . ,¯ S andḠ, we update the group-specific factors f t,j as for j = 1, . . . , S. Then, we calculate the matrix where U j is an orthogonal matrix and V j is a diagonal matrix. The group-specific common factors and the factor-loading matrices satisfying the normalization restriction are given as follows: To capture the dependencies between the regressors and unobservable structures simultaneously due to the endogeneity problem, we use the following algorithm. Although the computation is approximate, as is mostly done in practice for clustering analyses, it allows us to obtain approximate solutions quickly.

In
Step 1, starting values are needed. To obtain the initial group membership G (0) , we simply apply the well-known Kmeans algorithm to the longitudinal data of y it . This algorithm divides the individual units into S groups. An initial estimate of b N) is obtained via the standard maximum likelihood approach by ignoring the factor structures. Given G (0) and b j } for group j by applying the standard principal component approach to y it , which belongs to group j.
While the convergence of the algorithm to a local maximum is guaranteed, the proposed algorithm cannot guarantee convergence to a global optimum (see, e.g., Bai 2009). To minimize the chance of a local minimum, multiple starting values from the Kmeans algorithm can be used for Step 1. Following Bai (2009), upon convergence, we choose the estimator that gives the largest likelihood function value.
In practice, the number of groups S and the number of groupspecific factors {r 1 , . . . , r S } are unknown. We use an information criterion (IC) to select these quantities. The IC criterion is defined as By minimizing the IC, we can choose the number of groups S and the number of group-specific factors k j (j = 1, . . . , S).
Remark 1. While this article focuses on (4), a researcher may be interested in estimating a model having both global common factors across all groups and group-specific factors: where f t,c is an r c -dimensional vector of global common factors across all groups and λ i,c is the corresponding r c -dimensional factor-loading vector. For the linear case, Ando and Bai (2017) studied its estimation procedure. By extending their estimation procedure to the exponential family of distributions, one can estimate the parameters in model (9). To avoid repeating the similar argument provided in this section, an estimation procedure for (9) is described in Appendix A of the supplementary materials.

Convergence to a Local Maximum
Given B, 1 , . . . , S and G, the log-likelihood function in (5) is concave in F 1 , . . . , F S . Therefore, we have where F new 1 , . . . , F new S can be obtained from (7). Under the fixed {F 1 , . . . , F S } and G, the log-likelihood function in (5) is concave in B (G) and 1 (G), . . . , S (G), where we emphsize the dependency of B and 1 , . . . , S on G. This implies that, for any G, the following inequality holds: with {F 1 , . . . , F S } and G being fixed. Thus, under the fixed value of the group-specific factors {F 1 , . . . , F S }, (6) leads We therefore have where we used (10) for the first inequality and (11) for the last inequality. This implies that the convergence of the algorithm to a local maximum is guaranteed at least. However, it is known that convergence to a global optimum is not guaranteed for interactive effects panel data models (see, e.g., Bai 2009).

Assumptions
We first introduce some notations. Let A = [tr(A A)] 1/2 be the Frobenius norm of the matrix A, where "tr" denotes the trace of a square matrix. Equation a n = O(b n ) states that the deterministic sequence a n is, at most, of order b n ; c n = O p (d n ) states that the random variable c n is, at most, of order d n in probability, and c n = o p (d n ) is of smaller order than b n in probability.
The true regression coefficient is denoted by B 0 = {b 1,0 , . . . , b N,0 }. The true group-specific factor and the factor loading of individual i with true group membership g 0 i = j are denoted as F j,0 = (f 1,j,0 , . . . , f T,j,0 ) and λ i,j,0 , respectively. We denote the true factor-loading matrix for the jth group as j,0 = [λ 1,j,0 , . . . , λ N j ,j,0 ] , where N j is the number of individuals in group j. Because the dimensions of j and F j are divergent in dimension, we cannot use the standard regularity conditions for likelihood functions. The set of regularity conditions for likelihood function with a large number of parameters is as follows:

Assumption A: Common Factors
Let F j be a compact subset of R r j . For each group j = 1, . . . , S, the group-specific factors f t,j,0 ∈ F j satisfy T −1 T t=1 f t,j,0 f t,j,0 = I r j .

Assumption B: Factor Loadings and Regression Coefficients
(B1): Let B and L j be compact subsets of R p and R r j , respectively. The regression coefficient b i and the factor-loading satisfy b i ∈ B and λ i,g i ∈ L j . Additionally, the factorloading matrix j,0 satisfies N −1 j j,0 j,0 → j as N j → ∞, where j is an r j × r j diagonal matrix with distinct diagonal elements and the smallest diagonal element bounded away from zero.
where a 1 , a 2 , b 1 , and b 2 are positive constants.

Assumption C: The Conditional Densities
Let A be the collection of F j such that A = {F j : F j F j /T = I}. We assume where (D4): For N j individuals belonging to the jth group, let ϒ(B j ) be an N j × T matrix with its (i, t)th entry equal to x it b i(j) , where B j = (b 1(j) , b 2(j) , . . . , b N j (j) ) are the regression coefficients corresponding to individuals belonging to the jth group. There exists a positive constant c > 0 such that, with probability approaching one,

Assumption E: Number of Units in Each Group
(E1): The true number of groups, S, is finite and independent of N and T. (E2): All units are divided into a finite number of groups S, with each containing N j units such that 0 < a < N j /N <ā < 1; this implies that the number of units in the jth group increases as the total number of units N grows.
Remark 2. Some comments regarding the assumptions are in order. Assumptions A and B1 imply the existence of r j groupspecific factors, j = 1, . . . , S. To identify the true group membership, Assumption B2 is needed. Assumption B3 assumes that the unobservable factor structure is strongly mixing with a faster than polynomial decay rate and a restricted tail property. This condition is used to bound the misclassification probabilities. Assumption C1 assumes that the independence property of y it is conditional on the explanatory variables and the factor structure. Assumption C2 restricts the tail property of the error. Assumption C3 ensures the maximization is well defined. We may assume the expected value of the second derivative is negative, that is, replacing y by E(y) = b (θ )/a (θ ). Assumption D1 requires some moment conditions for the explanatory variables. The explanatory variables can be correlated with the groupspecific factors, factor loadings or both. Assumption D2 is necessary for the consistent estimation of regression coefficients even if the group-specific factors are observable. Assumption D3 is used for proof of consistency when the factors and factor loadings are also estimated. Assumption D4 is used to ensure the consistency of b i . It requires sufficient variations in x it . A similar condition was also used by Ando, Bai, and Li (2022). This assumption rules out common regressors such that x it does not vary with i. But Ando, Bai, and Li (2022) argue that common regressors in fact present no problem. The same argument is applicable here. For regressors x it = x i (not changing with t), it will make more sense to allow its coefficient to be time varying such that b t . One may regard x i as observable factor loading, and b t as factors. The model will have an enlarged factor space. Some detailed discussion for common and time-invariant regressors are considered by Bai and Li (2014). Assumption E1 can be relaxed so that S also increases as N, T → ∞. However, this is outside of the scope of this article. Assumption E2 was also used by Ando and Bai (2017).

Asymptotic Results
This section investigates some asymptotic properties of the proposed estimators. All proofs of the theorems described below are provided in the materials. The true parameter values {B 0 , G 0 , F 1,0 , . . . , F S,0 , 1,0 , . . . , S,0 } are defined as the maximizer of the expected likelihood function E[L(B, G, F 1 , . . . , F S , 1 , . . . , S )] subject to the identification restriction on F j and j . Here, the expectation is taken with respect to the true conditional distribution of {y it : i = 1, . . . , N, t = 1, . . . , T}, conditional on X, G 0 , F j,0 and j,0 . We note that all asymptotic results are obtained with N, T → ∞. Our theoretical results are obtained by allowing the number of regression coefficient vectors N to diverge to infinity. The restrictions on the relative rates of convergence of N and T are specified below. We first claim that the estimated factors are consistent in the sense of an averaged norm.
Theorem 1 (Consistency). Under Assumptions A-E, the estimator {F j , j = 1, . . . , S} is consistent in the sense of the following norm: This indicates that the space spanned by the common factor is estimated consistently. With this result, we argue that the estimated group membership is a consistent estimator of the true membership. The estimates of B, {F j , j ; j = 1, . . . , S}, and G depend on each other. Following Ando and Bai (2016), we denote the estimator of group membershipĝ i asĝ i (B,F,ˆ ), withF = {F 1 , . . . ,F S } andˆ = {ˆ 1 , . . . ,ˆ S }. The next theorem claims that the estimated group memberships are consistent in the sense of an averaged norm.
Theorem 2 (Average consistency of the group membership estimator). Suppose that the assumptions in Theorem 1 hold. Then, Although Theorem 2 ensures that the group membership estimator is accurate in the average sense, it is ideal to show that the group membership estimator can accurately estimate the true membership across all individuals i = 1, . . . , N. The next theorem shows that the estimated group membership converges to the true group membership as T and N go to infinity.

Theorem 3 (Consistency of the estimator for group membership).
Suppose that the assumptions in Theorem 1 hold. Then, for all τ > 0 and T, N → ∞, we have We note that this uniform result holds across all individuals whose factor loadings are bounded away from zero, as in Assumption B. When N/T a → 0 for some a > 0, the true group membership g 0 i and the estimatorĝ i (B,F,ˆ ) are asymptotically equivalent. Theorem 3 is similar to results obtained by Bonhomme and Manresa (2015) and Ando and Bai (2017) for "linear" longitudinal models with unknown group memberships. Theorem 3 asserts that group membership consistency can be obtained even when the data are generated by an exponential family of distributions.
Next, we show that the asymptotic distribution of the estimated regression coefficients is a multivariate normal distribution. This finding is also true for the estimated factor loadings.
Let γ i = (b i , λ i,g i ) denote the vector of regression coefficients and the factor loadings, and let γ i,0 = (b i,0 , λ i,g 0 i ,0 ) denote its true value. We useγ i = (b i ,λ i,ĝ i ) to denote the maximizer of the likelihood function (5) subject to normalization restrictions on the factor structure. The next theorem states that the asymptotic distributions ofγ i and the estimated common factorsf t,j are consistent and asymptotically normal so hypothesis test and confidence intervals can be constructed.

Theorem 4. Under Assumptions A-E,
√ T/N → 0 and √ N/T → 0, we have (i) the asymptotic distribution of T 1/2 (γ i − γ i,0 ) is a multivariate normal with mean 0 and a covariance matrix of represent the first and second derivatives of the known functions a(·) and b(·) in (1).
The proof of the theorem is given in the online supplementary document. We make some comments about the theorem. First, the asymptotic variance is standard. For the estimated regression coefficients and factor loadings (γ i ), the limiting variance is the same as if the common factors f t and thus z it = (x it , f t ) were observable [part (i) of the theorem]. For the estimated factor (f t ), its asymptotic variance is the same as if the factor loadings were observable [part (ii) of the theorem]. These standard errors are automatically computed by the usual generalized linear model packages (e.g., glm in the R programming language). Second, the limiting variance is stated in a sandwich form. For binary model (probit or logit), it is not necessary to use the sandwich form, see, for example, Cameron and Trivedi (2005). So for binary y it , the limiting variance forγ i is simplified to −1 i , and the variance off t,j is simplified to −1 t,j . Third, the expression for h it,0 can be replaced by b (·) − y it a (·), since b (·)/a (·) is the conditional expectation of y it . Furthermore, for many models (e.g., linear, logit, and Poisson), a (·) ≡ 0, and thus h it, In the case of logit, with (x) = e x /(1 + e x ), and for Poisson models,

Testing the Homogeneity of the Slope Coefficients
To test whether the regression coefficients within groups are homogeneous over individual units, we propose a Swamy-type test. Let {b 1 , . . . , b N j } be the regression coefficients of N j individual units that belong to the jth group. The null hypothesis of interest is The alternative hypothesis is In our context, the Swamy's test statistic for the homogeneity of the regression coefficients within groups takes the form We state the assumptions needed for the asymptotic analysis of the testing procedure.

Assumption F: Central Limit Theorem
As T goes to infinity, We assume, as N, T → ∞,

Assumption G: Joint Limit
Remark 3. Assumptions F and G are pararell to Assumptions E and F in Ando and Bai (2015) where the linear case is studied. This is because the second derivative of b(θ ) in (14) is 1. Assumption G imposes the central limit theorem for ζ j,i . Note that, the expected value of ζ j,i −1 j,i ζ j,i is equal to p.
The following theorem provides the asymptotic distribution ofˆ j under the null hypothesis.
Theorem 5 (Swamy-type test). Suppose that assumptions in Theorem 4, Assumptions F and G hold. Then, under the null hypothesis H 0 ,ˆ j → N(0, 1) in distribution as T, N → ∞.
Theorem 5 suggests that the proposed test statisticˆ is asymptotically normal with mean 0 and unit variance under the null hypothesis of slope homogeneity. The testing procedure is easy to implement. A more general form of test statistic is developed in Appendix F of the supplementary materials.

NYC Taxi Data and an Empirical Model
The trip record data for yellow cab taxis are made publicly available by the NYC Taxi & Limousine Commission. We use a dataset made available by Brian and Dan (2016) Because the pricing structure is strictly regulated by authorities, it is important for individual taxis to use their vehicles efficiently. To understand the efficiency, one of the measurements is the capacity utilization rate, which is measured by the fraction of time that a driver has a fare-paying passenger. For the ith taxi, we let y it be the capacity utilization rate at time t. More specifically, the capacity utilization rate is defined as the fraction of time that a driver has a fare-paying passenger every hour, that is, y it /60, where y it is the total number of minutes per hour during which a cab drives a fare-paying passenger. Figure 3 shows the average capacity utilization rate at the time for a particular day of the week and time frame in 2010. Similar patterns were observed for the other years (See Appendix H in the supplementary materials. It contains similar plots for 2011, 2012, and 2013). We can see that the capacity utilization rate varies over the time frame. On average, the capacity utilization rate at midnight is lower than that during the daytime on weekdays. In contrast, the capacity utilization rate at midnight during the weekend is relatively high.
To analyze the capacity utilization rate, we fit the following model: where y it is the fraction of the total number of minutes per hour during which a taxi is driving a fare-paying passenger for the time period t, and 60 y it is the binomial coefficient.
In our analysis, we employ the following information for x it : Time frame (every hour: MIDNIGHT-1 a.m., 1 a.m.-2 a.m., …, 10 p.m.-11 p.m., 11 p.m.-MIDNIGHT), Day of the week (Monday, Tuesday,…, Friday, Saturday, Sunday). By combining "time frame" and "day of the week", we create a set of 168(= 24 × 7) indicator variables. Because the regressors are common over individuals, (15) becomes where x t is the vector of indicator variables. To avoid multicollinearity, the intercept term is not included in x t .

Estimation Results
To understand how the market structure evolved over time, we analyze the set of 48 longitudinal datasets. Here, 48 denotes 12 months × 4 years (2010, 2011, 2012, and 2013 (5). To determine the number of groups S and the number of group-specific common factors, we apply the information criterion in (8). The maximum number of groups is set to S max = 5 when we search for the best value based on the information criterion in (8). Table 1 summarizes the estimated number of groupsŜ for the set of 48 longitudinal datasets. Here,r j is the selected number of common factors for the jth group. We can see that the  2  4  3  2  2  2  2  2  2  2011 2  4  2  3  2  4  2  2  2  2  2  2  2012 2  2  4  2  2  2  2  4  2  2  2  3  2013 2  2  2  3  3  3  2  2  2  3  2  4 estimated number of groupsŜ is two in almost all months, while there are some variations. Figure 4 shows the Sankey diagram, which represents how the set of individual taxis form groups from one month to the next in 2010 and 2013. If the market structure is stable, group membership should be stable over a 12-month period. However, the Sankey diagram indicates that the membership has been shuffled every month. Similar observations are obtained for 2011 and 2012. (See Appendix H in the supplementary materials, which contains monthly plots for 2011 and 2012). Table 2 summarizes the total sum of the number of common factors Ŝ j=1r j for the set of 48 longitudinal datasets. From Table 2, we see that there are some unobservable variables that explain the performance. The model structure implies that such unobservable structures are a source of variation in group membership changes. The collection of all possible explanatory variables that may influence the response variables is sometimes costly due to restrictions on time, budget and so on. Our method, which automatically captures unobserved  heterogeneity through the factor structure, is useful because it can reduce costs for such practical issues.
Finally, we implement the proposed homogeneity test for the regression coefficients. According to our proposed test, for each group and each month from 2010 to 2013, the null hypothesis (the regression coefficients within groups are homogeneous across individual taxis) is rejected at the 5% level. This indicates that a model with some regression coefficients that are heterogeneous across groups but homogeneous within a group is not sufficient to capture the heterogeneity in the data. Thus, our proposed model and methods are important in that the groupspecific homogeneous regression coefficients will lead to biased results.

Concluding Remarks
The proposed method is a flexible approach for capturing grouped patterns of unobserved heterogeneity in nonlinear longitudinal data models. To handle the endogeneity issue, we developed a new estimation procedure with which the regression parameters, unobservable factor structure, and group membership were all estimated jointly. The consistency and asymptotic normality of the estimated regression coefficients were established even in the presence of unknown group memberships. We showed that group membership can be estimated consistently. We also examined the problem of testing the homogeneity of regression coefficients and developed a new testing procedure based on Swamy's (1970) test principle. If the test implies that the regression coefficients are homogenous within a group, the proposed model (4) with group-specific regression coefficients can be applied.
The proposed procedure was applied to the NYC taxi dataset and revealed useful information for taxi firm management. Although we analyzed the NYC taxi dataset, our method can be applied to datasets in different fields, including economics, finance, marketing, political science, medicine, and other areas.
Finally, advancements in information technology may create a situation where the number of explanatory variables is large.
To accommodate this situation, our proposed method can be extended to address cases with large numbers of explanatory variables. One of the most commonly used ideas is to use shrinkage methods that simply add a regularization term into the log-likelihood function (5). One may consider the lasso penalty (Tibshirani 1996) and its variants (Zou 2006;Yuan and Lin 2006), the elastic net penalty (Zou and Hastie 2005), the minimax concave penalty (Zhang 2010), the SCAD penalty of Fan and Li (2001) and Fan and Peng (2004), and so on. We would like to investigate this topic in a future study.

Supplementary Materials
All technical proofs are provided in the online supplementary document, along with Monte Carlo simulation results. The supplement also contains additional figures for the empirical analysis. The proposed methods are implementable through the R package PDMIF by Ando and Fayad (2022).