Robust estimation of panel data regression models and applications

Abstract Common parameter estimation methods for panel data linear models include least squares dummy variable estimation, two-stage least squares estimation, quasi-maximum likelihood estimation and generalized method of moments estimation. However, these estimation methods are not robust and are easily affected by outliers. This paper first extends the support vector regression algorithm to fit several parallel hyperplanes simultaneously, which provides a novel robust estimation of the fixed-effect panel data linear model; then, using the kernel trick, a robust estimation for the fixed-effect panel data nonlinear model is introduced. Finally, the proposed models (linear and nonlinear) are applied to forecasting the air quality index of the cities of the Jing-Jin-Ji district in China. Experiments show that the proposed models are robust and have good generalization performance.


Introduction
Support vector machines (SVMs) are a family of powerful algorithms that have become increasingly popular in artificial intelligence, data mining and classification. The basic SVM algorithm can be modified to perform regression analysis (Vapnik 1998); the resulting support vector regression (SVR) has achieved good performance in many regression and time series prediction tasks (Hansen, McDonald, and Nelson 2006; Yang 2020). By means of kernel functions, SVR can construct a variety of nonlinear functions.
Panel data record information on each individual unit over time. They contain richer information than time series or cross-sectional data, which allows researchers to estimate complex models and to address problems that cannot be handled with time series or cross-sectional data alone.
There are several methods for estimating the fixed effects panel data model, which fall into three classes: methods based on generalized least squares estimation (Pütz and Kneib 2018; Li, Yin, and Peng 2020; Rodriguez-Poo and Soberón 2015), quasi-maximum likelihood estimation (Sun and Huang 2022), and generalized method of moments estimation (Kim 2020).
Many economic problems involve nonlinear relationships, and nonlinear panel data models can extract more information from the data than classical models (Chen, Fernández-Val, and Weidner 2021). The most popular nonlinear models are logit models (Chen, Fernández-Val, and Weidner 2021) and probit models (Lieberman and Mátyás 2007). Nonlinear models with individual effects are typically estimated by conditional maximum likelihood (Kim and Sun 2016).
The estimation methods mentioned above, for both linear and nonlinear panel data models, are not robust and are vulnerable to outliers. SVR algorithms estimate a decision function by minimizing an upper bound of the generalization error; this bound is the sum of the training error and a confidence interval determined by the Vapnik-Chervonenkis (VC) dimension. As a result, SVR is resistant to over-fitting and can achieve good generalization performance. Extending SVR algorithms to panel data sets, and thereby constructing a robust estimation method for panel data models, therefore has both theoretical significance and application value.
The rest of this paper is organized as follows. The related preliminaries are reviewed in section 2. In section 3, novel robust estimations of fixed-effect panel data linear and nonlinear models are introduced. Finally, the proposed fixed-effect panel data models (linear and nonlinear) are applied to air quality index (AQI) forecasting for the cities of the Jing-Jin-Ji district in China; the experiments show that the proposed models are robust and have good generalization performance.

Preliminary
In this section, some related preliminaries are provided.

Support vector regression
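Consider training data $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N) \in \mathbb{R}^n \times \mathbb{R}$. The SVR of Vapnik (1998) seeks a function of the form

$$f(x) = w^T x + b. \tag{1}$$

SVR estimates the function (1) by solving the following optimization problem:

$$\min_{w, b, \xi, \xi^*} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} (\xi_i + \xi_i^*) \quad \text{s.t.} \quad \begin{cases} y_i - w^T x_i - b \le \varepsilon + \xi_i, \\ w^T x_i + b - y_i \le \varepsilon + \xi_i^*, \\ \xi_i, \ \xi_i^* \ge 0, \quad i = 1, 2, \ldots, N, \end{cases} \tag{2}$$

where $\frac{1}{2}\|w\|^2$ is a regularization term, $C$ is a constant specified beforehand that controls the tradeoff between minimizing the training error and maximizing the margin, $\varepsilon$ is the tolerance (error acceptance), and $\xi_i, \xi_i^*$ $(i = 1, 2, \ldots, N)$ are slack variables that measure the amount of violation of the constraints.
The optimization problem (2) has a global optimum, and its dual problem is

$$\max_{\alpha, \alpha^*} \ -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, x_i^T x_j - \varepsilon \sum_{i=1}^{N} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{N} y_i (\alpha_i - \alpha_i^*) \quad \text{s.t.} \quad \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) = 0, \quad 0 \le \alpha_i, \alpha_i^* \le C. \tag{3}$$

With $w = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) x_i$, the Karush-Kuhn-Tucker (KKT) conditions give the offset $b$ as

$$b = y_i - w^T x_i - \varepsilon \ \text{ for } \alpha_i \in (0, C), \quad \text{or} \quad b = y_i - w^T x_i + \varepsilon \ \text{ for } \alpha_i^* \in (0, C).$$

Thus the linear function (1) can be written as

$$f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, x_i^T x + b. \tag{4}$$

To learn a nonlinear function, SVR first maps the data to some higher dimensional feature space and then constructs a hyperplane in this space. The mapping to the feature space is denoted by $\phi : \mathbb{R}^n \to H$, $x \mapsto \phi(x)$, and $K(x, z) = \phi(x)^T \phi(z)$ is a kernel function.
The nonlinear decision function $f(x) = w \cdot \phi(x) + b$ is obtained from the following optimization problem:

$$\min_{w, b, \xi, \xi^*} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} (\xi_i + \xi_i^*) \quad \text{s.t.} \quad \begin{cases} y_i - w^T \phi(x_i) - b \le \varepsilon + \xi_i, \\ w^T \phi(x_i) + b - y_i \le \varepsilon + \xi_i^*, \\ \xi_i, \ \xi_i^* \ge 0, \quad i = 1, 2, \ldots, N, \end{cases} \tag{5}$$

where $C$ is the regularization parameter that controls the tradeoff between minimizing the training error and maximizing the margin. The dual problem of (5) is the following quadratic optimization problem:

$$\max_{\alpha, \alpha^*} \ -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(x_i, x_j) - \varepsilon \sum_{i=1}^{N} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{N} y_i (\alpha_i - \alpha_i^*) \quad \text{s.t.} \quad \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) = 0, \quad 0 \le \alpha_i, \alpha_i^* \le C. \tag{6}$$

After the optimal solutions $\alpha_i, \alpha_i^*$ are found by maximizing (6), the offset $b$ can be computed as

$$b = y_i - \sum_{j=1}^{N} (\alpha_j - \alpha_j^*) K(x_j, x_i) - \varepsilon \ \text{ for } \alpha_i \in (0, C), \quad \text{or} \quad b = y_i - \sum_{j=1}^{N} (\alpha_j - \alpha_j^*) K(x_j, x_i) + \varepsilon \ \text{ for } \alpha_i^* \in (0, C).$$

The nonlinear decision function $f(x) = w \cdot \phi(x) + b$ then becomes

$$f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) K(x_i, x) + b.$$

The most widely used kernel is the Gaussian radial basis kernel function $K(x, z) = \exp[-\gamma \|x - z\|^2]$, $\gamma > 0$.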
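The Gaussian radial basis kernel above can be illustrated with a short Python sketch. The function names `rbf_kernel` and `gram_matrix` and the three sample points are illustrative assumptions, not part of the original method description.

```python
import math

def rbf_kernel(x, z, gamma):
    """Gaussian RBF kernel K(x, z) = exp(-gamma * ||x - z||^2), gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def gram_matrix(points, gamma):
    """Matrix of pairwise kernel values, as it enters the dual problem."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]

# Three made-up points in R^2, used only for illustration.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(points, gamma=0.5)
```

The kernel matrix is symmetric with ones on the diagonal, and a smaller `gamma` makes distant points look more similar, which is why `gamma` is tuned together with `C`.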

Linear regression model for panel data
Panel data, also called longitudinal data, are a pooling of observations on a cross-section of individuals (such as households, countries or firms) over several time periods. Consider a panel data set $E = \{(X_{it}, y_{it}) : i = 1, 2, \ldots, N;\ t = 1, 2, \ldots, T\}$, where $y_{it}$ is the dependent variable and $X_{it} = (x_{it1}, x_{it2}, \ldots, x_{itk})$ collects the independent variables. A panel data regression model differs from a cross-section or regular time-series regression in that its variables carry a double subscript. Suppose the panel data model with individual-specific effects is

$$y_{it} = b_i + w^T X_{it} + e_{it}, \quad i = 1, 2, \ldots, N;\ t = 1, 2, \ldots, T, \tag{7}$$

where $w$ is the coefficient vector shared by all individuals, $b_i$ is the unobserved, time-invariant effect of individual $i$, and $e_{it}$ is the error term. The parameters $b_i$ and $w$ can be estimated by least-squares dummy-variable (LSDV) estimation (Baltagi 2008).
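For intuition about the least-squares baseline, the following Python sketch estimates a fixed-effects model with a single regressor by the within transformation, which yields the same slope as LSDV in this one-regressor case. The function name `within_estimator` and the noise-free toy panel are assumptions made for illustration only.

```python
def within_estimator(panel):
    """Fixed-effects fit for one regressor.

    panel: {individual: [(x_it, y_it), ...]}
    Returns (slope, {individual: effect}).
    """
    num = den = 0.0
    means = {}
    for i, obs in panel.items():
        x_bar = sum(x for x, _ in obs) / len(obs)
        y_bar = sum(y for _, y in obs) / len(obs)
        means[i] = (x_bar, y_bar)
        # Demeaning by individual removes the individual effect.
        for x, y in obs:
            num += (x - x_bar) * (y - y_bar)
            den += (x - x_bar) ** 2
    slope = num / den
    effects = {i: y_bar - slope * x_bar for i, (x_bar, y_bar) in means.items()}
    return slope, effects

# Toy panel without noise: individual A follows y = 5 + 2x, B follows y = -1 + 2x.
panel = {
    "A": [(1.0, 7.0), (2.0, 9.0), (3.0, 11.0)],
    "B": [(1.0, 1.0), (2.0, 3.0), (4.0, 7.0)],
}
slope, effects = within_estimator(panel)
```

Because every residual enters through a squared term, a single extreme observation can shift both the slope and the individual effects, which is the sensitivity the SVR-based estimator is meant to reduce.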

SVR-based fixed effects panel data model
For the given panel data set $E = \{(X_{it}, y_{it}) : X_{it} \in \mathbb{R}^n,\ y_{it} \in \mathbb{R},\ i = 1, 2, \ldots, N;\ t = 1, 2, \ldots, T\}$, estimators based on generalized least squares are easily affected by outliers and are therefore not robust, whereas SVR is a robust algorithm. In the following, SVR algorithms are first extended to fit several parallel hyperplanes simultaneously, which yields a novel estimation of fixed-effect panel data linear models; then, using the kernel trick, an estimation of fixed-effect panel data nonlinear models is introduced.

SVR-based fixed effects panel data linear regression model
In this section, the panel data linear regression model with individual-specific effects (7) is discussed.
Based on statistical learning theory (Vapnik 1998), the coefficients $w = (w_1, w_2, \ldots, w_n)^T$ of the panel data model (7) are found by the following generalized SVR algorithm.
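The generalized SVR algorithm estimates the function (7) by minimizing the following regularized risk function:

$$R(w, b) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \sum_{t=1}^{T} L_\varepsilon(y_{it}, f(X_{it})), \tag{9}$$

where $f(X_{it}) = w^T X_{it} + b_i$. The term $\|w\|^2$ is called the regularized term, and $L_\varepsilon(y_{it}, f(X_{it})) = \max\{0, |y_{it} - f(X_{it})| - \varepsilon\}$ is the $\varepsilon$-insensitive loss function.
The term $\sum_{i=1}^{N} \sum_{t=1}^{T} L_\varepsilon(y_{it}, f(X_{it}))$ is the empirical error measured by the $\varepsilon$-insensitive loss function. $C$ is the regularization constant, which trades off the empirical error against the regularized term.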
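The regularized risk can be evaluated directly, as in the following Python sketch. The factor 1/2 on the regularized term, the helper names and the toy panel are assumptions made for illustration; the sketch only evaluates the risk for given parameters and does not minimize it.

```python
def eps_insensitive_loss(y, f, eps):
    """L_eps(y, f) = max(0, |y - f| - eps): residuals inside the tube cost nothing."""
    return max(0.0, abs(y - f) - eps)

def regularized_risk(w, b, panel, C, eps):
    """0.5 * ||w||^2 + C * sum_i sum_t L_eps(y_it, w . X_it + b_i)."""
    penalty = 0.5 * sum(wd * wd for wd in w)
    empirical = 0.0
    for i, obs in panel.items():
        for x, y in obs:
            f = sum(wd * xd for wd, xd in zip(w, x)) + b[i]
            empirical += eps_insensitive_loss(y, f, eps)
    return penalty + C * empirical

# One individual; the last observation is a deliberate outlier (31 instead of 11).
panel = {"A": [((1.0,), 7.0), ((2.0,), 9.0), ((3.0,), 31.0)]}
risk = regularized_risk([2.0], {"A": 5.0}, panel, C=1.0, eps=0.5)
```

Here the outlier has a residual of 20 and contributes 19.5 to the empirical error, whereas a squared loss would contribute 400; this linear growth is the source of the robustness argued for in the text.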
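By introducing the nonnegative slack variables $\xi_{it}, \xi_{it}^*$, Eq. (9) is equivalent to the following Panel-SVR model:

$$\min_{w, b, \xi, \xi^*} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \sum_{t=1}^{T} (\xi_{it} + \xi_{it}^*) \quad \text{s.t.} \quad \begin{cases} y_{it} - w^T X_{it} - b_i \le \varepsilon + \xi_{it}, \\ w^T X_{it} + b_i - y_{it} \le \varepsilon + \xi_{it}^*, \\ \xi_{it}, \ \xi_{it}^* \ge 0, \quad i = 1, 2, \ldots, N;\ t = 1, 2, \ldots, T. \end{cases} \tag{10}$$

The Lagrange function of the primal problem (10) is

$$L = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \sum_{t=1}^{T} (\xi_{it} + \xi_{it}^*) - \sum_{i=1}^{N} \sum_{t=1}^{T} \alpha_{it} (\varepsilon + \xi_{it} - y_{it} + w^T X_{it} + b_i) - \sum_{i=1}^{N} \sum_{t=1}^{T} \alpha_{it}^* (\varepsilon + \xi_{it}^* + y_{it} - w^T X_{it} - b_i) - \sum_{i=1}^{N} \sum_{t=1}^{T} (\eta_{it} \xi_{it} + \eta_{it}^* \xi_{it}^*), \tag{11}$$

where the dual variables in Eq. (11) satisfy the positivity constraints $\alpha_{it}, \alpha_{it}^*, \eta_{it}, \eta_{it}^* \ge 0$. It follows from the saddle point condition that the partial derivatives of $L$ with respect to the primal variables are equal to zero, that is,

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{N} \sum_{t=1}^{T} (\alpha_{it} - \alpha_{it}^*) X_{it} = 0, \tag{13}$$

$$\frac{\partial L}{\partial b_i} = \sum_{t=1}^{T} (\alpha_{it}^* - \alpha_{it}) = 0, \quad i = 1, 2, \ldots, N, \tag{14}$$

$$\frac{\partial L}{\partial \xi_{it}} = C - \alpha_{it} - \eta_{it} = 0, \tag{15}$$

$$\frac{\partial L}{\partial \xi_{it}^*} = C - \alpha_{it}^* - \eta_{it}^* = 0. \tag{16}$$

Substituting (13)-(16) into the Lagrange function (11) yields the dual quadratic optimization problem (Panel-SVR)

$$\max_{\alpha, \alpha^*} \ -\frac{1}{2} \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{j=1}^{N} \sum_{s=1}^{T} (\alpha_{it} - \alpha_{it}^*)(\alpha_{js} - \alpha_{js}^*) (X_{it} \cdot X_{js}) - \varepsilon \sum_{i=1}^{N} \sum_{t=1}^{T} (\alpha_{it} + \alpha_{it}^*) + \sum_{i=1}^{N} \sum_{t=1}^{T} y_{it} (\alpha_{it} - \alpha_{it}^*) \quad \text{s.t.} \quad \sum_{t=1}^{T} (\alpha_{it} - \alpha_{it}^*) = 0, \ i = 1, 2, \ldots, N; \quad 0 \le \alpha_{it}, \alpha_{it}^* \le C. \tag{17}$$

Unlike the standard SVR dual (3), problem (17) has one equality constraint per individual, which comes from condition (14). After obtaining the optimal solutions $\alpha_{it}, \alpha_{it}^*$ of the optimization problem (17), the coefficient vector is $w = \sum_{i=1}^{N} \sum_{t=1}^{T} (\alpha_{it} - \alpha_{it}^*) X_{it}$. By the Karush-Kuhn-Tucker (KKT) conditions,

$$\alpha_{it} (\varepsilon + \xi_{it} - y_{it} + w^T X_{it} + b_i) = 0, \qquad \alpha_{it}^* (\varepsilon + \xi_{it}^* + y_{it} - w^T X_{it} - b_i) = 0.$$

Then, for a fixed individual $i$, the individual effect $b_i$ can be computed as

$$b_i = y_{it} - \sum_{j=1}^{N} \sum_{s=1}^{T} (\alpha_{js} - \alpha_{js}^*) (X_{js} \cdot X_{it}) - \varepsilon \ \text{ for } \alpha_{it} \in (0, C), \quad \text{or} \quad b_i = y_{it} - \sum_{j=1}^{N} \sum_{s=1}^{T} (\alpha_{js} - \alpha_{js}^*) (X_{js} \cdot X_{it}) + \varepsilon \ \text{ for } \alpha_{it}^* \in (0, C).$$

The panel data linear model with individual-specific effects is then constructed as

$$\hat{y}_{it} = b_i + \sum_{j=1}^{N} \sum_{s=1}^{T} (\alpha_{js} - \alpha_{js}^*) (X_{js} \cdot X_{it}), \quad i = 1, 2, \ldots, N;\ t = 1, 2, \ldots, T.$$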
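The recovery of $w$ and $b_i$ from a dual solution can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `recover_primal` is hypothetical, and the dual values below were worked out by hand for a two-individual toy panel with $\varepsilon = 0.5$ and $C = 10$ rather than produced by a quadratic programming solver.

```python
def recover_primal(panel, duals, eps, C, tol=1e-12):
    """Recover the shared coefficients w and the individual effects b_i.

    panel: {individual: [(X_it, y_it), ...]} with X_it a tuple of floats
    duals: {individual: [(alpha_it, alpha_star_it), ...]} in the same order
    """
    dim = len(next(iter(panel.values()))[0][0])
    # w = sum_i sum_t (alpha_it - alpha*_it) X_it
    w = [0.0] * dim
    for i, obs in panel.items():
        for (x, _), (a, a_star) in zip(obs, duals[i]):
            for d in range(dim):
                w[d] += (a - a_star) * x[d]
    # KKT: a dual variable strictly inside (0, C) pins down b_i for its individual.
    b = {}
    for i, obs in panel.items():
        for (x, y), (a, a_star) in zip(obs, duals[i]):
            fit = sum(wd * xd for wd, xd in zip(w, x))
            if tol < a < C - tol:
                b[i] = y - fit - eps
                break
            if tol < a_star < C - tol:
                b[i] = y - fit + eps
                break
    return w, b

# Toy panel: both individuals share slope 2 in the data but differ in level.
panel = {
    "A": [((0.0,), 0.0), ((1.0,), 2.0)],
    "B": [((0.0,), 10.0), ((1.0,), 12.0)],
}
# Hand-derived dual optimum for eps = 0.5, C = 10 (each individual sums to zero).
duals = {
    "A": [(0.0, 0.5), (0.5, 0.0)],
    "B": [(0.0, 0.5), (0.5, 0.0)],
}
w, b = recover_primal(panel, duals, eps=0.5, C=10.0)
```

For this toy problem the flattest function that keeps every residual within the tolerance has $w = 1$, with $b_A = 0.5$ and $b_B = 10.5$: the two fitted lines are parallel and differ only in their individual effects.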

SVR-based fixed effects panel data nonlinear regression model
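For the given panel data set $E = \{(X_{it}, y_{it}) : X_{it} \in \mathbb{R}^n,\ y_{it} \in \mathbb{R},\ i = 1, 2, \ldots, N;\ t = 1, 2, \ldots, T\}$, suppose the independent variables $X_{it}$ are related to the dependent variable $y_{it}$ through the nonlinear panel data relationship with individual-specific effects

$$y_{it} = b_i + w^T \phi(X_{it}) + e_{it}, \quad i = 1, 2, \ldots, N;\ t = 1, 2, \ldots, T, \tag{19}$$

where $\phi : \mathbb{R}^n \to H$ maps $\mathbb{R}^n$ to some higher dimensional feature space $H$ and $K(x, z) = \phi(x)^T \phi(z)$ is a kernel function. SVR estimates the panel data nonlinear regression model (19) by solving the following optimization problem:

$$\min_{w, b, \xi, \xi^*} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \sum_{t=1}^{T} (\xi_{it} + \xi_{it}^*) \quad \text{s.t.} \quad \begin{cases} y_{it} - w^T \phi(X_{it}) - b_i \le \varepsilon + \xi_{it}, \\ w^T \phi(X_{it}) + b_i - y_{it} \le \varepsilon + \xi_{it}^*, \\ \xi_{it}, \ \xi_{it}^* \ge 0, \quad i = 1, 2, \ldots, N;\ t = 1, 2, \ldots, T. \end{cases} \tag{20}$$

Proceeding as in the construction of the dual optimization problem (17), the dual of the optimization problem (20) is

$$\max_{\alpha, \alpha^*} \ -\frac{1}{2} \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{j=1}^{N} \sum_{s=1}^{T} (\alpha_{it} - \alpha_{it}^*)(\alpha_{js} - \alpha_{js}^*) K(X_{it}, X_{js}) - \varepsilon \sum_{i=1}^{N} \sum_{t=1}^{T} (\alpha_{it} + \alpha_{it}^*) + \sum_{i=1}^{N} \sum_{t=1}^{T} y_{it} (\alpha_{it} - \alpha_{it}^*) \quad \text{s.t.} \quad \sum_{t=1}^{T} (\alpha_{it} - \alpha_{it}^*) = 0, \ i = 1, 2, \ldots, N; \quad 0 \le \alpha_{it}, \alpha_{it}^* \le C. \tag{21}$$

After obtaining the optimal solutions $\alpha_{it}, \alpha_{it}^*$ of the optimization problem (21), $w = \sum_{i=1}^{N} \sum_{t=1}^{T} (\alpha_{it} - \alpha_{it}^*) \phi(X_{it})$. By the KKT conditions, for a fixed individual $i$, the individual effect $b_i$ can be computed as

$$b_i = y_{it} - \sum_{j=1}^{N} \sum_{s=1}^{T} (\alpha_{js} - \alpha_{js}^*) K(X_{js}, X_{it}) - \varepsilon \ \text{ for } \alpha_{it} \in (0, C), \quad \text{or} \quad b_i = y_{it} - \sum_{j=1}^{N} \sum_{s=1}^{T} (\alpha_{js} - \alpha_{js}^*) K(X_{js}, X_{it}) + \varepsilon \ \text{ for } \alpha_{it}^* \in (0, C). \tag{22}$$

The panel data nonlinear regression model with individual-specific effects is then

$$\hat{y}_{it} = b_i + \sum_{j=1}^{N} \sum_{s=1}^{T} (\alpha_{js} - \alpha_{js}^*) K(X_{js}, X_{it}), \quad i = 1, 2, \ldots, N;\ t = 1, 2, \ldots, T.$$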
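Once the dual coefficients and individual effects are available, prediction with the kernelized model reduces to a weighted sum of kernel values plus the effect of the individual concerned. The Python sketch below shows this step only; the names `rbf` and `predict`, the two support points, their coefficients and the city effects are all made-up values for illustration and are not the solution of any fitted model.

```python
import math

def rbf(x, z, gamma):
    """Gaussian RBF kernel exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def predict(x_new, individual, support, effects, gamma):
    """y_hat = b_i + sum over (j, s) of (alpha_js - alpha*_js) * K(X_js, x_new)."""
    total = effects[individual]
    for x_js, coef in support:
        total += coef * rbf(x_js, x_new, gamma)
    return total

# Made-up pairs (X_js, alpha_js - alpha*_js) and made-up individual effects.
support = [((0.0, 0.0), 1.0), ((2.0, 0.0), -1.0)]
effects = {"city_1": 3.0, "city_2": 5.0}

x_new = (1.0, 0.0)
y_city_1 = predict(x_new, "city_1", support, effects, gamma=1.0)
y_city_2 = predict(x_new, "city_2", support, effects, gamma=1.0)
```

For the same input, the predictions for two individuals differ exactly by the difference of their effects, which is the kernel-space counterpart of fitting parallel hyperplanes.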

Monte Carlo experiments and practical applications
In this section, the proposed forecasting models are evaluated by Monte Carlo experiments and by a practical application to AQI forecasting. The error measure $E$ is used as the criterion to quantitatively evaluate the performance of the proposed models; it compares $y_{it}$, the actual AQI value, with $\hat{y}_{it}$, the corresponding predicted value for individual $i$ at time $t$, $i = 1, 2, \ldots, N$, $t = 1, 2, \ldots, T$.
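The exact formula of $E$ is not reproduced in this text, so the following Python sketch shows two common candidates for such an error measure, the root mean squared error and the mean absolute percentage error. Both choices, the function names and the sample values are assumptions and should not be read as the definition used in the experiments.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over all (i, t) observations, flattened."""
    n = len(actual)
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(actual, predicted)) / n)

def mape(actual, predicted):
    """Mean absolute percentage error in percent; requires nonzero actual values."""
    n = len(actual)
    return 100.0 * sum(abs(y - f) / abs(y) for y, f in zip(actual, predicted)) / n

# Made-up AQI values and forecasts, flattened over cities and days.
actual = [100.0, 200.0, 50.0, 80.0]
predicted = [110.0, 180.0, 50.0, 80.0]
error_rmse = rmse(actual, predicted)
error_mape = mape(actual, predicted)
```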

Monte Carlo experiments
In order to assess the finite sample properties of the proposed SVR-based fixed effects panel data nonlinear regression model, Monte Carlo experiments are conducted. A static nonlinear panel data model is used as the data generation process (DGP). The random errors $e_{it}$ are assumed to be independent and identically distributed, $e_{it} \sim N(0, 1)$, and $x_{1it}, x_{2it}$ are drawn randomly: $x_{1it}$ from a uniform distribution on the interval [2, 3] and $x_{2it}$ from a uniform distribution on the interval [10, 12]. The panel data set $R$ is generated for $T = 100$, $R = \{(x_{1it}, x_{2it}, y_{it}) \mid i = 1, 2, 3;\ t = 1, 2, \ldots, 100\}$. The generated panel data set $R$ is used to train the SVR-based fixed effects panel data nonlinear forecasting model (21). The Gaussian radial basis kernel function $K(x, z) = \exp[-\gamma \|x - z\|^2]$ is used, and 10-fold cross validation with grid search is used to select the parameters, giving $C = 0.6526$ and $\gamma = 0.0069$. The mean training error $E$ is 2.0357 and the mean test error $E$ is 2.2628. The Monte Carlo experiments indicate that the proposed forecasting model performs well.
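The sampling scheme described above can be sketched in Python. The regressor and error distributions and the panel dimensions follow the text, but the functional form of $y_{it}$ and the individual effects in this sketch are hypothetical placeholders, since the DGP equation itself is not reproduced here.

```python
import math
import random

def simulate_panel(n_individuals=3, n_periods=100, seed=0):
    """Draw one synthetic panel as rows (i, t, x1, x2, y)."""
    rng = random.Random(seed)
    # Hypothetical individual effects b_i = 1, 2, 3.
    effects = [float(i + 1) for i in range(n_individuals)]
    rows = []
    for i in range(n_individuals):
        for t in range(n_periods):
            x1 = rng.uniform(2.0, 3.0)    # x_1it ~ U[2, 3]
            x2 = rng.uniform(10.0, 12.0)  # x_2it ~ U[10, 12]
            e = rng.gauss(0.0, 1.0)       # e_it ~ N(0, 1)
            # Hypothetical nonlinear link, chosen only to make the sketch concrete.
            y = effects[i] + x1 ** 2 + math.log(x2) + e
            rows.append((i + 1, t + 1, x1, x2, y))
    return rows

panel = simulate_panel()
```

Repeating the draw with different seeds and averaging the resulting training and test errors gives Monte Carlo means of the kind reported in the text.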

Practical application in AQI forecasting
Recently, the general public has become very sensitive to trends in air quality, since polluted air can cause allergies, diseases and even death (He, Ding, and Prasad 2019; Liu, Xu, and Yang 2018). The AQI is an index that quantitatively describes the air quality level; it is estimated from six pollutants: PM10, PM2.5, SO2, CO, NO2 and O3. Real-time air quality information is very important for human health protection and air pollution control (Lin and Zhu 2018), and accurate air quality forecasting can provide important support for urban environmental management decision-making (He 2018).
The AQI forecasting methods mentioned above use time series or cross-sectional data alone. Panel data models offer a number of advantages over pure time series or pure cross-sectional data sets (Frees and Miller 2004).
AQI data were collected from the website https://www.zq12369.com/. Daily data on the AQI and the six related pollutants (PM10, PM2.5, SO2, CO, NO2, O3) were obtained for 9 cities in the Jing-Jin-Ji district over the period 1 November 2018 to 31 December 2018. The panel data for the period 1 November 2018 to 15 December 2018 are selected as the training data set, and the remaining panel data as the test data set.
Let $y_{it}$ be the dependent variable, the AQI; the six pollutants are the independent variables, represented by $x_{1it}, x_{2it}, x_{3it}, x_{4it}, x_{5it}, x_{6it}$ respectively. The data set is given in Appendix 1.
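The chronological split described above can be reproduced with the Python standard library. The variable names are illustrative, and the sketch only counts days and observations; it does not load the AQI data.

```python
from datetime import date, timedelta

start = date(2018, 11, 1)
end = date(2018, 12, 31)
last_training_day = date(2018, 12, 15)
n_cities = 9

# Every calendar day in the sample, in order.
days = [start + timedelta(days=k) for k in range((end - start).days + 1)]
train_days = [d for d in days if d <= last_training_day]
test_days = [d for d in days if d > last_training_day]

# One observation per city and day.
n_train = n_cities * len(train_days)
n_test = n_cities * len(test_days)
```

The split yields 45 training days and 16 test days, that is, 405 training and 144 test observations for the 9 cities. Splitting by date rather than at random keeps the test period strictly after the training period, as a forecasting exercise requires.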
The SVR-based fixed effects panel data linear forecasting model of the AQI, Eq. (27), is then fitted to the panel data set. The training error $E$ is 3.649904 and the test error is 2.506298. Compared with the least-squares dummy-variable forecasting model, the SVR-based fixed effects panel data model has good training accuracy and test accuracy. Furthermore, the least-squares dummy-variable forecasting model is easily affected by noisy data or outliers, whereas the SVR-based panel data forecasting model, because of its sparsity, is robust and is not easily affected by noisy data or outliers.

SVR-based fixed effects panel data nonlinear forecasting model
In the research mentioned above (Yang 2020; Taylan 2017; Sun and Huang 2022; Xu, Liu, and Duan 2020), the studies are limited to a single city, the data are daily averages of air pollution indicators, and the AQI forecasting models are linear.
In the first part of section 4.2, the SVR-based fixed effects panel data linear forecasting model was constructed from daily average data of air pollution indicators; both its training accuracy and its test accuracy are good.
In practice, the air pollution indicators vary within an interval over each day, so interval-valued data should be used to represent them. The panel data for the period 1 November 2018 to 30 November 2018 are selected as the training data set, and the remaining panel data as the test data set.
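One way to build such interval-valued observations is to take the minimum and maximum of the readings recorded within each day, as in the Python sketch below. Deriving the intervals from intra-day readings is an assumption about the data preparation, and the readings, keys and helper name are made up for illustration.

```python
def daily_interval(readings):
    """Return (lower, upper) as the minimum and maximum of one day's readings."""
    return min(readings), max(readings)

# Made-up intra-day PM2.5 readings for one city on two days.
readings_by_day = {
    ("city_1", "2018-11-01"): [35.0, 42.0, 58.0, 61.0, 47.0],
    ("city_1", "2018-11-02"): [20.0, 18.0, 25.0, 31.0],
}
intervals = {key: daily_interval(vals) for key, vals in readings_by_day.items()}

# The lower and upper bounds form two separate panels, one per model.
lower_panel = {key: bounds[0] for key, bounds in intervals.items()}
upper_panel = {key: bounds[1] for key, bounds in intervals.items()}
```

Keeping the lower and upper bounds in separate panels matches the Min-Max approach, in which one model is fitted to the minimum values and another to the maximum values.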
Because the minimum (maximum) value of the AQI and the minimum (maximum) values of the air pollution indicators do not occur at the same time, they do not follow a linear relationship. In the Min-Max panel interval-valued data regression model, suppose the panel interval-valued data model with individual-specific effects is

$$y_{it}^L = b_i^L + (w^L)^T \phi(x_{1it}^L, x_{2it}^L, \ldots, x_{6it}^L) + e_{it}^L, \quad i = 1, 2, \ldots, 9;\ t = 1, 2, \ldots, 30, \tag{28}$$

$$y_{it}^U = b_i^U + (w^U)^T \phi(x_{1it}^U, x_{2it}^U, \ldots, x_{6it}^U) + e_{it}^U, \quad i = 1, 2, \ldots, 9;\ t = 1, 2, \ldots, 30, \tag{29}$$

where $\phi : \mathbb{R}^6 \to H$ is a mapping from $\mathbb{R}^6$ to some higher dimensional feature space $H$ and $K(x, z) = \phi(x)^T \phi(z)$ is a kernel function. In this work, the Gaussian radial basis kernel function $K(x, z) = \exp[-\gamma \|x - z\|^2]$ is used, and 10-fold cross validation with grid search is used to select the parameters (Jiang and Wang 2017), giving $C = 0.5$ and $\gamma = 0.0001$.
The panel data nonlinear regression model (28) is estimated by solving the optimization problem (21) on the panel training data set $\{(x_{1it}^L, x_{2it}^L, \ldots, x_{6it}^L; y_{it}^L) \mid i = 1, 2, \ldots, 9;\ t = 1, 2, \ldots, 30\}$. The individual-specific effects $b_i^L$ are calculated by Eq. (22), which gives the minimum value model of the panel data nonlinear forecasting model with individual-specific effects:

$$\hat{y}_{it}^L = b_i^L + \sum_{j=1}^{9} \sum_{s=1}^{30} (\alpha_{js}^L - \alpha_{js}^{*L}) K(x_{js}^L, x_{it}^L), \tag{30}$$

where $x_{js}^L = (x_{1js}^L, x_{2js}^L, \ldots, x_{6js}^L)^T$, $x_{it}^L = (x_{1it}^L, x_{2it}^L, \ldots, x_{6it}^L)^T$, and $\alpha_{it}^L, \alpha_{it}^{*L}$ is the optimal solution of the optimization problem (21). Similarly, the maximum value model of the panel data nonlinear forecasting model with individual-specific effects is obtained from the panel training data set $\{(x_{1it}^U, x_{2it}^U, \ldots, x_{6it}^U; y_{it}^U) \mid i = 1, 2, \ldots, 9;\ t = 1, 2, \ldots, 30\}$:

$$\hat{y}_{it}^U = b_i^U + \sum_{j=1}^{9} \sum_{s=1}^{30} (\alpha_{js}^U - \alpha_{js}^{*U}) K(x_{js}^U, x_{it}^U), \tag{31}$$

where $x_{js}^U = (x_{1js}^U, x_{2js}^U, \ldots, x_{6js}^U)^T$, $x_{it}^U = (x_{1it}^U, x_{2it}^U, \ldots, x_{6it}^U)^T$, $\alpha_{it}^U, \alpha_{it}^{*U}$ is the optimal solution of the optimization problem (21), and $b_i^U$ is calculated by Eq. (22). The individual (city) effects $b_i^L, b_i^U$ are given in Table 3.
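The parameter selection step mentioned above (10-fold cross validation with grid search) can be organized as in the following Python skeleton. The names `k_fold_indices`, `grid_search` and `toy_cv_error`, the candidate grids and the contiguous fold layout are assumptions; `toy_cv_error` is a stand-in with a known minimum and does not fit any model. A real run would replace it with a function that trains the Panel-SVR on the training indices and returns the error on the validation indices.

```python
from itertools import product

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    sizes = [n // k + (1 if r < n % k else 0) for r in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def grid_search(n, k, grid_C, grid_gamma, cv_error):
    """Return (mean_error, C, gamma) with the lowest mean validation error."""
    folds = k_fold_indices(n, k)
    best = None
    for C, gamma in product(grid_C, grid_gamma):
        errors = []
        for f, val_idx in enumerate(folds):
            train_idx = [j for g, fold in enumerate(folds) if g != f for j in fold]
            errors.append(cv_error(train_idx, val_idx, C, gamma))
        mean_error = sum(errors) / k
        if best is None or mean_error < best[0]:
            best = (mean_error, C, gamma)
    return best

def toy_cv_error(train_idx, val_idx, C, gamma):
    """Stand-in error surface with its minimum at C = 0.5, gamma = 0.0001."""
    return abs(C - 0.5) + abs(gamma - 0.0001)

# 9 cities x 30 training days = 270 observations.
folds = k_fold_indices(270, 10)
best = grid_search(270, 10, [0.1, 0.5, 1.0], [0.0001, 0.001, 0.01], toy_cv_error)
```

Contiguous folds are only one possible layout; for time-ordered panels a blocked or forward-chaining scheme may be preferable, and the text does not say which variant was used.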
Applying the SVR-based fixed effects panel data nonlinear forecasting models (30) and (31) to the panel data set in Appendix 2 gives a training error $E$ of 3.649904 for the minimum value model and 4.123632 for the maximum value model, and a test error $E$ of 3.357716 for the minimum value model and 3.102853 for the maximum value model.
The above numerical experiments show that the proposed SVR-based fixed-effect panel data models (linear and nonlinear) have good training accuracy and test accuracy.

Conclusions
The estimation of nonlinear panel data models is a challenging task. This paper first extends the support vector regression (SVR) algorithm to fit several parallel hyperplanes simultaneously, which gives a novel nonparametric estimation of the fixed-effect panel data linear model; then, using the kernel trick, a nonparametric estimation of the fixed-effect panel data nonlinear model is introduced. Because the solutions of the SVR-based panel data model are sparse, the algorithm is robust. Finally, the SVR-based fixed-effect panel data models (linear and nonlinear) are applied to air quality index (AQI) forecasting for the cities of the Jing-Jin-Ji district in China. Compared with the least-squares dummy-variable (LSDV) model, the proposed SVR-based fixed-effect panel data models have good training accuracy and test accuracy, and they provide an effective estimation for nonlinear panel data models.