Flexible Expectile Regression in Reproducing Kernel Hilbert Spaces

ABSTRACT The expectile, first introduced by Newey and Powell in 1987 in the econometrics literature, has recently become increasingly popular in risk management and capital allocation for financial institutions due to its desirable properties such as coherence and elicitability. The current standard tool for expectile regression analysis is the multiple linear expectile regression proposed by Newey and Powell in 1987. The growing applications of expectile regression motivate us to develop a much more flexible nonparametric multiple expectile regression in a reproducing kernel Hilbert space. The resulting estimator, called KERE, has multiple advantages over the classical multiple linear expectile regression by incorporating nonlinearity, nonadditivity, and complex interactions in the final estimator. The kernel learning theory of KERE is established. We develop an efficient algorithm inspired by the majorization-minimization principle for computing the entire solution path of KERE, and show that the algorithm converges at least at a linear rate. Extensive simulations show the very competitive finite-sample performance of KERE. We further demonstrate the application of KERE using personal computer price data. Supplementary materials for this article are available online.


Introduction
The expectile introduced by Newey and Powell (1987) is becoming an increasingly popular tool in risk management and capital allocation for financial institutions. Let Y be a random variable; the ω-expectile of Y, denoted by f_ω, is defined as the unique solution of

ω E[(Y − f_ω)_+] = (1 − ω) E[(f_ω − Y)_+]. (1)

In financial applications, the expectile has been widely used as a tool for efficient estimation of the expected shortfall (ES) through a one-to-one mapping between the two (Taylor, 2008; Hamidi et al., 2014; Xie et al., 2014). More recently, many researchers have started to advocate the expectile as a favorable alternative to the two other commonly used risk measures, Value at Risk (VaR) and ES, due to its desirable properties such as coherence and elicitability (Kuan et al., 2009; Gneiting, 2011; Ziegel, 2014). VaR has been criticized mainly for two drawbacks: first, it does not reflect the magnitude of the extreme losses for the underlying risk, as it is determined only by the probability of such losses; second, VaR is not a coherent risk measure because it lacks the sub-additivity property (Emmer et al., 2013; Embrechts et al., 2014). Hence the risk of merged portfolios can be worse than the sum of the risks taken separately, which contradicts the notion that risk can be reduced by diversification (Artzner et al., 1999). Unlike VaR, ES is coherent and it considers the magnitude of the losses when VaR is exceeded. However, a major problem with ES is that it cannot be reliably backtested, in the sense that competing forecasts of ES cannot be properly evaluated through comparison with realized observations. Gneiting (2011) attributed this weakness to the fact that ES is not elicitable. Ziegel (2014) further showed that expectiles are the only risk measures that are both coherent and elicitable.
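As a concrete illustration of this definition (our own sketch, not from the paper), the ω-expectile of a sample can be computed by iterating the weighted-mean fixed point implied by the first-order condition of the asymmetric squared loss; the function name `expectile` is ours:

```python
import numpy as np

def expectile(y, w, tol=1e-10, max_iter=200):
    # sample w-expectile: minimizer of mean |w - I(y - f < 0)| (y - f)^2,
    # found by iterating the weighted-mean fixed point of the first-order condition
    f = y.mean()                            # w = 0.5 gives the mean itself
    for _ in range(max_iter):
        wt = np.where(y > f, w, 1.0 - w)    # asymmetric observation weights
        f_new = np.sum(wt * y) / np.sum(wt)
        if abs(f_new - f) < tol:
            return f_new
        f = f_new
    return f

rng = np.random.default_rng(0)
y = rng.normal(size=100_000)
m05 = expectile(y, 0.5)    # coincides with the sample mean
m09 = expectile(y, 0.9)    # above the mean: upside deviations weighted more
```

At ω = 0.5 the weights are symmetric and the expectile reduces to the mean, while ω > 0.5 pulls the solution toward the upper tail, mirroring the quantile's behavior but with squared rather than absolute loss.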
In applications we often need to estimate the conditional expectile of the response variable given a set of covariates. This is called expectile regression. Statisticians and econometricians pioneered the study of expectile regression. Theoretical properties of the multiple linear expectile regression were first studied in Newey and Powell (1987) and Efron (1991). Yao and Tong (1996) studied a nonparametric estimator of conditional expectiles based on local linear polynomials with a one-dimensional covariate and established the asymptotic properties of the estimator. A semiparametric expectile regression model relying on penalized splines was proposed by Sobotka and Kneib (2012). Yang and Zou (2015) adopted the gradient tree boosting algorithm for expectile regression.
In this paper, we propose a flexible nonparametric expectile regression estimator constructed in a reproducing kernel Hilbert space (RKHS) (Wahba, 1990). Our contributions in this article are twofold. First, we extend the parametric expectile model to a fully nonparametric multiple regression setting and develop the corresponding kernel learning theory.
Second, we propose an efficient algorithm that adopts the majorization-minimization principle for computing the entire solution path of the kernel expectile regression, and we provide a numerical convergence analysis for the algorithm. Moreover, we provide an accompanying R package that allows other researchers and practitioners to use the kernel expectile regression.
The rest of the paper is organized as follows. In Section 2 we present the kernel expectile regression and develop an asymptotic learning theory. Section 3 derives the fast algorithm for solving the solution paths of the kernel expectile regression, and its numerical convergence is examined. In Section 4 we use simulation models to show the high prediction accuracy of the kernel expectile regression. We analyze the personal computer price data in Section 5. The technical proofs are relegated to an appendix.

Kernel Expectile Regression
2.1 Methodology

Newey and Powell (1987) showed that the ω-expectile f_ω of Y has an equivalent definition given by

f_ω = argmin_{u ∈ R} E[φ_ω(Y − u) − φ_ω(Y)], (2)

where

φ_ω(t) = |ω − I(t < 0)| · t². (3)

Consequently, Newey and Powell (1987) showed that the ω-expectile of Y given the set of covariates X = x, denoted by f_ω(x), can be defined as

f_ω(x) = argmin_{f(x)} E[φ_ω(Y − f(x)) − φ_ω(Y) | X = x]. (4)

Newey and Powell (1987) developed the multiple linear expectile regression based on (4).
The linear expectile estimator can be too restrictive in many real applications, and researchers have considered more flexible expectile regression estimators. For example, Yao and Tong (1996) studied a local linear-polynomial expectile estimator with a one-dimensional covariate. However, the local fitting approach is not suitable when the dimension of the explanatory variables is more than five. This limitation of local smoothing motivated Yang and Zou (2015) to develop a nonparametric expectile regression estimator based on the gradient tree boosting algorithm. The tree-boosted expectile regression minimizes the empirical expectile loss

min_{f ∈ F} (1/n) Σ_{i=1}^n φ_ω(y_i − f(x_i)), (5)

where each candidate function f ∈ F is assumed to be an ensemble of regression trees.
In this article, we consider another nonparametric approach to multiple expectile regression. To motivate our method, let us first look at the special expectile regression with ω = 0.5. It is easy to see from (3) and (4) that if ω = 0.5, expectile regression reduces to ordinary conditional mean regression. A host of flexible regression methods have been well studied for conditional mean regression, such as the generalized additive model, regression trees, boosted regression trees, and function estimation in a reproducing kernel Hilbert space (RKHS). Hastie et al. (2009) provided excellent introductions to all these methods. In particular, mean regression in a RKHS has a long history and a rich record of success (Wahba, 1990). So in the present work we propose the kernel expectile regression in a RKHS.
Denote by H_K the Hilbert space generated by a positive definite kernel K. By Mercer's theorem, the kernel K has an eigen-expansion

K(x, x') = Σ_{j=1}^∞ γ_j φ_j(x) φ_j(x'), (6)

with eigenvalues γ_j ≥ 0 and eigenfunctions φ_j. Some of the most widely used kernel functions are the linear kernel K(x, x') = ⟨x, x'⟩, the dth-degree polynomial kernel K(x, x') = (1 + ⟨x, x'⟩)^d, and the Gaussian radial basis kernel K(x, x') = exp(−‖x − x'‖²/σ²). Other kernels can be found in Smola et al. (1998) and Hastie et al. (2009).
Given n observations {(x_i, y_i)}_{i=1}^n, the kernel expectile regression estimator (KERE) is defined as

(α̂_0, f̂_n) = argmin_{α_0 ∈ R, f ∈ H_K} Σ_{i=1}^n φ_ω(y_i − α_0 − f(x_i)) + λ ‖f‖²_{H_K}, (7)

where x_i ∈ R^p and α_0 ∈ R. The estimated conditional ω-expectile is α̂_0 + f̂_n(x). Sometimes one can absorb the intercept term into the nonparametric function f; we keep the intercept term in order to make a direct comparison with the multiple linear expectile regression.
Although (7) is often an optimization problem in an infinite-dimensional space, depending on the choice of the kernel, the representer theorem (Wahba, 1990) ensures that the solution to (7) always lies in the finite-dimensional subspace spanned by the kernel functions evaluated at the observed data, i.e.,

f(x) = Σ_{i=1}^n α_i K(x_i, x) (8)

for some {α_i}_{i=1}^n ⊂ R. By (8) and the reproducing property of the RKHS (Wahba, 1990) we have

‖f‖²_{H_K} = Σ_{i=1}^n Σ_{j=1}^n α_i α_j K(x_i, x_j). (9)

Based on (8) and (9) we can rewrite the minimization problem (7) in a finite-dimensional space:

(α̂_0, α̂) = argmin_{α_0, α} Σ_{i=1}^n φ_ω( y_i − α_0 − Σ_{j=1}^n α_j K(x_i, x_j) ) + λ Σ_{i=1}^n Σ_{j=1}^n α_i α_j K(x_i, x_j). (10)

The corresponding KERE estimator is α̂_0 + Σ_{i=1}^n α̂_i K(x_i, x). The computation of KERE is based on (10), and we use both (7) and (10) for the theoretical analysis of KERE.
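To illustrate the finite-dimensional problem (10), here is a brute-force sketch (our own code, not the paper's algorithm): it plugs a Gaussian kernel into (10) and hands the objective to a generic quasi-Newton solver. The function names `rbf_kernel` and `kere_fit` are ours; Section 3 develops a far more efficient MM algorithm with solution paths.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, Z, sigma2=1.0):
    # Gaussian RBF kernel matrix: K(x, z) = exp(-||x - z||^2 / sigma2)
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def kere_fit(X, y, w=0.5, lam=1.0, sigma2=1.0):
    # direct minimization of (10): sum_i phi_w(y_i - a0 - (K a)_i) + lam a'Ka
    n = len(y)
    K = rbf_kernel(X, X, sigma2)

    def obj(theta):
        a0, a = theta[0], theta[1:]
        r = y - a0 - K @ a
        return np.sum(np.abs(w - (r < 0)) * r ** 2) + lam * (a @ K @ a)

    res = minimize(obj, np.zeros(n + 1), method="L-BFGS-B")
    return res.x[0], res.x[1:], K

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(80, 1))
y = np.sin(2 * X[:, 0]) + 0.3 * rng.normal(size=80)
a0, a, K = kere_fit(X, y, w=0.5, lam=0.1)
fitted = a0 + K @ a        # in-sample conditional 0.5-expectile (the mean)
```

The objective is convex and continuously differentiable (φ_ω has a Lipschitz derivative), so a generic smooth solver works here, but it scales poorly with n and must be rerun from scratch for every λ, which is exactly what the path algorithm of Section 3 avoids.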

Kernel learning theory
In this section we develop a kernel learning theory for KERE. We first discuss the criterion for evaluating an estimator in the context of expectile regression. Given the loss function φ_ω, the expectile risk

R(f, α_0) = E[ φ_ω(Y − α_0 − f(X)) ]

is a more appropriate evaluation measure in practice than the squared error risk defined as E[ (α_0 + f(X) − f*_ω(X))² ], where f*_ω(x) is the true conditional ω-expectile function. The reason is simple. Let (f̂, α̂_0) be any estimator based on the training data. By the law of large numbers,

(1/m) Σ_{j=1}^m φ_ω( y_j − α̂_0 − f̂(x_j) ) → R(f̂, α̂_0),

where {(x_j, y_j)}_{j=1}^m is an independent test sample. Thus one can use techniques such as cross-validation to estimate R(f, α_0). Additionally, the squared error risk depends on the function f*_ω(x), which is usually unknown. Thus we prefer to use R(f̂, α̂_0) over the squared error risk. Of course, if we assume a classical regression model (when ω = 0.5) such as y = f(x) + error, where the error is independent of x with mean zero and constant variance, then R(f̂, α̂_0) simply equals the squared error risk plus a constant. Unfortunately, such equivalence breaks down for other values of ω and for more general models.
After choosing the risk function, the goal is to minimize the risk. Since the estimation is typically done in a function space, the minimization is carried out in the chosen function space; in our case, the function space is the RKHS generated by a kernel function K. Thus the ideal risk is defined as

R*_{f,α_0} = inf_{α_0 ∈ R, f ∈ H_K} R(f, α_0).

Consider the kernel expectile regression estimator (f̂, α̂_0) as defined in (7), based on a training sample D_n = {(x_i, y_i)}_{i=1}^n drawn i.i.d. from an unknown distribution. The observed risk of KERE is R(f̂, α̂_0). It is desirable to show that R(f̂, α̂_0) approaches the ideal risk R*_{f,α_0}. It is important to note that R(f̂, α̂_0) is a random quantity that depends on the training sample D_n, so it is not the usual deterministic risk function. However, we can consider the expectation of R(f̂, α̂_0), which we call the expected observed risk:

Expected observed risk: E_{D_n} R(f̂, α̂_0).

Our goal is to show that R(f̂, α̂_0) converges to R*_{f,α_0}. We achieve this by showing that the expected observed risk converges to the ideal risk, i.e., lim_{n→∞} E_{D_n} R(f̂, α̂_0) = R*_{f,α_0}. By definition, we always have R(f̂, α̂_0) ≥ R*_{f,α_0}. Then, by the Markov inequality, for any ε > 0,

P( R(f̂, α̂_0) − R*_{f,α_0} > ε ) ≤ ε^{-1} ( E_{D_n} R(f̂, α̂_0) − R*_{f,α_0} ) → 0.

The rigorous statement of our result is Theorem 1 (proved in the Appendix), which shows that under a bounded-kernel condition sup_x K(x, x) ≤ M < ∞ and a moment condition on Y, we have lim_{n→∞} E_{D_n} R(f̂, α̂_0) = R*_{f,α_0}, and hence R(f̂, α̂_0) → R*_{f,α_0} in probability. The Gaussian kernel is perhaps the most popular kernel for nonlinear learning. For the Gaussian kernel K(x, x') = exp(−‖x − x'‖²/c), we have M = 1. For any radial kernel of the form K(x, x') = h(‖x − x'‖), where h is a smooth decreasing function, we see that M = h(0), which is finite as long as h(0) < ∞.

Derivation
The majorization-minimization (MM) algorithm is a very successful technique for solving a wide range of statistical models (Lange et al., 2000; Hunter and Lange, 2004; Wu and Lange, 2010; Zhou and Lange, 2010; Lange and Zhou, 2014). In this section, we develop an algorithm inspired by the MM principle for solving the optimization problem (10). Note that the loss function φ_ω in (10) does not have a second derivative everywhere. We adopt the MM principle to find the minimizer by iteratively minimizing a surrogate function that majorizes the objective function in (10).
To further simplify the notation, we write α = (α_0, α_1, α_2, ..., α_n)' and K_i = (1, K(x_i, x_1), ..., K(x_i, x_n))' for i = 1, ..., n. Then (10) is reduced to the minimization problem

min_α F_ω,λ(α), where F_ω,λ(α) = Σ_{i=1}^n φ_ω(y_i − K_i'α) + λ Σ_{i=1}^n Σ_{j=1}^n α_i α_j K(x_i, x_j), (12)

and ω is given for computing the corresponding level of the conditional expectile. We also assume for the time being that λ is given. A smart algorithm for computing the solution over a sequence of λ values will be studied in Section 3.3.
Our approach is to minimize (12) by iteratively updating α using the minimizer of a majorization function of F_ω,λ(α). Specifically, at the k-th step of the algorithm, where k = 0, 1, 2, ..., let α^{(k)} be the current value of α. We find a majorization function Q(α | α^{(k)}) of F_ω,λ(α) at α^{(k)} and update α by minimizing Q(α | α^{(k)}) rather than the actual objective function:

α^{(k+1)} = argmin_α Q(α | α^{(k)}).

To construct the majorization function Q(α | α^{(k)}) for F_ω,λ(α) at the k-th iteration, we use the following lemma.

Lemma 1. The expectile loss φ_ω has a Lipschitz continuous derivative φ'_ω, i.e.,

|φ'_ω(a) − φ'_ω(b)| ≤ L |a − b| for all a, b,

where L = 2 max(1 − ω, ω). This further implies that φ_ω has a quadratic upper bound:

φ_ω(a) ≤ φ_ω(b) + φ'_ω(b)(a − b) + (L/2)(a − b)²,

with equality when a = b.
Assume the current residual is r_i^{(k)} = y_i − K_i'α^{(k)}. By Lemma 1, we obtain the quadratic upper bound

φ_ω(y_i − K_i'α) ≤ φ_ω(r_i^{(k)}) − φ'_ω(r_i^{(k)}) K_i'(α − α^{(k)}) + (L/2) [K_i'(α − α^{(k)})]².

Therefore a majorization function of F_ω,λ(α) can be written as

Q(α | α^{(k)}) = Σ_{i=1}^n { φ_ω(r_i^{(k)}) − φ'_ω(r_i^{(k)}) K_i'(α − α^{(k)}) + (L/2) [K_i'(α − α^{(k)})]² } + λ Σ_{i=1}^n Σ_{j=1}^n α_i α_j K(x_i, x_j), (20)

which can alternatively be written in matrix form with Hessian matrix K_u built from the kernel matrix, the ridge penalty, and the Lipschitz constant L, where 1 denotes an n × 1 vector of all ones. Since (20) is a convex quadratic function of α, our algorithm updates α using its minimizer, which is available in closed form as the solution of a linear system:

α^{(k+1)} = argmin_α Q(α | α^{(k)}). (23)

The details of the whole procedure for solving (12) are described in Algorithm 1.
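For readers who want to experiment, here is one way the MM iteration implied by Lemma 1 can be implemented. The reduction of each majorizer minimization to a penalized least-squares solve with a fixed coefficient matrix is our own derivation and may differ in details from the paper's exact update (23); the function names are ours.

```python
import numpy as np

def phi_grad(t, w):
    # derivative of the expectile loss: phi_w'(t) = 2|w - I(t<0)| t,
    # Lipschitz with constant L = 2 max(1-w, w) as in Lemma 1
    return 2 * np.abs(w - (t < 0)) * t

def kere_mm(K, y, w=0.5, lam=1.0, max_iter=500, tol=1e-9):
    # MM iterations for (12): each step minimizes the quadratic majorizer
    # from Lemma 1, a penalized least-squares problem whose coefficient
    # matrix A does not change across iterations, so it is built once.
    n = len(y)
    L = 2 * max(1 - w, w)
    a0, a = 0.0, np.zeros(n)
    A = np.empty((n + 1, n + 1))
    A[0, 0] = n
    A[0, 1:] = A[1:, 0] = K.sum(axis=0)
    A[1:, 1:] = K @ K + (2 * lam / L) * K
    for _ in range(max_iter):
        eta = a0 + K @ a
        z = eta + phi_grad(y - eta, w) / L        # working response
        theta = np.linalg.solve(A, np.concatenate(([z.sum()], K @ z)))
        done = abs(theta[0] - a0) + np.abs(theta[1:] - a).sum() < tol
        a0, a = theta[0], theta[1:]
        if done:
            break
    return a0, a

# small demo on synthetic data
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 1))
K = np.exp(-(X - X.T) ** 2)                       # Gaussian kernel, sigma^2 = 1
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=40)
a0, a = kere_mm(K, y, w=0.9, lam=0.5)
```

Because the majorizer is tangent to the objective at α^{(k)}, each solve can only decrease F_ω,λ, and at a fixed point the gradient of the original objective vanishes; in particular the intercept equation forces the loss derivatives φ'_ω(r_i) to sum to zero at convergence.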

Convergence analysis
Now we provide the convergence analysis of Algorithm 1. Lemma 2 below shows that the sequence (α^{(k)}) generated by the algorithm converges to the unique global minimizer α̂ of the optimization problem.
Lemma 2. If we update α^{(k+1)} by using (23), then the following results hold:
1. The descent property of the objective function: F_ω,λ(α^{(k+1)}) ≤ Q(α^{(k+1)} | α^{(k)}) ≤ Q(α^{(k)} | α^{(k)}) = F_ω,λ(α^{(k)}).
Algorithm 1 The algorithm for the minimization of (12).
Theorem 2. Denote by α̂ the unique minimizer of (12), and let Λ_k denote the ratio of successive optimality gaps, Λ_k = [F_ω,λ(α^{(k+1)}) − F_ω,λ(α̂)] / [F_ω,λ(α^{(k)}) − F_ω,λ(α̂)]. Note that when Λ_k = 0 we are in the trivial case α^{(j)} = α̂ for all j > k. Define Γ = sup_k Λ_k. Assume that Σ_{i=1}^n K_i K_i' is a positive definite matrix. Then we have the following results:
1. 0 ≤ Λ_k ≤ Γ < 1.
2. The sequence (F_ω,λ(α^{(k)})) has a linear convergence rate no greater than Γ.
3. The sequence (α^{(k)}) has a linear convergence rate no greater than Γ γ_max(K_u)/γ_min(K_l), where γ_max(·) and γ_min(·) denote the largest and smallest eigenvalues, and K_u and K_l are the upper and lower bounding matrices appearing in the proof.
Theorem 2 says that the convergence rate of Algorithm 1 is at least linear. In our numerical experiments, we have found that Algorithm 1 converges very fast: the convergence criterion is usually met within 15 iterations.

Implementation
We discuss some techniques used in our implementation to further improve the computational speed of the algorithm.
Usually expectile models are computed by applying Algorithm 1 to a descending sequence of λ values. To create a sequence {λ_m}_{m=1}^M, we place M − 2 points uniformly (on the log scale) between the starting point λ_max and the ending point λ_min, so that the λ sequence has length M.
The default for M is 100, hence λ_1 = λ_max and λ_100 = λ_min. We adopt the warm-start trick to compute the solution paths along the λ values: once we have obtained the solution α̂_{λ_m} at λ_m, it is used as the initial value for computing the solution at λ_{m+1} in Algorithm 1.
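The λ-sequence construction and the warm-start loop can be sketched as follows (our code; `kere_solve` in the commented lines is a hypothetical solver standing in for Algorithm 1):

```python
import numpy as np

def lambda_path(lam_max, lam_min, M=100):
    # M values equally spaced on the log scale, descending from lam_max to lam_min
    return np.exp(np.linspace(np.log(lam_max), np.log(lam_min), M))

lams = lambda_path(10.0, 1e-3, M=100)

# warm start (sketch): reuse each solution as the initial value for the next lambda
# alpha = initial_value
# for lam in lams:
#     alpha = kere_solve(K, y, w, lam, init=alpha)   # hypothetical solver
```

Warm starting works well here because neighboring λ values on a log-spaced grid yield nearby solutions, so each call to the solver needs only a few MM iterations.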
Another computational trick is based on the fact that in Algorithm 1 the inverse of K_u does not have to be recalculated from scratch for each λ. There is an easy way to update K_u^{-1} along λ_1, λ_2, .... Because K_u can be partitioned into a 2 × 2 block matrix, by Theorem 8.5.11 of Harville (2008) K_u^{-1} can be expressed through the block-inverse formula (25).
In (25) only Q_λ^{-1} changes with λ; therefore computing K_u^{-1} for a different λ requires only the updating of Q_λ^{-1}. Observe that Q_λ is the sum of two submatrices A_λ and B. By the Sherman-Morrison formula (Sherman and Morrison, 1950), with g = trace(B A_λ^{-1}), obtaining Q_λ^{-1} for a different λ only requires A_λ^{-1}, which can be computed efficiently using the eigen-decomposition K = UDU', as shown in (27). Thus the computation of K_u^{-1}(λ) depends only on λ, D, U, and ω. Since D, U, and ω stay unchanged, they need to be computed only once; to obtain K_u^{-1}(λ) for a different λ in the sequence, we simply plug the new λ into (27).
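The eigen-decomposition shortcut can be verified numerically. The sketch below is our own illustration of the generic identity behind (27), with `c` standing for the multiplier of λ in the shifted matrix: a single `eigh` factorization of K serves every λ in the sequence.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(50, 50))
K = B @ B.T + 1e-6 * np.eye(50)   # stand-in for a positive definite kernel matrix

d, U = np.linalg.eigh(K)          # one-time factorization: K = U diag(d) U'

def inv_shift(lam, c=1.0):
    # (K + c*lam*I)^{-1} = U diag(1/(d + c*lam)) U' -- no refactorization per lambda
    return (U / (d + c * lam)) @ U.T
```

Each call costs one diagonal update and two matrix products instead of a fresh O(n³) inversion, which is where the per-λ savings in the path algorithm come from.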
The implementation for computing KERE over a sequence of λ values thus combines Algorithm 1, the warm starts, and the update formulas (25), (26), and (27).
Our algorithm has been implemented in an official R package KERE, which is publicly available from the Comprehensive R Archive Network at http://cran.r-project.org/web/packages/KERE/index.html.

Simulation
In this section, we conduct extensive simulations to show the excellent finite-sample performance of KERE. We investigate how the performance of KERE is affected by various model and error distribution settings, training sample sizes, and other characteristics. Although many kernels are available, throughout this section we use the commonly recommended (Hastie et al., 2009) Gaussian radial basis function (RBF) kernel K(x_i, x_j) = exp(−‖x_i − x_j‖²/σ²). We select the best pair of kernel bandwidth σ² and regularization parameter λ by two-dimensional five-fold cross-validation. All computations were done on an Intel Core i7-3770 processor at 3.40 GHz.
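The two-dimensional five-fold cross-validation can be sketched generically as follows (our code; `fit` and `loss` are placeholders for a model-fitting routine and the expectile loss φ_ω, not functions from the KERE package):

```python
import numpy as np

def cv_grid(X, y, w, sigma2_grid, lam_grid, fit, loss, n_folds=5, seed=0):
    # two-dimensional n_folds-fold cross-validation over (sigma^2, lambda);
    # fit(Xtr, ytr, w, sigma2, lam) must return a predictor function and
    # loss(residual, w) the expectile loss -- both supplied by the caller
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds        # random fold assignment
    best, best_err = None, np.inf
    for s2 in sigma2_grid:
        for lam in lam_grid:
            err = 0.0
            for k in range(n_folds):
                tr, va = folds != k, folds == k
                pred = fit(X[tr], y[tr], w, s2, lam)
                err += loss(y[va] - pred(X[va]), w).mean()
            if err < best_err:
                best, best_err = (s2, lam), err
    return best
```

Scoring validation folds with the same asymmetric loss used for fitting matters here: for ω ≠ 0.5, squared-error CV would select parameters tuned for the mean rather than for the target expectile.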

Simulation I: single covariate case
The model used for this simulation is

y = sin(0.7x) + x²/20 + ((|x| + 1)/5) ε, (28)

which is heteroscedastic since the scale of the error depends on the single covariate x ~ U[−8, 8]. We used a single covariate so that the estimator can be visualized nicely.
We used two different error distributions: the Laplace distribution and a mixed normal distribution.
We generated n = 400 training observations from (28), on which five expectile models with levels ω ∈ {0.05, 0.2, 0.5, 0.8, 0.95} were fitted. We selected the best (σ², λ) pair by two-dimensional five-fold cross-validation. We generated an additional n = 2000 test observations for evaluating the mean absolute deviation (MAD) of the final estimate.
Let f_ω denote the true expectile function and f̂_ω the predicted expectile. The mean absolute deviation is defined as

MAD = (1/m) Σ_{j=1}^m | f̂_ω(x_j) − f_ω(x_j) |,

where the true expectile is f_ω(x) = sin(0.7x) + x²/20 + ((|x| + 1)/5) b_ω(ε), and b_ω(ε) is the ω-expectile of the error distribution ε. The simulations were repeated 100 times under the above settings. We recorded the MADs for different expectile levels in Table 1. We find that the accuracy of the expectile prediction with mixed normal errors is generally better than that with Laplace errors. For the symmetric Laplace case, the prediction MADs are also symmetric around ω = 0.5, while for the skewed mixed-normal case the MADs are skewed. To show that KERE works as expected, Figure 1 compares the theoretical and predicted expectile curves based on KERE with ω ∈ {0.05, 0.2, 0.5, 0.8, 0.95}. The corresponding theoretical and predicted curves are very close; theoretically, the two should become identical as n → ∞.
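Since b_ω(ε) rarely has a closed form, the true expectile curves used for evaluation can be obtained numerically. The sketch below (our code, not from the paper) finds b_ω as the root of the first-order condition on a large Monte Carlo sample of the error distribution:

```python
import numpy as np
from scipy.optimize import brentq

def b_omega(eps_sample, w):
    # w-expectile b_w(eps): root of the first-order condition
    #   w * E(eps - b)_+ = (1 - w) * E(b - eps)_+
    # evaluated on a Monte Carlo sample of the error distribution
    def foc(b):
        return (w * np.mean(np.maximum(eps_sample - b, 0.0))
                - (1 - w) * np.mean(np.maximum(b - eps_sample, 0.0)))
    return brentq(foc, eps_sample.min(), eps_sample.max())

rng = np.random.default_rng(3)
eps = rng.laplace(size=200_000)            # Laplace errors, as in Simulation I

x = np.linspace(-8, 8, 401)
curves = {w: np.sin(0.7 * x) + x ** 2 / 20 + (np.abs(x) + 1) / 5 * b_omega(eps, w)
          for w in [0.05, 0.2, 0.5, 0.8, 0.95]}
```

The first-order-condition function is strictly decreasing in b and changes sign over the sample range, so a bracketing root finder converges reliably.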

Simulation II: multiple covariate case
In this part we illustrate that KERE works very well for target functions that are nonadditive and/or contain complex interactions. We generated data {(x_i, y_i)}_{i=1}^n according to

y_i = f_1(x_i) + |f_2(x_i)| ε_i,

where the predictors x_i were generated from a joint normal distribution N(0, I_p) with p = 10.
For the error term ε_i we consider three types of distributions:
1. Normal distribution, ε_i ~ N(0, 1).
2. Student's t-distribution with four degrees of freedom, ε_i ~ t_4.
3. A skewed mixed normal distribution.
We now describe the construction of f_1 and f_2. In the homoscedastic model, we let f_2(x_i) = 1 and generate f_1 by the "random function generator" model (Friedman, 2000):

f_1(x) = Σ_{l=1}^{20} a_l g_l(x_l),

where the coefficients {a_l}_{l=1}^{20} are sampled from the uniform distribution a_l ~ U[−1, 1], and x_l is a random subset of the p-dimensional predictor x of size p_l = min(⌊1.5 + r⌋, p), with r sampled from an exponential distribution, r ~ Exp(0.5). Each g_l(x_l) is a p_l-dimensional Gaussian function

g_l(x_l) = exp( −(1/2) (x_l − μ_l)' V_l^{-1} (x_l − μ_l) ),

where μ_l follows the distribution N(0, I_{p_l}). The p_l × p_l covariance matrix V_l is defined by V_l = U_l D_l U_l', where U_l is a random orthogonal matrix and D_l is a diagonal matrix of random eigenvalues. In the heteroscedastic model, f_1 is the same as in the homoscedastic model and f_2 is independently generated by the "random function generator" model.
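A sketch of the generator (our code): the eigenvalue distribution of V_l (uniform on [0.1, 2]) is an assumption on our part, since the source truncates that detail, and Exp(0.5) is read as rate 0.5, i.e., mean 2.

```python
import numpy as np

def random_function(p=10, n_terms=20, seed=0):
    # Friedman's "random function generator": f(x) = sum_l a_l g_l(x_l),
    # each g_l a Gaussian bump on a random subset of the predictors
    rng = np.random.default_rng(seed)
    terms = []
    for _ in range(n_terms):
        a = rng.uniform(-1.0, 1.0)
        pl = min(int(1.5 + rng.exponential(scale=2.0)), p)   # r ~ Exp, mean 2
        idx = rng.choice(p, size=pl, replace=False)          # random subset x_l
        mu = rng.normal(size=pl)                             # mu_l ~ N(0, I)
        Q, _ = np.linalg.qr(rng.normal(size=(pl, pl)))       # random orthogonal U_l
        d = rng.uniform(0.1, 2.0, size=pl)                   # assumed eigenvalue law
        Vinv = Q @ np.diag(1.0 / d) @ Q.T                    # V_l^{-1} = U_l D_l^{-1} U_l'
        terms.append((a, idx, mu, Vinv))

    def f(X):
        out = np.zeros(len(X))
        for a, idx, mu, Vinv in terms:
            Z = X[:, idx] - mu
            out += a * np.exp(-0.5 * np.einsum("ij,jk,ik->i", Z, Vinv, Z))
        return out

    return f

f1 = random_function(seed=7)
```

Because each term mixes a random subset of the coordinates through a rotated Gaussian bump, the resulting target is nonadditive with genuine interactions, which is exactly the regime this simulation is designed to probe.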
An additional test set with n = 1200 observations was generated for evaluating the MADs between the fitted expectile f̂_ω and the true expectile f_ω. Note that the true expectile function is f_1(x) + b_ω(ε) in the homoscedastic model and f_1(x) + |f_2(x)| b_ω(ε) in the heteroscedastic model, where b_ω(ε) is the ω-expectile of the error distribution. In Figures 2 and 3 we show box-plots of the empirical distributions of the MADs, and in Table 2 we report the average MADs and the corresponding standard errors. We see that KERE delivers accurate expectile predictions in all cases, although the prediction error is relatively more volatile in the heteroscedastic case, as expected: in the mean regression case (ω = 0.5), the averaged MADs in the homoscedastic and heteroscedastic models are very close, but the difference grows as ω moves away from 0.5. We also observe that the prediction MADs for the symmetric distributions, normal and t_4, appear to be symmetric around the conditional mean (ω = 0.5), whereas the prediction MADs in the skewed mixed-normal cases are asymmetric. The total computation times for conducting two-dimensional five-fold cross-validation and fitting the final model with the chosen parameters (σ², λ) are reported in Table 3. The algorithm solves all models in under 20 seconds, regardless of the choice of error distribution.
We next study how sample size affects predictive performance and computational time.
We fit expectile models with ω ∈ {0.1, 0.5, 0.9} using training sets of various sizes n ∈ {250, 500, 1000} and evaluate the prediction accuracy using an independent test set of size n = 2000. We report the averaged MADs and the corresponding averaged timings in Table 4. Since the results are very close across model settings, only the result from the heteroscedastic model with mixed-normal error is presented. We find that the sample size strongly affects both predictive performance and timings: larger samples give models with higher predictive accuracy at the expense of computational cost, with the timings at least quadrupling as the sample size doubles.

Real data application
In this section we illustrate KERE by applying it to the Personal Computer Price Data studied in Stengos and Zacharias (2006). The data, collected from PC Magazine between January 1993 and November 1995, contain 6259 observations, each consisting of the advertised price and features of personal computers sold in the United States. The nine main price determinants of PCs are summarized in Table 5. The price and the continuous variables, except the time trend, are on the logarithmic scale. We consider a hedonic analysis, where the price of a product is modeled as a function of the implicit prices of its various components; see Triplett (1989). The intertemporal effect of the implicit PC-component prices is captured by including the time trend as one of the explanatory variables. The nonlinearity and the interactions of the components with the time trend in the data, shown by Stengos and Zacharias (2006), suggest that linear expectile regression may lead to a misspecified model. Since there is no general theory supporting any particular functional form for PC prices (Stengos and Zacharias, 2006), we use KERE to capture the nonlinear effects and higher-order interactions of the characteristics on price and to avoid severe model misspecification.
We randomly sampled 1/10 of the observations for training and tuning, using two-dimensional five-fold cross-validation to select an optimal (σ², λ) pair, and used the remaining 9/10 of the observations as the test set for calculating the prediction error defined by the average expectile loss

(1/m) Σ_{j=1}^m φ_ω( y_j − α̂_0 − f̂(x_j) )

over the m test observations. For comparison, we also computed the prediction errors of the linear expectile regression models under the same setting. All prediction errors were computed at seven expectile levels ω ∈ {0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95}. We repeated this process 100 times and report the average prediction errors and their corresponding standard errors in Table 6.
We also show box-plots of the empirical distributions of the prediction errors in Figure 4.

Table 6: The averaged prediction error and the corresponding standard errors for the Personal Computer Price Data based on 100 independent runs. The expectile levels are ω ∈ {0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95}. The numbers in this table are of the order of 10^{-3}.

Appendix: Technical Proofs

Some technical lemmas for Theorem 1
We first present some technical lemmas and their proofs. These lemmas are used to prove Theorem 1.
Lemma 4. The solution to (10) can alternatively be obtained by solving the optimization problem (29), where g is defined in (30).

Proof. Let α = (α_1, α_2, ..., α_n)'. Since both objective functions in (10) and (29) are convex, we only need to show that they share a common stationary point. Setting the derivatives of (10) with respect to α to zero, we find that the stationary point of (10) satisfies (31), and setting the derivative of (10) with respect to α_0 to zero yields (32). Combining (31) and (32), (32) can be simplified to (33). In comparison, the Lagrange function of (29) is (34), whose first-order conditions are (35) and (36). Noting that 2λ φ*_ω(α_i) = φ*_ω(2λα_i) and that the derivative of φ*_ω is the inverse function of φ'_ω, and letting ν = α_0, (31) and (35) are equivalent. Therefore (10) and (29) have a common stationary point and hence a common minimizer.
Lemma 5. For the g function defined in (30), we have the two-sided bounds stated in the display below.

Proof. It is clear that the second derivative of g is bounded above by K + (λ/min(1 − ω, ω)) I and bounded below by the corresponding matrix with min replaced by max. Hence, when α and α̃ are fixed and g'(α̃) = 0, the maximum of g(α) − g(α̃) is attained when the second derivative of g achieves its upper bound, and the minimum is attained when the second derivative achieves its lower bound.
The next lemma establishes the basis for the so-called leave-one-out analysis (Jaakkola and Haussler, 1999; Joachims, 2000; Forster and Warmuth, 2002; Zhang, 2003). The basic idea is that the expected observed risk is equivalent to the expected leave-one-out error. Let D_{n+1} = {(x_i, y_i)}_{i=1}^{n+1} be a random sample of size n + 1, and let D^{[i]}_{n+1} be the subset of D_{n+1} with the i-th observation removed.
Let (f̂^{[i]}, α̂^{[i]}_0) be the estimator trained on D^{[i]}_{n+1}. The leave-one-out error is defined as the averaged prediction error over the observations (x_i, y_i), each evaluated with the estimator trained with (x_i, y_i) excluded:

Leave-one-out error: (1/(n+1)) Σ_{i=1}^{n+1} φ_ω( y_i − α̂^{[i]}_0 − f̂^{[i]}(x_i) ).

The expected observed risk E_{D_n} R(f̂, α̂_0) is equal to the expected leave-one-out error over D_{n+1}.

Proof.
Denote by (f̂_{(n+1)}, α̂_{0,(n+1)}) the KERE estimator trained on the n + 1 samples D_{n+1}; the estimates α̂_i for 1 ≤ i ≤ n + 1 are the corresponding coefficients in the representation of f̂_{(n+1)}.

Part I. We first show that the leave-one-out estimate is sufficiently close to the estimate fitted using all the training data. Without loss of generality, consider the case where the (n + 1)-th data point is removed; the same results apply to the other leave-one-out cases.

We show that the leave-one-out estimates are close to the full-data estimates, with an explicit bound whose expression is derived in the following. We first study the upper bound. By the definition of g in (30), we obtain an inequality between the leave-one-out and full-data solutions. Denote for simplicity α̂^{[n+1]}_{n+1} = 0. Applying Lemma 5 to both the LHS and the RHS of the above inequality, and combining it with the bound for |α̂_{n+1}| given by Lemma 7 (note that here α̂_{n+1} is trained on the n + 1 samples), we obtain (43). Combining (43) with Lemma 4, we obtain a bound valid for 1 ≤ i ≤ n. Next, we bound the difference between the leave-one-out and full-data intercepts. By the Lipschitz continuity of φ'_ω we obtain (46), and by applying (46) and (47) we obtain the upper bound (48). Similarly, by (41), (42), and (48) we obtain (49), where the second-to-last inequality follows from (41) and (42). Note that in this case the corresponding sample is the (n + 1)-th.

Now we only need to show that the remaining term tends to 0 as n → ∞. In the following analysis, C represents a generic constant that does not depend on n; its value may vary between expressions. Let V_i = q_1 ‖Y_{n+1}‖_1/(n + 1) + M(q_1 + 1)(q_2/λ) ‖Y_{n+1}‖_2 + |y_i|. Then, as n → ∞, 4M² < λ(n + 1)/(2n q_3), and we obtain the corresponding upper bound; since n > √λ asymptotically, the bound can be simplified further. Bounding V_i, combining the result with (61), and using the assumption E y_i² < D, we conclude that the remaining term tends to 0. This completes the proof of Theorem 1.
2. We obtain a lower bound (69) for F_ω,λ(α̂) and an upper bound (70) from the majorization Q(α̂ | α^{(k)}). Subtracting (69) from (70) and dividing by (α̂ − α^{(k)})' K_u (α̂ − α^{(k)}), we obtain (71). Both K_u and K_l are positive definite by the assumption that Σ_{i=1}^n K_i K_i' is positive definite, and since K_u is positive definite, the matrix K_u^{-1} K_l is similar to the symmetric matrix K_u^{-1/2} K_l K_u^{-1/2}. By (14) and (71) we have shown that 0 ≤ Λ_k ≤ Γ < 1.

Figure 4: Prediction error distributions for the Personal Computer Price Data using the linear expectile model and KERE. Box-plots show prediction errors based on 100 independent runs for expectiles ω ∈ {0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95}. The values are of the order of 10^{-3}.

Table 1: The averaged MADs and the corresponding standard errors of expectile regression predictions for single-covariate heteroscedastic models with mixed normal and Laplace errors. The models are fitted at five expectile levels ω ∈ {0.05, 0.2, 0.5, 0.8, 0.95}. The results are based on 300 independent runs.

Table 2: The averaged MADs and the corresponding standard errors for fitting homoscedastic and heteroscedastic models based on 300 independent runs.

Table 5: Explanatory variables in the Personal Computer Price Data.

When (t + t') and t are both positive or both negative, (53) follows from (t + t')² − t² = 2tt' + t'². When t + t' and t have different signs, it must be that |t'| > |t|, and we have |t'| = |t + t'| + |t|, hence |t + t'| < |t'|. Then (53) follows from the definition of φ_ω.