Asset Pricing via the Conditional Quantile Variational Autoencoder

Abstract We propose a new asset pricing model that is applicable to large panels of return data. The main idea of this model is to learn the conditional distribution of the return, approximated by a step distribution function constructed from conditional quantiles of the return. To study these conditional quantiles, we propose a new conditional quantile variational autoencoder (CQVAE) network. The CQVAE network specifies a factor structure for conditional quantiles, with latent factors learned from a VAE network and nonlinear factor loadings learned from a "multi-head" network. Under the CQVAE network, we allow observed covariates such as asset characteristics to guide the structure of the latent factors and factor loadings. Furthermore, we provide a two-step estimation procedure for the CQVAE network. Using the conditional distribution of the return learned from the CQVAE network, we build our asset pricing model from the mean of this distribution, and additionally use both the mean and variance of this distribution to select portfolios. Finally, we apply our CQVAE asset pricing model to analyze a large 60-year US equity return dataset. Compared with the benchmark conditional autoencoder (CAE) model, the CQVAE model not only delivers much larger out-of-sample total and predictive R²'s, but also earns Sharpe ratios at least 30.9% higher for both long-short and long-only portfolios.


Introduction
Following the arbitrage pricing theory of Ross (1976), the standard asset pricing model assumes that the excess return r_{i,t} of individual asset i at time t has a K-factor structure

r_{i,t} = β_i' f_t + u_{i,t},    (1)

where i = 1, ..., N, t = 1, ..., T, f_t is a K-dimensional vector of risk factors, β_i is a K-dimensional vector of factor exposures, and u_{i,t} is the idiosyncratic error. Model (1) conforms to the modern asset pricing theory that higher expected returns are compensated by factor risk exposures. When f_t is observed, the assumption of static betas in (1) has been challenged by empirical evidence in Ghysels (1998), Boguth et al. (2011), Gagliardini, Ossola, and Scaillet (2016), and many others. When f_t is latent, Kelly, Pruitt, and Su (2019) propose the instrumented principal component analysis (IPCA) method to estimate f_t, based on the following asset pricing model with dynamic betas

r_{i,t} = β(z_{i,t−1})' f_t + u_{i,t},    (2)

where the K-dimensional conditional factor exposure β(z_{i,t−1}) is a function of a P-dimensional vector of asset characteristics z_{i,t−1}, and P, which is potentially high-dimensional, is strictly greater than K. In model (2), the observed asset characteristics serve as instrumental variables in selecting latent factors and estimating dynamic betas. Natural benchmarks include model (1) with observed factors (e.g., Fama and French 1993, 2015, 2018) or factors estimated by the PCA (Bai and Ng 2002; Bai 2003), and model (2) with the linear beta function in (3) estimated by the IPCA (Kelly, Pruitt, and Su 2019). Going beyond the linear beta function, Gu, Kelly, and Xiu (2021) propose the conditional autoencoder (CAE) model, which allows a nonlinear beta function learned by an autoencoder (AE) network.
Although the CAE model is empirically successful, it overlooks many important stylized features of financial data, such as heavy-tailedness, conditional heteroscedasticity, and heterogeneity in quantiles. These features could help us better explain and predict the conditional mean of r_{i,t}. Motivated by this argument, we attempt to approximate the conditional distribution of r_{i,t}, denoted by F_{i,t}(r) = P(r_{i,t} ≤ r | z_{i,t−1}, f_t), via the step function

F*_{i,t}(r) = Σ_{j=1}^{J} τ*_j 1(Q*_{i,t,j} ≤ r < Q*_{i,t,j+1}),    (4)

where τ*_1, ..., τ*_J are predetermined (adjustable) fractions satisfying 0 < τ*_1 < ··· < τ*_{J−1} < τ*_J = 1, and {Q*_{i,t,j}} with Q*_{i,t,J+1} = ∞ are random variables nondecreasing in j. Clearly, F*_{i,t}(r) is a discrete distribution function almost surely. In this article, we select

Q*_{i,t,j} = Q_{i,t}(τ_j),    (5)

where τ_j = (τ*_{j−1} + τ*_j)/2 for j = 1, ..., J with τ*_0 = 0, and Q_{i,t}(τ) is the τth conditional quantile of r_{i,t} given z_{i,t−1} and f_t. This choice of Q*_{i,t,j} is motivated by the fact that, almost surely, the 1-Wasserstein metric between F_{i,t}(r) and F*_{i,t}(r) is uniquely minimized under (5) when the values of τ*_j, j = 1, ..., J, are given and F_{i,t}(r) is continuous (Dabney et al. 2018). To a certain extent, our nonparametric estimation strategy for F_{i,t}(r) avoids the risk of mis-specifying the conditional distribution function of r_{i,t}, a risk inherent to parametric approaches (Chen, Fernández-Val, and Weidner 2021).
Under (4)-(5), the step distribution function F*_{i,t}(r) becomes

F*_{i,t}(r) = Σ_{j=1}^{J} τ*_j 1(Q_{i,t}(τ_j) ≤ r < Q_{i,t}(τ_{j+1})),    (6)

with the convention Q_{i,t}(τ_{J+1}) = ∞. Our asset pricing model uses the mean of F*_{i,t}(r) in (6) to explain the conditional mean of r_{i,t}. Clearly, the mean of F*_{i,t}(r) boils down to Q_{i,t}(τ) for τ = τ_1, ..., τ_J. To achieve this goal, we propose a new nonlinear quantile factor model with dynamic loadings defined as

Q_{i,t}(τ) = β_τ(z_{i,t−1})' f_t,    (7)

where f_t is a K-dimensional vector of latent factors, and β_τ(·) is defined in the same way as β(·) in (2), except that it is quantile-dependent. To study the J different conditional quantiles Q_{i,t}(τ) for τ = τ_1, ..., τ_J in (7), we introduce a new conditional quantile variational AE (CQVAE) neural network. Our CQVAE network contains two modules. Its first module is a factor network, which learns f_t using the VAE method (Kingma and Welling 2014). The VAE is an artificial neural network that approximates the true posterior probability of latent variables by a user-chosen variational probability. In the factor network, we assume the N-dimensional vector of cross-sectional returns r_t = (r_{1,t}, ..., r_{N,t})' is driven by a K-dimensional vector of latent variables s_t, and approximate the true posterior probability of s_t by a normal variational probability whose mean and variance are generated by a neural network with r_t as the input.
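As an illustrative sketch of the step-distribution construction in (4)-(5), the snippet below builds F* from the quantiles of a simulated return series, choosing τ_j as the midpoints (τ*_{j−1} + τ*_j)/2; the simulated data and J = 5 grid are assumptions for illustration, not the paper's estimator.

```python
import numpy as np

# Build the step distribution F* of (4)-(5) from the quantiles of a
# simulated return series, with tau_j the midpoints (tau*_{j-1}+tau*_j)/2.
rng = np.random.default_rng(0)
r = rng.standard_normal(100_000)          # stand-in for one asset's returns

J = 5
tau_star = np.arange(1, J + 1) / J        # tau*_j = j/J, so tau*_J = 1
tau_prev = np.concatenate(([0.0], tau_star[:-1]))
tau_mid = (tau_prev + tau_star) / 2       # W1-optimal interior levels

Q = np.quantile(r, tau_mid)               # Q*_{i,t,j} = Q_{i,t}(tau_j)
weights = tau_star - tau_prev             # mass at each step, here 1/J

# The mean of the step distribution approximates the conditional mean.
mu_star = np.sum(weights * Q)
```

Because the step masses τ*_j − τ*_{j−1} sum to one, F* is a proper discrete distribution, and its mean is a simple weighted sum of the J quantiles.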
Since K is often much smaller than N, the VAE achieves dimension reduction by carrying the compressed information of the high-dimensional r_t through a low-dimensional normal variational probability, whose mean is then taken as our latent factor f_t. In particular, our factor network allows the asset characteristics to guide the structure of f_t when the input r_t is replaced by a set of managed portfolios, that is, portfolios re-weighted by the asset characteristics (Kozak, Nagel, and Santosh 2020).
The second module of the CQVAE network is a beta network, which uses the "multi-head" network structure (Mnih et al. 2015) to learn J different beta functions β_τ(z_{i,t−1}) for τ = τ_1, ..., τ_J. In the beta network, the asset characteristics {z_{i,t−1}}_{i=1}^N at time t−1 are the input, and all β_τ(z_{i,t−1}) are the output of a hidden layer whose input transformation uses a quantile-independent weight matrix and bias vector, while its output transformation uses J different quantile-dependent weight matrices and bias vectors. Through this beta network, a nonlinear mapping from asset characteristics to J different quantile-dependent beta functions is specified, and together with f_t from the factor network, the CQVAE network enables dimension reduction across quantile levels. In line with the two-module structure of the CQVAE network, a two-step estimation procedure is given to estimate the latent factor f_t and the nonlinear beta functions β_τ(z_{i,t−1}) for τ = τ_1, ..., τ_J.
When Q_{i,t}(τ) for τ = τ_1, ..., τ_J satisfy the CQVAE network, the related model constructed from the mean of F*_{i,t}(r) in (6) is called the CQVAE asset pricing model. The core idea of the CQVAE model is to learn the conditional mean of r_{i,t} from the conditional distribution of r_{i,t}. This differs from existing asset pricing models, which apply linear/nonlinear models to learn the conditional mean of r_{i,t} directly (Fama and French 1993, 2015, 2018; Griffin 2002; Hou, Karolyi, and Kho 2011; Kelly, Pruitt, and Su 2019; Gu, Kelly, and Xiu 2021). Studying the conditional distribution of r_{i,t} through F*_{i,t}(r) in (6) gives the CQVAE model two advantages. First, the CQVAE model accounts for the aforementioned stylized features through F*_{i,t}(r), so it is expected to perform well under various complex scenarios. Second, as a byproduct, the CQVAE model can also explain the conditional variance and quantiles of r_{i,t} through F*_{i,t}(r), which are crucial elements for selecting portfolios under the mean-variance criterion (Markowitz 1952) or for investigating well-known quantile-based risk measures such as Value-at-Risk and Expected Shortfall (McNeil, Frey, and Embrechts 2015). Compared with the CAE model in Gu, Kelly, and Xiu (2021), the CQVAE model has an additional advantage in alleviating overfitting. This is because the VAE used by the CQVAE model regularizes the latent space via the normal variational probability, whereas the AE adopted by the CAE model imposes no constraints on the latent space. See, for example, Doersch (2016) for more discussion of this aspect.
We apply our CQVAE model to study all monthly US equity returns from March 1957 to December 2016, a large dataset covering N = 31,925 stocks over T = 720 months. This dataset also contains P = 94 different asset characteristics for each stock in every month, and these asset characteristics are used to guide the structure of the latent factors and beta functions. Our analysis shows that the CQVAE model significantly outperforms the CAE model on several out-of-sample evaluations. Specifically, from the statistical viewpoint, the best out-of-sample total and predictive R² (Kelly, Pruitt, and Su 2019) for individual returns from the CQVAE model are 17.7% and 1.86%, which are 34.6% and 181.8% higher than those from the CAE model, respectively. Moreover, from the economic viewpoint, the CQVAE model earns the best annualized Sharpe ratios of 3.25 and 2.22 for the equal-weighted long-short and long-only portfolios, respectively, whereas the respective Sharpe ratios are 2.41 and 1.68 for the CAE model. In particular, the CQVAE model has a much more stable out-of-sample performance than the CAE model with respect to the value of K (i.e., the number of latent factors). This stability indicates that the CQVAE model balances the tradeoff between model flexibility and implementation difficulty well, since it loses only marginal model flexibility by taking a small value of K.
In its own right, model (7) contributes to a burgeoning literature on panel quantile models, which aim to analyze the quantile co-movement of a large number of financial asset returns. A closely related specification is

Q_{i,t}(τ) = z_{i,t−1}' b_{i,τ} + λ_τ(z_{i,t−1})' f̃_t,    (8)

where b_{i,τ} is a P-dimensional vector of regression coefficients, f̃_t is a K̃-dimensional vector of latent factors with K̃ = K − 1, and λ_τ(·) is a K̃-dimensional vector of functional factor loadings. Due to the presence of the interactive term λ_τ(z_{i,t−1})' f̃_t, model (8) can be viewed as a panel quantile regression with interactive effects, which nests the panel quantile regressions in Koenker (2004) and Kato, Galvao, and Montes-Rojas (2012).
In this model, the individual effects λ_τ(z_{i,t−1}) in the interactive term are functions of the time-varying covariates z_{i,t−1}. These individual effects are more general than those in the panel quantile regression of Ando and Bai (2020), which assumes individual fixed effects (say, λ_{τ,i}) in the interactive term, as similarly done by Bai (2009). In particular, if the linear term z_{i,t−1}' b_{i,τ} is absent, model (8) is indeed a semiparametric quantile factor model, which nests the quantile factor model in Chen, Dolado, and Gonzalo (2021). Compared with the existing semiparametric quantile factor model in Ma, Linton, and Gao (2021), the main difference is that model (8) allows each factor loading (that is, each entry of λ_τ(·)) to be a function of many time-varying covariates, whereas the model in Ma, Linton, and Gao (2021) only allows it to be a function of one univariate time-invariant covariate. Note that we restrict the latent factors in model (8) (or model (7)) to be quantile-independent to reduce the computational burden; with slight modifications, our model can be extended to allow quantile-dependent latent factors as in Ando and Bai (2020), Chen, Dolado, and Gonzalo (2021), and Ma, Linton, and Gao (2021).
Clearly, the advantage of model (7) over the models in Ando and Bai (2020), Chen, Dolado, and Gonzalo (2021), and Ma, Linton, and Gao (2021) is its general structure of factor loadings. This general structure is empirically motivated by Gu, Kelly, and Xiu (2021), and it enhances the capability of mining hidden information from a large set of covariates with machine learning methods. Although model (7) has a general specification, it lacks theoretical results ensuring the consistency of its machine learning estimation strategy. This drawback is common to most machine learning methods; we gain model flexibility as a tradeoff. To overcome this drawback, one would have to sacrifice some model flexibility by restricting the structure of factor loadings as in Ando and Bai (2020), Chen, Dolado, and Gonzalo (2021), and Ma, Linton, and Gao (2021), so that transparent asymptotic results on the consistency of model estimation can be obtained.
The remainder of the article is organized as follows. Section 2 presents our entire methodology, including the architecture of the CQVAE network, the estimation of the CQVAE network, the formal specification of the CQVAE asset pricing model, and some regularization implementation details. Section 3 gives our empirical studies on the U.S. equity market. Concluding remarks are offered in Section 4. Some additional data analysis and simulation studies are deferred to the supplementary materials.

The CQVAE Network for Conditional Quantiles
As shown in (4)-(5), the estimation of F*_{i,t}(r) depends on how we estimate the J different conditional quantiles Q_{i,t}(τ) for τ = τ_1, ..., τ_J in (7). To achieve this goal, we design a new CQVAE neural network for these J different conditional quantiles.
Our CQVAE network amounts to a combination of two modules. In the first module, we assume that the vector of cross-sectional returns r_t ∈ R^{N×1} is driven by a vector of latent variables s_t ∈ R^{K×1}, and apply the VAE to learn the latent factor f_t from the approximated posterior distribution of s_t. Specifically, r_t as the input of the VAE has the marginal likelihood

p_θ(r_t) = ∫ p_θ(r_t | s_t) p_θ(s_t) ds_t.    (9)

In the language of coding theory, the latent variable s_t is regarded as a code, and q_φ(s_t | r_t) is a probabilistic encoder, since given r_t it produces a distribution over the possible values of the code s_t from which this r_t could have been generated. Let N(x; μ, Σ) denote the density of the K-dimensional multivariate normal distribution with mean vector μ ∈ R^{K×1} and variance-covariance matrix Σ ∈ R^{K×K}. We follow Kingma and Welling (2014) to choose

q_φ(s_t | r_t) = N(s_t; μ(r_t; φ), diag(σ²(r_t; φ))),    (10)

where μ(r_t; φ) = (μ_1(r_t; φ), ..., μ_K(r_t; φ))' and σ²(r_t; φ) = (σ²_1(r_t; φ), ..., σ²_K(r_t; φ))' are outputs of a neural network with one hidden layer and input r_t:

μ(r_t; φ) = W_2 tanh(W_1 r_t + b_1) + b_2,  log σ²(r_t; φ) = W_3 tanh(W_1 r_t + b_1) + b_3,    (11)

where W_1 ∈ R^{d_1×N} and W_2, W_3 ∈ R^{K×d_1} are weight matrices, b_1 ∈ R^{d_1×1} and b_2, b_3 ∈ R^{K×1} are vectors of bias parameters, d_1 is the number of units in the hidden layer, φ contains all the parameters in {W_i, b_i}_{i=1}^3, and tanh(·) is the entry-wise vector-valued activation function. Based on (10), we can use the so-called "reparameterization trick" of Kingma and Welling (2014) to sample s_t by

s_t = μ(r_t; φ) + diag(σ²(r_t; φ))^{1/2} ε_t,  ε_t ∼ N(0, I_K).

That is, we can sample s_t by just sampling ε_t, where ε_t is viewed as the stochastic input of the VAE with a standard multivariate normal distribution that does not depend on any unknown parameters. As discussed in Kingma and Welling (2014), the "reparameterization trick" is key to implementing stochastic gradient descent via backpropagation for the parameter estimation.
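A minimal numpy sketch of the encoder and the reparameterization trick described above; the dimensions (N = 30 assets, K = 3 factors, d_1 = 8 hidden units) and the random weight initialization are illustrative assumptions, not the paper's trained network.

```python
import numpy as np

# Encoder: one tanh hidden layer mapping r_t to the mean and log-variance
# of the normal variational probability q(s_t | r_t).
rng = np.random.default_rng(1)
N, K, d1 = 30, 3, 8

W1, b1 = rng.normal(0, 0.1, (d1, N)), np.zeros(d1)
W2, b2 = rng.normal(0, 0.1, (K, d1)), np.zeros(K)   # mean head
W3, b3 = rng.normal(0, 0.1, (K, d1)), np.zeros(K)   # log-variance head

def encode(r_t):
    h = np.tanh(W1 @ r_t + b1)
    mu = W2 @ h + b2
    log_var = W3 @ h + b3
    return mu, np.exp(log_var)

def sample_s(r_t, rng):
    # s_t = mu + diag(sigma^2)^{1/2} eps, eps ~ N(0, I_K): all randomness
    # enters through eps, so gradients can flow through mu and sigma.
    mu, var = encode(r_t)
    eps = rng.standard_normal(mu.shape)
    return mu + np.sqrt(var) * eps

r_t = rng.standard_normal(N)
s_t = sample_s(r_t, rng)
```

Because ε_t alone is random, repeated draws of s_t differ only through ε_t while μ and σ² stay differentiable functions of the weights.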
Accompanying the probabilistic encoder q_φ(s_t | r_t), the VAE must simultaneously specify a probabilistic decoder p_θ(r_t | s_t), which produces a distribution over the possible values of r_t corresponding to a given code s_t. As in Kingma and Welling (2014), we choose

p_θ(r_t | s_t) = N(r_t; μ(s_t; θ), σ² I_N),    (12)

where the hyper-parameter σ² can be trained but is conventionally set to 1/2, and μ(s_t; θ) is the output of a neural network with one hidden layer and input s_t:

μ(s_t; θ) = W_5 tanh(W_4 s_t + b_4) + b_5,    (13)

where W_4 ∈ R^{d_2×K} and W_5 ∈ R^{N×d_2} are weight matrices, b_4 ∈ R^{d_2×1} and b_5 ∈ R^{N×1} are vectors of bias parameters, d_2 is the number of units in the hidden layer, and θ contains all the parameters in {W_i, b_i}_{i=4}^5. Under (12), μ(s_t; θ) is the conditional mean of r_t given s_t, and it is taken as the output of the VAE, which attempts to approximate the input r_t under the L_2 loss with certain regularization (see Section 2.2 for more details).
The architecture of the VAE described above is shown on the left side of Figure 1, while that of an AE with one hidden layer is given on the right side as a comparison. As illustrated in Figure 1, both the VAE and the AE aim to compress the high-dimensional input into a low-dimensional set of neurons in one hidden layer (encoding), which is then unpacked and mapped to the output layer (decoding). In the VAE, the mean and variance neurons in the corresponding hidden layer represent the posterior distribution of the latent variables, which carries the compressed information of the input. In the AE, by contrast, the neurons in the hidden layer are the latent variables themselves, and together they form a compressed representation of the input. Due to this difference, we take the mean neurons as our latent factors f_t in (7), whereas Gu, Kelly, and Xiu (2021) use the neurons in the hidden layer of the AE as the latent factors in their CAE model. It is worth noting that this difference also lets the VAE largely circumvent the overfitting problem of the AE. This is because the VAE regularizes the latent space (via q_φ(s_t | r_t) in (10)) to have an interpretable and exploitable structure, whereas the AE imposes no constraints on the latent space, so it tends to reconstruct data without information loss, causing a severe overfitting problem.
Next, we introduce the second module of our CQVAE network. In this module, we specify the "multi-head" structure on β_τ(z_{i,t−1}). Specifically, β_τ(z_{i,t−1}) = β_τ(z_{i,t−1}; ψ), for τ = τ_1, ..., τ_J, are indexed by a vector of unknown parameters ψ, and they are the outputs of a neural network with one hidden layer and input z_{i,t−1}:

β_τ(z_{i,t−1}; ψ) = W_{6,τ} ReLU(W_7 z_{i,t−1} + b_7) + b_{6,τ},  τ = τ_1, ..., τ_J,    (14)

where W_7 ∈ R^{d_3×P} and b_7 ∈ R^{d_3×1} are the quantile-independent weight matrix and bias vector of the hidden layer, W_{6,τ} ∈ R^{K×d_3} and b_{6,τ} ∈ R^{K×1} are the J quantile-dependent weight matrices and bias vectors, and ψ contains all of these parameters.

Figure 2. The beta network (left panel) describes how the J beta functions {β_{τ_1}(z_{t−1}), ..., β_{τ_J}(z_{t−1})} (in green) depend on a P-dimensional vector of asset characteristics z_{t−1} (in purple) through a "multi-head" neural network with one hidden layer (in gray). Each row of green nodes represents a K-dimensional vector of factor loadings β_{τ_j}(z_{t−1}) at one quantile level τ_j. The factor network (right panel) describes how the latent factors f_t are obtained from an N-dimensional vector of individual asset returns r_t (in orange) via the probabilistic encoder of the VAE (in gray). The pink nodes in the output layer are J different conditional quantiles, computed by multiplying each row from the beta network with the vector of latent factors (in yellow) from the factor network.
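A numpy sketch of this "multi-head" beta network: one shared hidden layer for the characteristics, then J quantile-specific output heads. The sizes (P = 94, K = 3, J = 5, d_3 = 16) and random weights are illustrative assumptions.

```python
import numpy as np

# Multi-head beta network: shared ReLU hidden layer, J output heads.
rng = np.random.default_rng(2)
P, K, J, d3 = 94, 3, 5, 16

W_hid, b_hid = rng.normal(0, 0.1, (d3, P)), np.zeros(d3)   # quantile-independent
W_out = rng.normal(0, 0.1, (J, K, d3))                     # one head per tau_j
b_out = np.zeros((J, K))

def beta_all_taus(z):
    h = np.maximum(W_hid @ z + b_hid, 0.0)           # shared ReLU layer
    return np.einsum('jkd,d->jk', W_out, h) + b_out  # (J, K) loadings

z = rng.standard_normal(P)
betas = beta_all_taus(z)                  # row j is beta_{tau_j}(z)
f_t = rng.standard_normal(K)
q = betas @ f_t                           # J conditional quantiles
```

The shared hidden layer is what ties the J heads together and delivers the dimension reduction across quantile levels mentioned above.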
Combining the two modules above, Figure 2 displays the architecture of our CQVAE network. Its left side is the beta network, which models J different nonlinear quantile-dependent beta functions of asset characteristics, while its right side is the factor network, which models the latent factors by the posterior mean of the latent variables in the VAE. Finally, the "dotted operation" multiplies each beta function from the beta network with the vector of latent factors from the factor network to produce J different conditional quantiles as the outputs of our CQVAE network.
So far, our CQVAE network only uses the information in asset characteristics to guide the structure of the beta functions. To further allow the asset characteristics to guide the structure of the latent factors, we can follow Gu, Kelly, and Xiu (2021) and replace the input r_t by a P × 1 vector of managed portfolios, defined as

x_t = (Z_{t−1}' Z_{t−1})^{−1} Z_{t−1}' r_t,    (15)

where Z_{t−1} is an N × P matrix with ith row z_{i,t−1}', and x_t contains P different portfolios that are dynamically re-weighted (or "managed") based on the asset characteristics. Similar to the discussion in Gu, Kelly, and Xiu (2021), using x_t as the input has three advantages. First, the number of parameters in the CQVAE network is reduced significantly, since the dimension of x_t is much smaller than that of r_t. Second, the unbalancedness of the return panel {r_{i,t}} caused by stocks missing at some dates is largely bypassed. Third, x_t is expected to be informative for the latent factors, since characteristic-managed portfolios have proven important in recasting conditional linear factor models (Kelly, Pruitt, and Su 2019; Feng, Giglio, and Xiu 2020; Kozak, Nagel, and Santosh 2020; Giglio and Xiu 2021).
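A sketch of the characteristic-managed portfolios: x_t compresses the N-vector of returns into P portfolio returns re-weighted by the lagged characteristics. The data are simulated, and the cross-sectional-regression normalization shown is one common construction; a simpler alternative is x_t = Z_{t−1}' r_t / N.

```python
import numpy as np

# Managed portfolios: project returns onto lagged characteristics.
rng = np.random.default_rng(3)
N, P = 500, 94
Z_lag = rng.standard_normal((N, P))       # characteristics at t-1
r_t = rng.standard_normal(N)

# x_t = (Z'Z)^{-1} Z' r_t: the coefficient vector of a cross-sectional
# regression of r_t on the characteristics.
x_t = np.linalg.solve(Z_lag.T @ Z_lag, Z_lag.T @ r_t)
```

Note that x_t has dimension P regardless of how many stocks are present at date t, which is why the unbalanced-panel issue is bypassed.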

The Estimation of the CQVAE Network
In accordance with the two-module structure of the CQVAE network, we estimate this network by a two-step estimation procedure. Without loss of generality, we assume below that the input of the factor network is r_t, since a similar estimation procedure holds when the input is x_t in (15). At the first step, we estimate the latent factors (i.e., the mean neurons in the VAE) by using the evidence lower bound (ELBO) method in Kingma and Welling (2014). To elaborate the idea of the ELBO, we consider the following Kullback-Leibler (KL) divergence of q_φ(s_t | r_t) from p_θ(s_t | r_t),

D_KL(q_φ(s_t | r_t) ‖ p_θ(s_t | r_t)),    (16)

where D_KL(p_1 ‖ p_2) is the KL divergence of probability p_1 from probability p_2. Clearly, the VAE ought to minimize the KL divergence in (16), which measures the approximation error from replacing p_θ(s_t | r_t) by q_φ(s_t | r_t). Using (9) and rearranging the terms in (16), we get

log p_θ(r_t) − D_KL(q_φ(s_t | r_t) ‖ p_θ(s_t | r_t)) = E_{q_φ(s_t|r_t)}[log p_θ(r_t | s_t)] − D_KL(q_φ(s_t | r_t) ‖ p_θ(s_t)).    (17)

Since D_KL(q_φ(s_t | r_t) ‖ p_θ(s_t | r_t)) ≥ 0, the right-hand side of (17) is called the ELBO on the marginal likelihood of datapoint t. From (17), we know that maximizing the ELBO enables us to minimize D_KL(q_φ(s_t | r_t) ‖ p_θ(s_t | r_t)) while simultaneously maximizing log p_θ(r_t).
Next, we show how to estimate the ELBO. Following Kingma and Welling (2014), we take the prior probability of s_t as

p_θ(s_t) = N(s_t; 0, I_K).    (18)

Under (10) and (18), the second term of the ELBO in (17) has the closed form

−D_KL(q_φ(s_t | r_t) ‖ p_θ(s_t)) = (1/2) Σ_{k=1}^{K} [1 + log σ²_k(r_t; φ) − μ²_k(r_t; φ) − σ²_k(r_t; φ)];    (19)

however, the first term of the ELBO in (17) does not have a closed form and has to be estimated by the Monte Carlo method:

E_{q_φ(s_t|r_t)}[log p_θ(r_t | s_t)] ≈ (1/L) Σ_{l=1}^{L} log p_θ(r_t | s_t^{(l)}),    (20)

where s_t^{(l)} = μ(r_t; φ) + diag(σ²(r_t; φ))^{1/2} ε_t^{(l)} with ε_t^{(l)} ∼ N(0, I_K). Combining (12) and (19)-(20), we obtain the Stochastic Gradient Variational Bayes (SGVB) estimator of the ELBO for datapoint t as −ℓ̄(r_t; θ, φ) + a constant, where

ℓ̄(r_t; θ, φ) = (1/L) Σ_{l=1}^{L} ‖r_t − μ(s_t^{(l)}; θ)‖² − (1/2) Σ_{k=1}^{K} [1 + log σ²_k − μ²_k − σ²_k],    (21)

with μ_k and σ²_k (the entries of μ(r_t; φ) and σ²(r_t; φ)) defined in (11) and μ(s_t^{(l)}; θ) defined in (13).
As argued before, our goal is to estimate θ and φ by maximizing the ELBO, which can now be fulfilled by minimizing L(θ, φ) = Σ_{t=1}^{T} ℓ̄(r_t; θ, φ), based on the SGVB estimators of the ELBO for all datapoints. To optimize L(θ, φ), we adopt the adaptive moment estimation (Adam) algorithm of Kingma and Ba (2015) and evaluate the gradient from a minibatch (that is, a small random subset of the full data) with batch size M at each iteration. Since the batch size M is usually large enough, the optimization of L(θ, φ) can be effectively implemented by simply taking L = 1 (Kingma and Welling 2014). Therefore, we set L = 1 and omit the superscript (l) in (21) to construct our SGVB estimators of θ and φ:

(θ̂, φ̂) = argmin_{θ,φ} Σ_{t=1}^{T} { ‖r_t − μ(s_t; θ)‖² − (1/2) Σ_{k=1}^{K} [1 + log σ²_k(r_t; φ) − μ²_k(r_t; φ) − σ²_k(r_t; φ)] },    (22)

where s_t = μ(r_t; φ) + diag(σ²(r_t; φ))^{1/2} ε_t with ε_t ∼ N(0, I_K), and θ̂ and φ̂ are computed according to the steps in Algorithm 1. Using φ̂, we are now ready to estimate the latent factor f_t in (7) by

f̂_t = μ(r_t; φ̂).    (23)

Note that the objective function in (22) contains two terms. The first term measures the reconstruction error r_t − μ(s_t; θ) under the L_2 loss, while the second term, coming from the KL divergence D_KL(q_φ(s_t | r_t) ‖ p_θ(s_t)), acts as a regularizer. The presence of this intrinsic regularizer balances model complexity and reconstruction error, shedding light on the advantage of the VAE over the AE in alleviating overfitting.
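The per-datapoint objective just described can be sketched as squared reconstruction error plus the closed-form KL divergence of N(μ, diag(σ²)) from N(0, I_K); the inputs below are placeholder arrays, not estimated quantities.

```python
import numpy as np

# Per-datapoint SGVB-style loss: L2 reconstruction + Gaussian KL regularizer.
def sgvb_loss(r_t, r_hat, mu, var):
    recon = np.sum((r_t - r_hat) ** 2)                    # reconstruction error
    # KL( N(mu, diag(var)) || N(0, I) ) = 0.5 * sum(mu^2 + var - log(var) - 1)
    kl = 0.5 * np.sum(mu**2 + var - np.log(var) - 1.0)
    return recon + kl

r_t = np.array([0.1, -0.2, 0.05])
r_hat = np.array([0.08, -0.15, 0.0])
mu = np.zeros(2)
var = np.ones(2)
loss = sgvb_loss(r_t, r_hat, mu, var)     # KL term vanishes at mu=0, var=1
```

The KL term is zero exactly when the variational posterior equals the prior, so it penalizes latent codes that drift away from N(0, I_K).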
Algorithm 1 Minibatch version of the SGVB estimators.
We should highlight that the multivariate normal q_φ(s_t | r_t) is the key to getting a simple closed form of D_KL(q_φ(s_t | r_t) ‖ p_θ(s_t)), leading to the easy-to-implement loss function in (22). One may choose some multivariate nonnormal distribution for q_φ(s_t | r_t) to capture stylized features of returns such as skewness and heavy-tailedness. However, this leads to several serious deficiencies in estimation and prediction. For example, if we assign a spike-and-slab distribution to q_φ(s_t | r_t), the calculation of stochastic gradient descent via backpropagation to estimate the VAE becomes infeasible due to the presence of the Dirac function. Meanwhile, if we assign a continuous nonnormal distribution (e.g., the multivariate t distribution) to q_φ(s_t | r_t), a Monte Carlo approximation of D_KL(q_φ(s_t | r_t) ‖ p_θ(s_t)) is inevitably needed. This approximation makes the estimation unstable, so the resulting CQVAE network would most likely deliver unsatisfactory predictions. Note that our simulation studies in the supplementary materials show that the CQVAE network with the multivariate normal q_φ(s_t | r_t) performs well even when the factors (and returns) are nonnormally distributed. Therefore, using the multivariate normal q_φ(s_t | r_t) to form the CQVAE network not only brings convenience in estimation, but also provides the desirable capacity to handle nonnormal returns.
At the second step, we estimate the dynamic factor loadings based on the estimated latent factor f̂_t. Since Q_{i,t}(τ) in (7) is the τth conditional quantile of r_{i,t} given z_{i,t−1} and f_t, it is natural to estimate ψ in (14) by the following regularized quantile estimator

ψ̂ = argmin_ψ Σ_{j=1}^{J} Σ_{t=1}^{T} Σ_{i=1}^{N} ρ_{τ_j}(r_{i,t} − β_{τ_j}(z_{i,t−1}; ψ)' f̂_t) + λ ‖ψ‖_1,    (24)

where ρ_τ(x) = x[τ − 1(x < 0)] is the check function (Koenker and Bassett 1978), ‖ψ‖_1 is the l_1 penalty function that regularizes the beta functions, and λ > 0 is a tuning parameter. Algorithm 2 shows the details of how to compute ψ̂. Here, due to the massive data volume, we apply the "batch normalization" technique (Ioffe and Szegedy 2015) to control the variability of network inputs across minibatches, and adopt the Adam algorithm as in Algorithm 1 to solve the optimization problem in (24). Using ψ̂, we then estimate β_τ(z_{i,t−1}) by β̂_τ(z_{i,t−1}) = β_τ(z_{i,t−1}; ψ̂) for τ = τ_1, ..., τ_J. Together with f̂_t in (23), we obtain

Q̂_{i,t}(τ) = β̂_τ(z_{i,t−1})' f̂_t,  τ = τ_1, ..., τ_J,    (25)

which are our two-step estimators of Q_{i,t}(τ) in (7).
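The regularized quantile objective can be sketched as the check function averaged over assets, dates, and quantile levels, plus an l_1 penalty on the network parameters; the residuals and parameter vector below are simulated stand-ins.

```python
import numpy as np

# Check (pinball) loss rho_tau(x) = x * (tau - 1{x < 0}).
def check_loss(u, tau):
    return u * (tau - (u < 0).astype(float))

rng = np.random.default_rng(4)
taus = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
resid = rng.standard_normal((100, len(taus)))   # r_{i,t} - Q_hat_{i,t}(tau_j)
psi = rng.standard_normal(50)                   # stand-in for network weights
lam = 0.01

# Quantile loss over all observations and levels + l1 regularizer on psi.
objective = check_loss(resid, taus).mean() + lam * np.sum(np.abs(psi))
```

The asymmetry of ρ_τ is what makes the minimizer a conditional quantile rather than a conditional mean: under-predictions are weighted by τ and over-predictions by 1 − τ.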
Algorithm 2 Minibatch version of the regularized quantile estimator.

The CQVAE Asset Pricing Model
Our CQVAE asset pricing model relies on using the step distribution function F*_{i,t}(r) in (4) to approximate F_{i,t}(r). To measure the corresponding approximation error, we consider the 1-Wasserstein (W_1) metric between F_{i,t}(r) and F*_{i,t}(r), which is uniquely minimized by the selection in (5). Thus, when the values of τ*_j, j = 1, ..., J, are given, we can form the optimal F*_{i,t}(r) in (6) to approximate F_{i,t}(r) under the W_1 metric; see the illustrative example in Figure 3. In this case, the conditional mean of r_{i,t} can be approximated by the mean of the distribution F*_{i,t}(r) in (6), defined as

μ*_{i,t} = Σ_{j=1}^{J} (τ*_j − τ*_{j−1}) Q_{i,t}(τ_j).    (26)

Formally, our CQVAE asset pricing model uses μ*_{i,t} in (26) to explain the conditional mean of r_{i,t}, where the conditional quantiles Q_{i,t}(τ) for τ = τ_1, ..., τ_J satisfy the CQVAE network structure. Plugging in the estimators of Q_{i,t}(τ) for τ = τ_1, ..., τ_J in (25), μ*_{i,t} can be directly estimated by

μ̂*_{i,t} = Σ_{j=1}^{J} (τ*_j − τ*_{j−1}) Q̂_{i,t}(τ_j),    (27)

where it is convenient to set τ*_j = j/J for j = 0, ..., J. Unlike existing asset pricing models, our CQVAE model explains the conditional mean of r_{i,t} (through μ*_{i,t}) based on an approximated conditional distribution of r_{i,t} in (6) rather than some linear/nonlinear conditional mean specification of r_{i,t}. The exploration of the conditional distribution of r_{i,t} gives us three major advantages. First, the resulting CQVAE asset pricing model accounts for many important stylized features of r_{i,t}, such as heavy-tailedness, conditional heteroscedasticity, and heterogeneity in quantiles, which cannot be captured adequately by models based on linear/nonlinear conditional mean specifications. Second, the conditional variance of r_{i,t} can be approximated by

h*_{i,t} = Σ_{j=1}^{J} (τ*_j − τ*_{j−1}) (Q_{i,t}(τ_j) − μ*_{i,t})²,    (28)

which is the variance of the distribution F*_{i,t}(r) in (6) and can be estimated in the same way as in (27). Using the estimators of μ*_{i,t} and h*_{i,t}, we are able to select portfolios under the mean-variance criterion. Third, the conditional quantiles Q_{i,t}(τ) in (6) and their estimators are useful for studying many quantile-based risk measures (e.g., Value-at-Risk and Expected Shortfall), which are well-known important tools for risk management. The portfolio selection method and quantile-based risk measures above are attractive in many ways: they work for very large values of T and N, avoid the formulation and estimation of high-dimensional conditional mean-variance time series models, and take the information in asset characteristics (or other observed covariates) into account.
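The mean and variance of the step distribution can be sketched directly from J estimated quantiles; the quantile values below are illustrative placeholders, not CQVAE estimates.

```python
import numpy as np

# Mean mu* and variance h* of the step distribution F* recovered from
# J conditional quantiles, with tau*_j = j/J.
J = 5
tau_star = np.arange(1, J + 1) / J
w = np.diff(np.concatenate(([0.0], tau_star)))     # tau*_j - tau*_{j-1}

Q_hat = np.array([-1.28, -0.52, 0.0, 0.52, 1.28])  # one asset's quantiles

mu_star = np.sum(w * Q_hat)                        # approx. conditional mean
h_star = np.sum(w * (Q_hat - mu_star) ** 2)        # approx. conditional variance
```

With μ* and h* in hand, one can, for example, rank assets by a mean-variance style signal such as μ*/√h* when forming portfolios.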
One natural extension of our CQVAE asset pricing model is to specify a more general network for Q_{i,t}(τ) in (7). So far, our CQVAE network accounts for the quantile effect only through the beta functions but not the latent factors. This setting differs from the quantile panel models in Ando and Bai (2020), Chen, Dolado, and Gonzalo (2021), and Ma, Linton, and Gao (2021), where the latent factors are assumed to be quantile-dependent. Thus, a more general specification of Q_{i,t}(τ) can be set as

Q_{i,t}(τ) = β_τ(z_{i,t−1})' f_{t,τ},    (29)

where β_τ(·) has the beta network structure as before, while f_{t,τ} is quantile-dependent. To learn f_{t,τ}, we can modify the VAE network in Section 2.1 by using a different probabilistic decoder

p_θ(r_t | s_t) = Π_{i=1}^{N} AL(r_{i,t}; μ_i(s_t; θ), τ, σ),    (30)

where μ(s_t; θ) is defined as in (13) with ith entry μ_i(s_t; θ), σ > 0 is a hyper-parameter, and AL(x; x_0, τ, σ) is the asymmetric Laplace density (Komunjer 2005; Bera et al. 2015) defined as

AL(x; x_0, τ, σ) = [τ(1 − τ)/σ] exp(−ρ_τ((x − x_0)/σ)).

Under (30) with σ = 1 for simplicity, we can adopt a similar idea as for (22) to propose the SGVB quantile estimators of θ and φ:

(θ̂_τ, φ̂_τ) = argmin_{θ,φ} Σ_{t=1}^{T} { Σ_{i=1}^{N} ρ_τ(r_{i,t} − μ_i(s_t; θ)) − (1/2) Σ_{k=1}^{K} [1 + log σ²_k(r_t; φ) − μ²_k(r_t; φ) − σ²_k(r_t; φ)] }.    (31)

In this case, the above estimation method indicates that the output μ(s_t; θ) attempts to approximate the input r_t under the entry-wise τth quantile loss with the same regularization as in (22).
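The asymmetric Laplace density behind this extended decoder can be sketched as follows; maximizing its log-likelihood in the location x_0 is equivalent to minimizing the check loss, which is what makes the resulting factors quantile-specific. The parameter values and grid are illustrative assumptions.

```python
import numpy as np

# Asymmetric Laplace density AL(x; x0, tau, sigma) built from the check
# function rho_tau(u) = u * (tau - 1{u < 0}).
def rho(u, tau):
    return u * (tau - (u < 0).astype(float))

def al_density(x, x0, tau, sigma):
    return tau * (1.0 - tau) / sigma * np.exp(-rho((x - x0) / sigma, tau))

x = np.linspace(-40.0, 40.0, 10001)
dens = al_density(x, 0.0, 0.25, 1.0)
area = np.sum(dens) * (x[1] - x[0])       # Riemann check: should be near 1
```

For τ ≠ 1/2 the density is skewed: mass τ lies below the mode x_0 and mass 1 − τ above it, mirroring the asymmetric weighting of the check loss.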
Using φ̂_τ, we then estimate f_{t,τ} in (29) by f̂_{t,τ} = μ(r_t; φ̂_τ). Subsequently, ψ in the beta network can be similarly estimated as in (24) with f̂_t replaced by f̂_{t,τ_j}. Although it allows for quantile-dependent latent factors, this extended CQVAE-based asset pricing model needs to train the related VAE network J different times, so its computational burden inevitably increases significantly as a tradeoff. For this reason, we do not study it further below.

Regularization Implementations on the CQVAE Network
As is well known, the high capacity of neural networks makes them flexible enough to extract the most informative features from the data, but it also creates a high propensity to overfit. Besides the overfitting problem, the complex structure and rich parameterization of neural networks also lead to the problem of local optima, so the estimated parameter values, which depend on the choice of initial parameters, tend to be unstable. To overcome these two prevalent deficiencies, we implement several regularization techniques for our CQVAE network.
A standard approach to circumvent overfitting is to partition the full dataset into three disjoint parts, maintaining the temporal ordering: a training sample, a validation sample, and a testing sample. The training and validation samples are used for parameter estimation and hyperparameter tuning, respectively, while the testing sample is used only to evaluate the true out-of-sample performance of our CQVAE method. The implementation details for these three subsamples are summarized below.
For each set of hyperparameter values used to implement Algorithm 1, we first compute θ̂^(l) and φ̂^(l) at the lth iteration based on the training sample, and then calculate the related VAE validation error, defined as the value of the objective function in (22) over all datapoints in the validation sample, with θ and φ replaced by their estimators from the training sample. Following the regularization method of "early stopping," which reduces computational cost and regularizes against overfitting, we terminate the iterations in Algorithm 1 early when the VAE validation errors increase for several consecutive iterations, and take the corresponding output of Algorithm 1 as the SGVB estimators of θ and φ, from which we obtain the estimated latent factor f̂_t. Next, using this f̂_t, we apply the same "early stopping" method to implement Algorithm 2 for each set of hyperparameter values, and obtain the regularized quantile estimator of ψ together with its quantile validation error, defined as the value of the objective function in (24) over all datapoints in the validation sample, with ψ replaced by its estimator from the training sample and λ set to zero. Moreover, we choose the best SGVB estimators of θ and φ and the best regularized quantile estimator of ψ as those attaining the minimum quantile validation error across all candidate sets of hyperparameter values for Algorithms 1-2. Using the best estimators of θ, φ, and ψ, we finally evaluate the out-of-sample performance of the CQVAE model in the testing sample under given criteria.
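A generic early-stopping loop in the spirit of the procedure above can be sketched as follows; the "training" is a toy stand-in for Algorithms 1-2, and the `patience` threshold is an assumed choice.

```python
# Stop when the validation error has not improved for `patience`
# consecutive checks, returning the best iteration seen so far.
def train_with_early_stopping(val_errors, patience=3):
    best, best_iter, bad = float('inf'), -1, 0
    for i, err in enumerate(val_errors):
        if err < best:
            best, best_iter, bad = err, i, 0
        else:
            bad += 1
            if bad >= patience:
                break                     # terminate the iterations early
    return best_iter, best

# Validation error improves, then deteriorates: training stops shortly
# after the minimum instead of running through all iterations.
errors = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9, 0.95, 0.4]
best_iter, best = train_with_early_stopping(errors)
```

Note that the late value 0.4 is never reached: once the patience budget is exhausted the loop exits, which is exactly the computational saving early stopping provides.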
To make the computation of Q_{i,t}(τ) in (25) stable, an ensemble regularization approach is adopted in training our CQVAE network. Specifically, we estimate θ, φ, and ψ as described above and compute Q_{i,t}(τ), based on initial parameter values generated from a uniform distribution under a given random seed (Glorot and Bengio 2010). We then average the values of Q_{i,t}(τ) across multiple random seeds (say, 10) to obtain our final estimates of the conditional quantiles.
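The seed-ensemble step reduces to an elementwise average. In this sketch, `fit_fn` is a hypothetical wrapper that trains the network from seed-dependent initial values and returns an array of fitted quantiles Q_{i,t}(τ_j):

```python
import numpy as np

def ensemble_quantiles(fit_fn, n_seeds=10):
    """Average conditional-quantile estimates over random seeds.
    `fit_fn(seed)` returns an (N, T, J) array of fitted quantiles from
    one training run initialized with that seed."""
    draws = [fit_fn(seed) for seed in range(n_seeds)]
    return np.mean(draws, axis=0)   # elementwise average across seeds
```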
As with any deep neural network, the implementation of the CQVAE network depends on the choice of activation functions and hyperparameters, and it is nearly impossible to make this choice optimally in theory. For the activation functions, we follow Kingma and Welling (2014) in using the tanh function in (11) and (13), and follow Gu, Kelly, and Xiu (2021) in adopting the ReLU function in (14). Both tanh and ReLU are widely used in deep learning. For the hyperparameters, we summarize their descriptions and candidate values in Table 1. Here, we follow the convention in the literature to offer candidate values for λ, α, d_1, d_2, and d_3, and then use the validation sample to tune these five hyperparameters as discussed above; see, for example, Goodfellow, Bengio, and Courville (2016) and Gu, Kelly, and Xiu (2020, 2021) for similar implementations. The choice of M reflects a tradeoff between estimation robustness and computational cost. For convenience, we choose M = 64 in this article; certainly, a larger value of M can be taken if more computing resources are available. The choice of J balances approximation accuracy against computational burden. Clearly, a larger value of J usually gives a better approximation of the conditional distribution of r_{i,t}; however, it also incurs a heavier computational cost and a larger chance of the quantile-crossing issue. Taking these issues into consideration, we recommend simply choosing J = 5. Indeed, although the crossing issue can be solved via the functional delta method in Chernozhukov, Fernández-Val, and Galichon (2010), our results in the supplementary materials show that a larger value of J (e.g., J = 32) delivers only slightly better performance in asset pricing and portfolio selection. Hence, there seems to be no need to use a larger value of J in practice.
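On a finite grid of J quantile levels, one common repair for the crossing issue, the monotone rearrangement associated with Chernozhukov, Fernández-Val, and Galichon (2010), reduces to sorting the fitted quantiles in nondecreasing order. A minimal sketch (not the paper's exact procedure):

```python
import numpy as np

def rearrange_quantiles(q):
    """Monotone rearrangement on a finite quantile grid: sort the J
    fitted quantiles along the quantile axis so that the estimated
    quantile function is nondecreasing (no crossing)."""
    return np.sort(np.asarray(q, dtype=float), axis=-1)
```

For arrays of shape (N, T, J), sorting along the last axis repairs every stock-month quantile curve at once.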

The 60-year U.S. Equity Data
We apply our CQVAE asset pricing model to analyze the US equity data in Gu, Kelly, and Xiu (2020, 2021). This dataset contains (i) monthly individual excess stock returns from March 1957 to December 2016 for all firms listed on the NYSE, AMEX, and NASDAQ; and (ii) 94 different predictive characteristics for each stock, among which 61 are updated annually, 13 quarterly, and 20 monthly (see Table A.6 in Gu, Kelly, and Xiu (2020) for more details). Here, we follow Gu, Kelly, and Xiu (2021) in matching the characteristics for each stock with the corresponding returns to avoid a forward-looking bias. In total, this dataset has N = 31,925 different stocks, each with P = 94 different characteristics over T = 720 months, and the average number of stocks per month exceeds 6,200.
Although the number of stocks is quite large, we do not filter out any stock, since our CQVAE model has sufficient capacity to handle such massive data. To deal with the common missing-data problem, we replace missing returns or characteristics by their cross-sectional medians, as done in Gu, Kelly, and Xiu (2021).
For stock i, we let r_{i,t} be its return and z*_{ij,t}, j = 1, ..., 94, be its 94 different characteristics at month t. As observed by Gu, Kelly, and Xiu (2021), the asset characteristics are highly skewed and leptokurtic. To avoid the undue influence of outliers, we follow their suggestion and apply the rank normalization

z_{ij,t} = 2 rank(z*_{ij,t}) / (N + 1) − 1,

where rank(z*_{ij,t}) is the rank of z*_{ij,t} within all characteristics {z*_{ij,t}}_{i=1}^{N}. Clearly, the values of z_{ij,t} lie in (−1, 1) after rank normalization, and we use the vector z_{i,t} of normalized asset characteristics, whose jth entry is z_{ij,t}, in the analysis that follows. To check whether aggregate shocks are important in explaining the cross-section of returns, we can follow Gagliardini, Ossola, and Scaillet (2016) and add some common instruments (such as the term spread and default spread) to augment z_{i,t}. In the supplementary materials, our analysis finds that including the term spread and default spread gives almost no improvement in asset pricing and portfolio selection for the considered 60-year equity data.
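The cross-sectional rank normalization can be sketched as follows. This is a minimal illustration assuming the scaling 2·rank/(N + 1) − 1 and ignoring ties; the paper's exact treatment of ties and missing values may differ.

```python
import numpy as np

def rank_normalize(x):
    """Map one month's cross-section of a characteristic into (-1, 1)
    by cross-sectional rank: z = 2 * rank / (N + 1) - 1."""
    x = np.asarray(x, dtype=float)
    order = x.argsort()
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(x) + 1)   # ranks 1..N (ties broken arbitrarily)
    return 2.0 * ranks / (len(x) + 1) - 1.0
```

Applied month by month and characteristic by characteristic, this yields the normalized z_{ij,t} used as inputs to the beta network.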

Comparison Models
As comparisons, our empirical analysis also considers the CAE, CVAE, and CQAE models. The CAE model in Gu, Kelly, and Xiu (2021) specifies a nonlinear factor structure for the conditional mean of r_{i,t}, employing an AE to learn the latent factor f_t and a neural network to learn the nonlinear beta function β(z_{i,t−1}) (factor loading). The CVAE model is the same as the CAE model, except that it uses the VAE instead of the AE to learn f_t, as done in our CQVAE model. The CQAE model still uses the AE to learn f_t, but it follows our CQVAE model in studying the conditional distribution of r_{i,t}.
We adopt a moving-window analysis procedure for all four considered models. Specifically, we use a 35-year moving window and split the data within this window into three disjoint samples: the training, validation, and testing samples, which contain the data in the first 20 years, the middle 10 years, and the final 5 years, respectively. To reduce the computational burden, we do not move the window forward each month. Instead, we roll the window forward every five years to refit the models, where the managed portfolios x_t in (15) are taken as the input of the factor network, and the candidate values of all hyperparameters for parameter estimation and tuning are given in Table 1.
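The moving-window scheme above can be enumerated as follows. This sketch works in whole years and assumes the first window starts in 1957; the paper's exact start and end months within each year may differ.

```python
def rolling_splits(first_year=1957, last_year=2016, window=35,
                   train=20, val=10, test=5, step=5):
    """Enumerate 35-year windows split 20/10/5 into train/val/test,
    rolled forward every `step` years."""
    splits = []
    start = first_year
    while start + window - 1 <= last_year:   # window must fit in the sample
        splits.append({
            "train": (start, start + train - 1),
            "val":   (start + train, start + train + val - 1),
            "test":  (start + train + val, start + window - 1),
        })
        start += step                        # roll forward five years, refit
    return splits
```

With the defaults, this produces six windows whose 5-year test segments tile 1987-2016, so every out-of-sample month is covered exactly once.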

Statistical Performance Evaluation
In Gu, Kelly, and Xiu (2021), the out-of-sample performance of the CAE model is evaluated on the testing sample by the total and predictive R²'s (Kelly, Pruitt, and Su 2019):

R²_total = 1 − Σ_{i,t} (r_{i,t} − β̂′_{i,t−1} f̂_t)² / Σ_{i,t} r²_{i,t},  R²_pred = 1 − Σ_{i,t} (r_{i,t} − β̂′_{i,t−1} λ̂_{t−1})² / Σ_{i,t} r²_{i,t},  (32)

where λ̂_{t−1} = (1/10) Σ_{j=1}^{10} f̂_{t−j}, and β̂_{i,t−1} and f̂_t are the fitted values of β(z_{i,t−1}) and f_t, respectively, based on the estimator of the CAE model on the training sample. The total R² quantifies the explanatory power of the contemporaneous factor identification and evaluates the model's ability to characterize individual stock riskiness. The predictive R² assesses the prediction accuracy of future individual returns and measures the model's ability to explain panel variation in risk compensation. Clearly, we can compute the total and predictive R²'s for the CVAE model in the same way as for the CAE model. For the CQVAE model, we compute the total and predictive R²'s in (32) with β̂′_{i,t−1} f̂_t and β̂′_{i,t−1} λ̂_{t−1} replaced by μ̂^total_{i,t} and μ̂^pred_{i,t} as in (33), where μ̂^total_{i,t} and μ̂^pred_{i,t} are computed following (25) and (27), and β̂_{τ_j}(z_{i,t−1}) and f̂_t are the fitted values of β_{τ_j}(z_{i,t−1}) and f_t based on the estimator of the CQVAE model on the training sample. Similarly, we can compute the total and predictive R²'s for the CQAE model. Besides the R²'s for individual stocks r_t, we can also directly compute these two R²'s for managed portfolios x_t, with r_{i,t}, μ̂^total_{i,t}, and μ̂^pred_{i,t} replaced by their portfolio counterparts.

Table 2 reports the out-of-sample total and predictive R²'s for individual stocks r_t and managed portfolios x_t. From this table, we first find that, except for the case where K = 6 and the test asset is x_t, the CQVAE model always performs best in terms of both total and predictive R²'s, and its advantage over the other three models is more pronounced when K is smaller or the test asset is r_t. In particular, when the test asset is r_t, the predictive R² of the CQVAE model is at least 176% (= (1.71 − 0.62)/0.62) higher than that of the CAE model. Second, we find that the CQVAE (or CVAE) model has larger total and predictive R²'s than the CQAE (or CAE) model in most cases, especially when K is less than 2.
This demonstrates the importance of using the VAE for asset pricing, since the VAE can alleviate the overfitting problem of the AE. Third, we find that the CQVAE and CQAE models always perform better than the CVAE and CAE models, respectively, with the single exception that the CAE model outperforms the CQAE model when K = 6 and the test asset is x_t. This finding indicates that it is worthwhile to use quantile factor loadings to learn the conditional distribution of r_{i,t}. Fourth, we find that all four models perform better as K becomes larger, and the CQVAE and CVAE models perform more stably over K than the two AE-based models. Note that the value of K reflects a tradeoff between model flexibility and implementation difficulty: a larger K gives more model flexibility, but it also makes the implementation more difficult and time-consuming.
The stable performance of the CQVAE and CVAE models implies that the VAE-based models balance this tradeoff well, since they do not lose much model flexibility even with a small value of K. Finally, we find that the values of total and predictive R²'s increase dramatically when the test asset changes from r_t to x_t. This finding is expected, since the portfolios can largely average out the idiosyncratic risk in the data.
Besides the total and predictive R²'s, we also evaluate all four considered models via a mean absolute prediction error (MAPE) analysis and a model confidence set (MCS) analysis (Hansen, Lunde, and Nason 2011) in the supplementary materials. The corresponding results show that the CQVAE model has the smallest MAPE in most cases, and when K = 6, it is the single best model among the four in terms of the MCS analysis.
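The R² criteria in (32) pool squared residuals over all stock-months against the uncentered sum of squared returns. A minimal sketch, taking the fitted contemporaneous and predictive values as given inputs:

```python
import numpy as np

def total_and_pred_r2(r, fitted_total, fitted_pred):
    """Out-of-sample total and predictive R^2 as in (32): one minus the
    pooled sum of squared residuals over the (uncentered) sum of
    squared returns. `fitted_total` is the contemporaneous fit
    (beta' f_t) and `fitted_pred` the predictive fit (beta' lambda_{t-1})."""
    r = np.asarray(r, dtype=float)
    fitted_total = np.asarray(fitted_total, dtype=float)
    fitted_pred = np.asarray(fitted_pred, dtype=float)
    ss_ret = np.sum(r ** 2)                                   # uncentered denominator
    r2_total = 1.0 - np.sum((r - fitted_total) ** 2) / ss_ret
    r2_pred = 1.0 - np.sum((r - fitted_pred) ** 2) / ss_ret
    return r2_total, r2_pred
```

For the CQVAE model the same function applies, with the fitted values replaced by μ̂^total and μ̂^pred computed from the estimated conditional quantiles.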

Economic Performance Evaluation
The total and predictive R²'s, MAPE, and MCS are effective tools for assessing model performance from the statistical viewpoint, but they are inadequate for model evaluation from the economic viewpoint. As in Gu, Kelly, and Xiu (2021), we assess all four considered models by analyzing the Sharpe ratios of portfolios formed on the testing sample based on their predictions of the conditional mean and/or variance of returns.
For our CQVAE model, we use the mean-variance utility idea of Markowitz (1952), similarly as in (28), to sort all stocks by μ̂^pred_{i,t} and ĥ^pred_{i,t}, where μ̂^pred_{i,t} is the prediction of the conditional mean of r_{i,t} defined as in (33), ĥ^pred_{i,t} is the prediction of the conditional variance of r_{i,t}, and λ_a is the risk aversion parameter in the utility function. After sorting all stocks into deciles, we consider two types of portfolios. The first is the long-short portfolio, constructed by buying the 10% highest-ranking stocks (decile 10) and selling the 10% lowest (decile 1). The second is the long-only portfolio, which consists of the 10% highest-ranking stocks (decile 10) only. Using equal weights, we reconstitute the portfolios every month, where λ_a is tuned for each portfolio by maximizing the Sharpe ratio on the training and validation samples via a grid search over {0.01, 0.05, 0.1, 0.5, 1, 5, 10}. The CQAE model uses the same procedure to form portfolios, since it, like the CQVAE model, can predict both the conditional mean and variance of r_{i,t}. However, the CAE and CVAE models are unable to predict the conditional variance of r_{i,t}, so they modify the above portfolio-selection procedure by sorting all stocks on the prediction of the conditional mean of r_{i,t} only, as done in Gu, Kelly, and Xiu (2021).

Table 3 reports the out-of-sample annualized Sharpe ratios of both long-short and long-only portfolios for all four models across different choices of K, without transaction costs. From this table, we find that our CQVAE model always has the largest Sharpe ratio, whereas the CAE model always has the smallest. For the long-short portfolios, the Sharpe ratio of the CQVAE model is around 19 times higher than that of the CAE model when K = 1, and this advantage remains pronounced when K ≥ 2. For the long-only portfolios, the dominance of the CQVAE model over the CAE model weakens, although the Sharpe ratio of the CQVAE model is still at least 30.9% (= (2.16 − 1.65)/1.65) higher than that of the CAE model. The second-best model, behind the CQVAE, is the CQAE model when K = 1, 5, and 6, and the CVAE model when K = 2, 3, and 4.
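The decile construction above can be sketched as follows. The sorting score μ − λ_a·h is an assumed mean-variance form; the paper's exact utility specification in (28) may differ.

```python
import numpy as np

def decile_portfolios(mu_pred, h_pred, risk_aversion=1.0):
    """Equal-weight long-short (decile 10 minus decile 1) and long-only
    (decile 10) portfolio weights from a mean-variance sorting score.
    Assumes score = mu - lambda_a * h; a sketch, not the paper's exact rule."""
    score = np.asarray(mu_pred, float) - risk_aversion * np.asarray(h_pred, float)
    order = np.argsort(score)            # ascending by score
    n = len(score) // 10                 # stocks per decile
    low, high = order[:n], order[-n:]    # decile 1 and decile 10
    long_only = np.zeros(len(score))
    long_only[high] = 1.0 / n            # equal weights in the top decile
    long_short = long_only.copy()
    long_short[low] -= 1.0 / n           # short the bottom decile
    return long_short, long_only
```

The long-only weights sum to one, while the long-short weights sum to zero (a dollar-neutral spread portfolio), matching the two portfolio types considered above.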
In general, the CVAE model, like the CQVAE model, performs more stably over K than the CAE or CQAE model. This finding is consistent with that from Table 2.
Next, we consider transaction costs of 30 basis points (i.e., 0.3%) to account for the non-negligible slippage effect of small-cap stocks in portfolio evaluation, using the method in Engle, Ferstenberg, and Russell (2012); see similar implementations in Ao, Li, and Zheng (2018) and Li et al. (2022). Table 4 presents the out-of-sample annualized Sharpe ratios of long-short and long-only portfolios after deducting transaction costs. From this table, we find that the CQVAE model still outperforms the other models, although, as expected, transaction costs reduce the Sharpe ratios of all models.
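A simplified proxy for the cost deduction charges the proportional cost against each month's turnover. This sketch is not the exact treatment in Engle, Ferstenberg, and Russell (2012); it illustrates only the mechanics of netting out 30 bps and annualizing the Sharpe ratio.

```python
import numpy as np

def net_portfolio_returns(gross_returns, turnover, cost_bps=30):
    """Deduct proportional transaction costs from monthly returns:
    net_t = gross_t - cost * turnover_t, where turnover_t is the total
    fraction of the portfolio traded at month t."""
    cost = cost_bps / 1e4                     # 30 bps -> 0.003
    return np.asarray(gross_returns, float) - cost * np.asarray(turnover, float)

def annualized_sharpe(monthly_returns):
    """Annualized Sharpe ratio from a series of monthly returns."""
    r = np.asarray(monthly_returns, float)
    return np.sqrt(12.0) * r.mean() / r.std(ddof=1)
```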
To further illustrate the advantage of our CQVAE model, Figure 4 plots the cumulative log returns of the long-short and long-only portfolios constructed by the CAE, CVAE, CQAE, and CQVAE models with K = 6. For the long-short portfolios, this figure shows that the CQVAE model has the best cumulative log returns over time, with the CQAE model second best, regardless of whether transaction costs are considered. Both have significantly larger cumulative log returns than the CAE and CVAE models. As expected, this finding suggests that the mean-variance framework for selecting portfolios is much better than the mean-only framework. For the long-only portfolios, we obtain the same findings from Figure 4, except that the CQVAE model is occasionally worse than the CQAE model. This exception seems reasonable given that the CQVAE and CQAE models have close Sharpe ratios in this case (see Tables 3 and 4). Compared with the long-short portfolios, the long-only portfolios perform worse in most cases for all four models, probably because, unlike the long-short portfolios, they cannot diversify the panel risk as well. Furthermore, when transaction costs are considered, the cumulative log returns of all portfolios decline, as expected.
Finally, we point out that some quantile-based risk measures can also be used either to evaluate the portfolios formed above or to sort all stocks into deciles to construct portfolios directly; see similar discussions in de Castro and Galvao (2019) and de Castro et al. (2022a, 2022b). In the supplementary materials, our analysis finds that (i) for the portfolios formed above by using the mean-variance criterion to sort stocks, the CQVAE model still performs best from the viewpoint of quantile preference; and (ii) the portfolios formed by using quantile criteria to sort stocks generally perform worse than those formed by using the mean-variance criterion, in terms of both Sharpe ratios and quantile preference. These findings imply that the advantage of the CQVAE method remains under quantile criteria, and that the mean-variance criterion is better than the quantile criteria for selecting portfolios in the considered 60-year U.S. equity data.

Concluding Remarks
This article proposes a new CQVAE model for asset pricing in the realm of big return data. Unlike existing models in the literature, the key idea of the CQVAE model relies on estimating the conditional distribution of the return, from which the conditional mean of the return is accessible for asset pricing, and, as a by-product, the conditional variance and quantiles of the return are also available for portfolio selection and risk management. To form the CQVAE model, we use a step distribution function to approximate the unknown conditional distribution of the return, where this step function depends on J different conditional quantiles of the return and is optimal under the 1-Wasserstein metric loss. By assuming that each conditional quantile has a factor structure with dynamic loadings, we investigate these J conditional quantiles via a new CQVAE network, which contains a factor network to learn latent factors by the VAE and a beta network with a "multi-head" structure to learn J different beta functions (factor loadings). This CQVAE network allows asset characteristics to determine the structure of quantile-independent latent factors and quantile-dependent factor loadings, so it provides a new way to perform dimension reduction across quantile levels guided by observed covariates. Moreover, we provide a two-step estimation procedure for the CQVAE network, give some regularization implementations for practical use, and discuss an extension with quantile-dependent latent factors.
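Once the J conditional quantiles are estimated, moments of the step-distribution approximation can be read off directly. This sketch assumes the step distribution places probability mass (τ*_j − τ*_{j−1}) on the j-th quantile; the exact weighting in the paper's (25)-(27) may differ.

```python
import numpy as np

def step_mean_variance(quantiles, tau_knots):
    """Mean and variance of the step distribution that puts mass
    (tau*_j - tau*_{j-1}) on the j-th conditional quantile.

    quantiles : length-J array, Q(tau_1) <= ... <= Q(tau_J)
    tau_knots : length-(J+1) array, 0 = tau*_0 < ... < tau*_J = 1
    """
    q = np.asarray(quantiles, float)
    w = np.diff(np.asarray(tau_knots, float))   # probability masses, sum to 1
    mean = np.sum(w * q)
    var = np.sum(w * (q - mean) ** 2)
    return mean, var
```

In this way a single set of fitted quantiles delivers both the conditional mean used for asset pricing and the conditional variance used for portfolio selection.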
When the CQVAE model is applied to analyze the 60-year US equity market from 1957 to 2016, we find two remarkable advantages over the benchmark competitor, the CAE model. First, the CQVAE model has more stable (across the number of latent factors) and larger out-of-sample total and predictive R²'s than the CAE model. In particular, when the test assets are individual returns, the CQVAE model has an out-of-sample predictive R² at least 176% higher than the CAE model. This advantage is not unexpected, since the VAE used by the CQVAE model can largely alleviate the overfitting problem of the AE adopted by the CAE model. Second, the CQVAE model outperforms the CAE model by a wide margin in terms of out-of-sample Sharpe ratios for both long-short and long-only portfolios. This is because, unlike the CAE model, the CQVAE model constructs portfolios in the mean-variance framework, owing to its ability to predict both the conditional mean and variance of the return.
Overall, the CQVAE model serves as a new tool for asset pricing, and its superior capacity allows us to extract informative features from rich datasets. The core element of the CQVAE model is a new nonlinear quantile factor model with dynamic loadings, whose structure is specified by the CQVAE network. This new nonlinear quantile factor network architecture is interesting in its own right and could find wide application in many other studies.

Figure 1 .
Figure 1. The architecture of (a) the VAE network and (b) the AE network. Note: Both the VAE and AE attempt to compress the high-dimensional input (in orange) into a low-dimensional set of neurons in one hidden layer (in yellow), except that the latent space of the VAE is regularized via the variational probability and the reparameterization trick.

Figure 2 .
Figure 2. The architecture of the CQVAE network. The beta network (left panel) describes how J different quantile factor loadings {β_{τ_1}(z_{t−1}), ..., β_{τ_J}(z_{t−1})} (in green) depend on a P-dimensional vector of asset characteristics z_{t−1} (in purple) through a "multi-head" neural network with one hidden layer (in gray). Each row of green nodes represents a K-dimensional vector of factor loadings β_{τ_j}(z_{t−1}) at one quantile level τ_j. The factor network (right panel) describes how the latent factors f_t are obtained from an N-dimensional vector of individual asset returns r_t (in orange) via the probabilistic encoder of the VAE (in gray). The pink nodes in the output layer are the J different conditional quantiles, computed by multiplying each row from the beta network with the vector of latent factors (in yellow) from the factor network.

Figure 3 .
Figure 3. An example illustrating the W_1 metric. The shaded regions sum to the W_1 metric between the true distribution F_{i,t}(r) (black solid line) and its optimal approximating step distribution F*_{i,t}(r) (blue solid line) for given τ*_0, ..., τ*_5.

Figure 4 .
Figure 4. Cumulative log returns of long-short and long-only portfolios, based on the CAE, CVAE, CQAE, and CQVAE models with K = 6. Left panels: without transaction costs. Right panels: with transaction costs.

Table 1 .
Hyperparameters.
λ: tuning parameter in the l_1 penalty function — candidate values 10^{−3}, 10^{−2}, 10^{−1}.
α: learning rate in the Adam algorithm — candidate values 10^{−4}, 10^{−3}, 10^{−2}.
d_1 (= d_2 = d_3): number of units in the hidden layers.

Table 2 .
Out-of-sample R²_total and R²_pred (in percentage) comparison.

Table 3 .
Out-of-sample Sharpe ratios of long-short and long-only portfolios (without transaction costs).

Table 4 .
Out-of-sample Sharpe ratios of long-short and long-only portfolios (with 30 bps transaction costs).