Estimating Primary Demand for a Heterogeneous-Groups Product Category under Hierarchical Consumer Choice Model

Abstract This paper discusses the estimation of primary demand (i.e., the true demand before the stockout-based substitution effect occurs) for a heterogeneous-groups product category that is sold in the department store setting, based on historical sales data, product availability, and market share information. For such products, a hierarchical consumer choice model can better represent purchasing behavior. This means that choice occurs on multiple levels: A consumer might choose a particular product group on the first level and purchase a product within that chosen group on the second level. Hence, in the present study, we used the nested multinomial logit (NMNL) choice model for the hierarchical choice and combined it with non-homogeneous Poisson arrivals over multiple periods. The expectation-maximization (EM) algorithm was applied to estimate the primary demand while treating the observed sales data as an incomplete observation of that demand. We considered the estimation problem as an optimization problem in terms of the inter-product-group heterogeneity, and this approach relieves the revenue management system of the computational burden of using a nonlinear optimization package. We subsequently tested the procedure with simulated data sets. The results confirmed that our algorithm estimates the demand parameters effectively for data sets with a high level of inter-product-group heterogeneity.


Introduction
Demand estimation is critical for retailers, as it enables optimal assortment decisions and optimal inventory management. Most retail stores save their sales transaction history and, based on thus data, attempt demand estimation in various ways (e.g., using time-series models). One of the challenges of demand estimation arises from temporary stockouts of products. When facing a stockout of a product that they had intended to purchase, consumers either leave without making any purchase or purchase, as a substitute, a related product that is available in the store. The first scenario is referred to as a lost sale and the second as substitute demand. Existing inventory systems, which store only sales transactions and product inventories, are not capable of logging lost sales and substitute demand. Thus, the saved transaction data are referred to as truncated or censored. Therefore, estimating demand based on censored sales transaction data, without considering the effect of stockouts, would yield a result different from the true demand. We refer to this true demand, which is to say the first-choice demand when all products are in stock, as the primary demand.
Another central demand characteristic is the hierarchical structure of consumers' purchasing behavior. When a product category with heterogeneous groups is sold in a traditional department store setting (i.e., where a variety of products with different brands and different features are sold in the same store), consumer choice seems to occur hierarchically. This means that CONTACT Yongsoon Eun yenu@dgist.ac.kr Supplemental data for this article can be accessed at www.tandfonline.com/uiie choice occurs on multiple levels: a consumer might choose a particular product group on the first level and then purchase a product within that group on the second level. Indeed, Kök and Xu (2011) introduced two hierarchical purchasing behaviors, brand-primary process and type-primary process. (See Section 2 for a detailed literature review.) If hierarchical choice behavior exists, it affects the product substitution pattern and, in turn, the substitute demand and lost sales. Thus, in primary demand estimation, it is important to take into account hierarchical consumer choice behavior.
A widely employed demand-estimation approach is to use discrete choice models such as Multinomial Logit (MNL) or Nested Multinomial Logit (NMNL) (see Ben-Akiva and Lerman (1985) and Train (2009)). Vulcano et al. (2012) integrated the MNL model with the Expectation-Maximization (EM) algorithm. This strategy, however, is limited when a product category with heterogeneous groups is considered, due to the MNL model's Independence from Irrelevant Alternatives (IIA) property (Ben-Akiva and Lerman, 1985;Train, 2009)). As Kannan and Wright (1991) demonstrated, it is the NMNL discretechoice model that is suitable for hierarchical consumer choice behavior. Accordingly, in the present study, we integrated the NMNL model with the EM algorithm. Our work can be viewed as having extended the work of Vulcano et al. (2012) to cases wherein hierarchical consumer choice behavior exists.
We followed the procedure of Vulcano et al. (2012) wherein the estimation problem is viewed in terms of primary demand and the observed sales data are treated as incomplete observations of that demand. The demand parameters to be estimated are (i) the inter-product-group similarity; (ii) the preference vector of products; and (iii) the arrival rates. The first two originate from the NMNL choice model. To estimate all three of them, the EM algorithm was applied. Accordingly, first, the hierarchical consumer choice behavior that might underlie the demand structure was identified. To that end, the demand parameters were estimated under two different hierarchical structures: the brand-primary model and the typeprimary model. The one with the greater incomplete data loglikelihood was chosen as the demand structure. The substitute demand and lost sales were then estimated accordingly. The estimation required only censored sales transaction data, product availability, and market share information.
Unlike the case in Vulcano et al. (2012), direct application of the EM algorithm does not yield a closed-form solution in its M-step, due to the newly introduced inter-product-group similarity parameter. This is a complex multivariate maximum likelihood estimation problem. Thus, rather than resorting to nonlinear optimization packages in each M-step, we restructured the estimation problem as an optimization problem with respect to the inter-product-group similarity parameter.
We tested the proposed procedure on simulated data sets and compared the estimation results with those of Vulcano et al. (2012). The proposed method effectively reveals the hidden hierarchical demand structure and demand parameters for data sets with a high level of inter-product-group heterogeneity. The following are the most interesting numerical results.
1. Our algorithm predicted the choice hierarchy based on the substitution patterns within the product groups in the data set. This was numerically demonstrated by considering the two hierarchical structures: the brandprimary model and the type-primary model. 2. We showed numerically why more accurate demand estimation is important for optimal assortment decisionmaking and inventory management. 3. We showed how to estimate market share, one of the input variables in our estimation problem, in cases where it is unknown. In a revenue management system that deals with hundreds of thousands of products, both speed and performance are important. Computationally intensive demand-estimation methods for solving assortment problems and inventory problems within a small processing time window, unfortunately, are not practically viable. Our contribution to the literature is the development of a simple optimization problem for estimation of the primary demand for a product category with heterogeneous groups. The optimization is effectively achieved without any need to resort to a nonlinear optimization package, thus relieving a revenue management system of what otherwise would be a heavy computational burden.

Literature review
A brief review is given on the existing literature that addresses the problem of estimating stockout-based substitution rates and lost sales. Anupindi et al. (1998) presented a method to estimate demand parameters, arrival rates, and substitution probabilities. They used the EM method to obtain the maximum likelihood estimation while treating the time of stockout as missing data under a periodic-review system. To manage the rapidly growing number of parameters, they imposed a constraint that at most two products are out of stock at any given period. Campo et al. (2003) studied the impact of stockouts on purchase quantity and substitution in different consumer segments. They found that stockouts may lead to the purchase of smaller quantities, and asymmetric (non-IIA) choice shifts. They proposed a modification of the MNL model for more general stockout-based substitution. Consumers who typically purchase multiple units per purchase might decide to buy fewer packs in case of stockouts to limit the risk of buying a less familiar alternative. Consumers are exposed to the loss of intrinsic utility following the choice of a less preferred item. They allowed for heterogeneous consumer reactions to stockouts using latent class estimation. Kalyanam et al. (2007) provided an empirical study on the impact of the individual item's presence on the sales of other items. The mere presence of certain items increases the sales volume of the whole assortment. They studied the impact of the absence of each individual item on the category sales. Their model allowed for a more general substitution pattern than the MNL model. Using a hierarchical Bayes framework, they estimated parameters for a joint model of sales and stockouts. They developed a system of equations that accounted for the interrelationship between sales, demand, and stockouts. The presence of individual items could increase the overall category sales by drawing attention of consumers who seek variety. However, some stockouts increased category demand because of the elimination of clutter by removing non-popular items. Kök and Fisher (2007) developed a method to estimate the substitution and demand for products and proposed an EM method to solve the assortment planning problem. The M-step resorted to the nonlinear optimization technique due to the complexity of the likelihood function. Bruno and Vilcassim (2008) proposed a model to account for varying levels of observed product availability. They pointed out that the set of products available in a particular store is not observable even though the aggregate level of distribution can be observable. They simulated potential product assortment vectors that each consumer faces in the actual store visit by drawing multivariate Bernoulli vectors based on the observed aggregate level of availability. The parameters were estimated by using the simulation. Conlon and Mortimer (2009) considered an estimation of demand for periodic inventory system where the choice set of an individual consumer is latent. They developed a method incorporating the observation by using the EM algorithm to estimate primary demand across the unobserved choice sets. Just as in the model of Anupindi et al. (1998), the number of parameters to be estimated in the E-step grows exponentially when multiple products are simultaneously out of stock. Musalem et al. (2010) considered substitution effects due to stockouts under the periodic-review inventory system (incomplete information about product availability). Instead of the maximum likelihood estimation method, they used the Markov chain Monte-Carlo simulation method to estimate model parameters. Vulcano et al. (2012) combined the MNL choice model with the EM algorithm to estimate primary demand. They applied the EM algorithm to estimate demand parameters for the incomplete sales transaction data with hidden variables of lost sales and recaptured demand. They developed a simple procedure consisting of closed-form expressions in each EM iteration. This method is efficient for a product category with homogeneous groups; however, it is limited for a product category with heterogeneous groups where hierarchical consumer choice behavior exists.
Fisher and Vaidyanathan (2014) considered a demand estimation procedure and retail assortment optimization. They allowed assortments to vary by stores, subject to a maximum number of different assortments. They viewed a Stock-Keeping Unit (SKU) as a set of attribute levels. By applying maximum likelihood estimation to sales history, they estimated the demand for attribute levels and substitution probabilities. This estimation enabled the retailer to estimate the demand for any potential SKU, including those not currently carried.
We also provide a brief review on the literature related to hierarchical consumer choice behavior. Kannan and Wright (1991) applied the NMNL model to the coffee market and showed that the NMNL model performs better than the MNL model for the market where hierarchical consumer choice behavior exists. Kök and Xu (2011) introduced two hierarchical purchase behaviors, namely, the brand-primary process and the typeprimary process, and studied optimal product assortment planning and pricing under the two hierarchical consumer choice models. In the brand-primary process, consumers choose a brand first and then choose a product type within that brand. This process primarily models consumer choice behavior when different product types satisfy functional needs of consumers with strong brand loyalty (e.g., ice creams with different flavors). The type-primary process, on the other hand, models consumer behavior of selecting a product type first and then choosing a brand (e.g., regular coffee versus decaffeinated coffee). In this case, product types are functionally differentiated. Gallego and Topaloglu (2014) studied constrained assortment optimization problems under the nested logit model. They showed that the optimal assortment under cardinality constraints (i.e., limiting the number of products) can be obtained by solving a linear program. They also showed how to obtain an assortment with a performance guarantee of two for the NPhard assortment optimization problem under space constraints (i.e., limiting the total space consumption). The joint assortment optimization and pricing problem can be effectively solved by using the developed method for constrained assortment optimization.

Model description
We consider T period sales transaction data of a product category with heterogeneous groups (e.g., coffee, womens' dresses) that is sold in the same store. Arriving consumers follow the two-stage hierarchical choice process. Depending on how the choice process occurs, products are partitioned into K groups.
In the brand-primary process, consumers choose a brand first and then a product type within that chosen brand. In the typeprimary process, consumers choose a product type first and then a brand within that chosen product type. The product group corresponds to the product brand in the brand-primary model and the product type in the type-primary model. Hence, the product group represents the higher or first level of consumers' choice.
Time periods are indexed by t = 1, . . . , T , and product groups are indexed by k = 1, . . . , K. The set of all the products in group k is denoted by B k , and the set of available products in period t is denoted by S kt , satisfying S kt ࣪ B k . Time period t is decided such that any product in S kt is available and any unavailable product is unavailable throughout period t. This assumption is reasonable in the perpetual inventory system, in which actual stockout events can be observed. It should also be noted that perpetual inventory systems are becoming more widely used in retail environments. Even in periodic inventory systems, retailers tend to collect inventory information more frequently to mitigate the limitation of missing exact stockout events. The observed number of purchases of product j from group k in period t is denoted by z kjt . If product j from group k is not available in period t, we define z kjt = 0. We do not consider product returns, which implies z kjt ࣙ 0. We assume that for all product j from group k, there is at least one period t where z kjt > 0. Otherwise, we do not consider the product in our analysis.
The total number of purchases in period t is m t = K k=1 j∈B k z k jt . For the sake of notational convenience, we define B = (B 1 , B 2 , . . . , B K ), S t = (S 1t , S 2t , . . . , S Kt ), and z t = (z 1t , . . . , z Kt ) where z kt = (z k jt , for all j ∈ B k ). We assume that the number of consumer arrivals is Poisson-distributed with mean λ t in period t. The vector of arrival rates is denoted by λ = (λ 1 , . . . , λ T ).

Hierarchical choice process
Arriving consumers in period t follow the two-stage hierarchical choice process as shown in Fig. 1. From available products S t , they first select a product group to purchase or no-purchase. The no-purchase decision is viewed by a fictitious product group indexed by a zero. If they choose group k (k ࣔ 0), then they next select a product to purchase from S kt . If they choose the no-purchase option 0 on the first stage, they leave without any purchase.
We use the NMNL model (Ben-Akiva and Lerman, 1985) in order to model the hierarchical consumer choice process. The utility of a consumer associated with purchasing product j from where u kj represents the expected utility of the product, which is identical across all consumers; ϵ k is the random shared unobserved utility component attributable to group k; and ϵ kj is the remaining random total unobserved utility component. The utility of the no-purchase option is U 0 = u o + ϵ 0 . Note that consumers solely compare differences of utilities, so the absolute level of utility is irrelevant to consumers' choice (Train, 2009). Hence, the constant parts of the utilities are normalized to be u 0 = 0. This can be done by subtracting u 0 from all the utilities. In the NMNL model, ϵ k and ϵ kj are assumed to be independent for all j ࢠ B k , k = 1, . . . , K. This implies that products in the same group are correlated as the covariance of two utilities in group k is var(ϵ k ) > 0. After a further normalization of utilities, ϵ kj are independent and identically distributed zero-mean Gumbel random variables with scale parameter 1. The variable ϵ k are distributed so that max j∈B k U k j are Gumbel distributed with scale parameter μ ࢠ [0, 1], and ϵ 0 is a zero-mean Gumbel random variable with scale parameter μ. The zero-mean Gumbel random variable with scale parameter μ has a distribution function of where γ is Euler's constant and has an appropriate value of 0.5722. The preference (or attractiveness) of product Under the above conditions, the probability of no-purchase, denoted by P 0 (S t ; v, μ), and the probability of purchasing product j from group k (j ࢠ S kt ), denoted by P k j (S t ; v, μ), in period t are calculated in the following lemma.
Lemma 1. (Ben-Akiva and Lerman, 1985) The probability of no-purchase in period t is The probability of purchasing product j from group k (j ࢠ S kt ) in period t is given by We call scale parameter μ the inter-product-group similarity because the greatest product similarity is achieved when μ = 1 (note that μ ࢠ [0, 1]). When μ = 1, the NMNL model reduces to the standard MNL model where all products are horizontally heterogeneous. In this case, the probability of purchasing product j from group k is v k j / K m=1 l∈S k v ml + 1. As μ goes to zero, the inter-product-group heterogeneity increases. Now we consider the substitution pattern in the NMNL model. When a product is out of stock, the demand may be substituted by one of the available products. Due to the substitution effect, purchase probabilities of other available products increase. To know the substitution effect in the NMNL model, we introduce the Independence from Irrelevant Nests (IIN) property (Train, 2009). The ratio of the purchase probability of product j from group k and that of product l from group m is The MNL model has a restrictive choice behavior of the IIA property where the ratio of purchase probabilities of two alternatives are the same under two different choice sets containing the two alternatives (Ben-Akiva and Lerman, 1985;Train, 2009). In the NMNL model, the IIA property holds for products within a group because the above ratio is v k j /v ml when k = m. However, it does not hold for products in different groups. The ratio depends on preferences of products in group k and group m but not on preferences of products in a group different from k and m. This property is called the IIN because it is similar to the IIA on a group level (Train, 2009). Due to the IIN property, the NMNL model has a special substitution pattern. The substitution effect occurs in the product group level. A product's stockout creates a greater effect on products in the same group than products in different groups. Specifically, the percentage increase of purchase probability of products in the same group is greater than that of products in different groups. However, if the products are horizontally heterogeneous as in the MNL model (i.e., μ = 1), the percentage increase of purchase probabilities are all the same. We state this result in the following lemma.
Lemma 2. In the NMNL model with μ < 1, a product's stockout leads to a higher percentage increase of purchase probability in the same group than those in other groups. However, when μ = 1 (i.e., the MNL model), the percentage increase of purchase probabilities are same for all the products.
The following example shows how the substitution effect occurs in the level of product group for the NMNL model.
Example 1: Assume that there are four products to sell in a store: product 1, product 2, product 3, and product 4. Products 1 and 2 belong to group 1, and products 3 and 4 belong to group 2. The inter-product-group similarity is given by 1.5, 0.8, 1, 0.4). Note that the preference of no-purchase is normalized to be one. Using the NMNL model, we see that the purchase probability changes when product 1 is out of stock. In Table 1, we summarize the purchase probabilities and the no-purchase probability. The probabilities are in the third column when all of the products are available. The probabilities after the stockout of product 1 are in the fourth column. The ratio of probabilities after to before the stockout is in the fifth column. All of the probabilities are increased after the stockout. However, the ratio of product 2 is greater than others. Therefore, product 2, which is in the same group (with the product 1), has been greatly influenced by the stockout event.

Incomplete data likelihood function and market share
The sales data are incomplete observations of primary demands. Using the preference vector v = (v kj , j ࢠ B k , k = 1, . . . , K), inter-product-group similarity μ, and arrival rates λ = (λ 1 , . . . , λ T ), we write the incomplete data likelihood function of observing the sales data: where Pr(m t |v, μ, λ) = Pr(m t customers purchase in period t|v, μ, λ) The number of consumers purchasing in period t, which is m t , follows a Poisson distribution with mean λ t (1 − P 0 (S t )) and is the conditional probability of purchasing product j from group k given that a consumer makes a purchase. The direct maximization of the incomplete data likelihood function with respect to v, μ, and λ is not promising due to the likelihood function (and the log of it) having a complex form without much structure even for μ = 1 as reported in Vulcano et al. (2012). The continuum of maxima for the MNL model (Vulcano et al., 2012) still holds for the NMNL model. Let (v * , μ * , λ * ) be a maximizer of the incomplete data likelihood function in Equation (3). For any α > 0, we define a new preference vector v 0 = αv * and a new arrival rate: ). Therefore, (v 0 , μ * , λ 0 ) has the same likelihood value and is also a maximizer for any α > 0. Note that this is the extension of the solution presented by Vulcano et al. (2012), in which the new preference vector is v 0 = α v * and the new arrival rate Note that the new arrival rate in the MNL model can be obtained by applying μ = 1 to the new arrival rate in the NMNL model. To resolve the multiple optimal solutions, Vulcano et al. (2012) assumed that there is an exogenous estimate of the the store's market share (or market potential). We denote the market share as s is the probability of no-purchase and P kj (B; v, μ) is the purchase probability of product j from group k under the complete product assortment (i.e., no stockouts). This article also assumes that market share s is exogenously given.

Likelihood function based on primary demand
Instead of solving the Maximum Likelihood Estimation (MLE) of the incomplete data likelihood function in Equation (3), we view the estimation problem with respect to the primary demands as described by Vulcano et al. (2012). The primary demand of a product is the number of consumers whose first choice is that product.
Let Z kjt be the random number of purchases of product j from group k in period t. The number can be broken down into two terms: the primary demand of that product and the substitute demand from other unavailable products. Therefore, where X kjt is the primary demand of product j from group k and Y kjt is the substitute demand to product j from other unavailable products. We also let X 0t be the primary demand of no-purchase and Y 0t be the lost sales in period t.
If the primary demands (i.e., X kjt and X 0t ) are available, the likelihood function can be written as We let N k j = T t=1 X k jt and N 0 = T t=1 X 0t , which are the total primary demands over all the periods. The log-likelihood function is (4)

EM algorithm under the NMNL model
In this section, we describe the E-step of the EM algorithm to calculate the conditional expected log-likelihood function when purchase data z t and parameter estimators (v,μ) are given. We then describe the M-step to calculate the MLEs of the conditional expected log-likelihood function.

E-Step
In the E-step, we attempt to obtain the conditional expected values of primary demands that are denoted byX Combining the two equations, we havê Next, if product j from group k is available (i.e., j ࢠ S kt ), we havê X k jt =λ t × P k j (B;v,μ) and z k jt =λ t × P k j (S t ;v,μ). From the two equations, we havê Finally, the conditional expected value of the primary demand of no-purchase can be derived fromX 0t =λ t × P 0 (B;v,μ),X 0t + K k=1 l∈B kX kl =λ t , and s = 1 − P 0 (B;v,μ), where s is the market share. We havê We summarize the results in the following lemma.
Lemma 3. The conditional expected value of the primary demand of product j from group k in period t iŝ for all j ∈ S kt and the conditional expected primary demand of no-purchase is given byX where s is the exogenously given market share of the store.
By using the estimated primary demand ofX k jt , we estimate the substitute demand for product j from group k as follows: Note that the substitute demand is negative when the product is not available. The number of lost sales is estimated aŝ The total primary demand of product j in group k over all periods is estimated asN k j = T t=1X k jt j∈B kX k jt in Lemma 3). Note thatN k j > 0 andN 0 > 0 because T t=1 z k jt > 0 (i.e., it is assumed that z kjt > 0 in at least one period t). The conditional expected log-likelihood in Equation (4) for a given parameter set (v,μ),

M-Step
In the M-step, we find estimators to maximize the conditional expected log-likelihood given in Equation (7). In this section, we find the critical point and study its properties. All of the proofs are in Online Supplement D.
The conditional expected log-likelihood in Equation (7) is a multivariate function that depends on preference vector v and inter-product-group similarity μ. Unlike the MNL model by Vulcano et al. (2012), the closed-form solution of the parameters is difficult to obtain due to the additional parameter μ. The key idea behind our approach is to calculate the critical point of the preference vector while keeping μ constant. Then we can obtain the closed-form solution for the preference vector as a function of μ. Furthermore, the critical point is unique. We state the result in the following proposition. For a given value of μ (i.e., constant μ), the conditional expected log-likelihood function in the E-step is unimodal, and its unique optimal solution is

Proposition 1.
The log-likelihood function in Equation (7) is not jointly concave (see Vulcano et al. (2012) for the case where μ = 1); however, it is unimodal and its unique optimal solution has a closed form. Note that v * k j (μ) > 0 becauseN kl andN 0 are strictly positive. We first consider the special case where μ = 1 (i.e., the MNL model). In this case, the optimal solution becomes v * k j (μ) =N k j /N 0 , which is the ratio of the total primary demand of product j to the total primary demand of no-purchase. Note that the preference of no-purchase is normalized to be one. Hence, if the total primary demand of a product is greater than the total primary demand of no-purchase, the preference weight of the product is greater than one (and vice versa).
We now consider that μ is strictly less than one. The v * k j (μ) in Equation (8) is the product of two terms. The first term is the ratio of the total primary demand of product j to the total primary demand of its group k, which is between zero and one. This term represents the popularity of that product within its group (i.e., it is more popular in its group as this term approaches to one). Note that this term does not depend on μ. The equation in the parenthesis of the second term is the ratio of the total primary demand of its group k to the total primary demand of no-purchase. The value of μ gives a different effect on the second term depending on whether or not the ratio is greater than one. We call group k a popular group if its total primary demand is greater than the total primary demand of no-purchase (i.e., l∈B kN kl >N 0 ) and call group k an unpopular group if l∈B kN kl <N 0 . We also call group k a neutral group if l∈B kN kl =N 0 . As the level of inter-product-group heterogeneity increases (i.e., smaller value of μ), the product in a popular group has an increased preference weight, but the product in an unpopular group has a decreased preference weight. The preference weight of a product in a neutral group does not depend on the heterogeneity level. Therefore, the NMNL model assigns more preference weights to products in popular groups as the level of inter-product-group heterogeneity becomes higher.
It is difficult to maximize the conditional expected loglikelihood with respect to both the preference vector and the inter-product-group similarity at the same time. To resolve the difficulty, our proposed method is to make the EM algorithm a sub-module of an optimization problem that is called repeatedly with a different value of μ. By Proposition 1, the M-step has a simple and closed-form calculation. The optimization problem is defined in terms of μ to maximize the incomplete data likelihood in Equation (3), calling the EM algorithm as its sub-module. Specifically, let v EM k j (μ) be the outcome of the EM algorithm for a given μ. The value of μ that maximizes the incomplete data likelihood function L(v EM (μ), μ,λ) defined in Equation (3) is the optimal value of μ * and the corresponding optimal preference weight for product j from group k is obtained by v EM k j (μ * ). The number of arrivals in period t is estimated bŷ As the conditional expected log-likelihood functionLL(v, μ) in Equation (7) is continuous in both v andv, the converged value of v EM k j (μ) is the stationary point of the incomplete data likelihood function for a given μ (Wu, 1983). Therefore, the outcome of the optimization problem is the stationary point of the incomplete data likelihood function with the optimal level of the inter-product-group similarity.
The next proposition shows that market share constraint s = 1 − P 0 (B; v, μ) is kept through the iterations of the EM algorithm. Therefore, if the initial values of the preference vector and the inter-product-group similarity satisfy the market share constraint, all of the updated solutions in the iterations of the EM-algorithm also satisfy the market share constraint.

Overall estimation procedure
We numerically search for the optimal value of μ * that maximizes the incomplete data log-likelihood function in Equation (3) from the converged preference vector from the EM algorithm for each μ. We start from μ = 1 and decrease μ by a small amount δ. The optimal value of μ is achieved when the value of the incomplete data log-likelihood does not improve by decreasing the value of μ. The summarized procedure is as follows.

(E-
Step.) Update X kjt and X 0t j ࢠ B k , k = 1, . . ., 1.4 (EM-stopping criterion.) Check K k=1 j∈B k |v new k j − v k j | ≤ , or the number of iterations is greater than the maximum number of iterations, 1.4.1 If true, then go to Step 2. 1.4.2 If false, set v k j = v new k j, ∀ j ∈ B k , k = 1, . . . , K and go to 1.2 of the E-Step.
Step 3. (Stopping criterion.) Check log L new ≤ log L, or μ − δ < 0. 3.1 If true, the procedure stops with the estimation result. 3.2 If false, set μ = μ − δ and go to step 1.1 of the initialization of EM algorithm.

Brand-primary choice model and type-primary choice model
If the choice hierarchy of purchasing products is known (either the brand primary or the type-primary), then the proposed method is applied to estimate the demand parameters. Now, assume that there is no knowledge about the choice hierarchy. The hidden choice structure then can be revealed by applying the proposed method to both the brand-primary choice model and the type-primary choice model. The one with the higher likelihood value is chosen for the choice hierarchy model. We show this by using simulated data in Section 5.

Numerical study
We generated simulation data from a known demand structure where consumers follow the NMNL choice process. The proposed EM algorithm based on the NMNL choice model was applied to the simulated data to show how the algorithm estimates the demand parameters. We compared the results with the original demand parameters along with those by Vulcano et al. (2012). We simulated the data with a homogeneous arrival rate of λ = 50 over a selling horizon of T = 15 periods. We assumed that consumers followed the brand-primary choice process. There were two product brands, brand A and brand B, and three different product types in each brand, 1, 2, 3. The preference vector was given by , v B3 )) = ((1, 0.5, 0.1), (0.9, 0.4, 0.05)). The preference weight of no-purchase was normalized to be one, v 0 = 1. We used an inter-product-group similarity of μ = 0.3. In this case, the exogenously given market share is s = 0.6919. The randomly simulated sales data are summarized in Table 2. The product unavailability was exogenously set before the simulation and is represented as "NA. " Note that arrivals are unobservable in the sales data because no-purchases are unobservable (the number of arrivals is the sum of purchases and no-purchases). Table . Simulated purchases and no-purchases over  periods for a product category with two brands and three types in the brand-primary choice process (NA denotes non-availability)

Periods Brand
Type To estimate the demand parameters, we applied the EM algorithm based on the NMNL choice model to the observable sales data in Table 2. We assumed that the hierarchical type is unknown. Thus, we set up two different hierarchical models: the brand-primary model and the type-primary model. The value of incomplete data log-likelihood under the brand-primary model was −130.5036 with an estimated interproduct-group similarity of 0.25, and that under the typeprimary model is −140.5106 with an estimated inter-group similarity of one (i.e., the type-primary model was reduced to the MNL model). This means that the MNL model fit the data sets better than the NMNL model if a wrong hierarchy was imposed. The value of the incomplete data log-likelihood was greater under the brand-primary model; thus, we selected the brand-primary model for the data. The estimations of μ and v are given in Table 3. The second column shows the true values of the parameters; the third column shows the estimation results obtained using the EM algorithm based on the MNL, denoted by EM_MNL; and the fourth column shows the estimation results obtained using the EM algorithm based on the NMNL, denoted by EM_NMNL. Note that the parameter estimations by EM_MNL can also be obtained from the initial solution of EM_NMNL (initially, we set μ = 1). The percentage in the parentheses denotes the bias from the true value. The parameter bias of EM_NMNL is smaller than that of EM_MNL except forv B2 .
The MLEs are consistent, asymptotically efficient, and normal. Thus, the asymptotic variance-covariance matrix of the estimators is −[∇logL] −1 , where ∇logL is the matrix of the second derivative of the incomplete data log-likelihood with respect to the parameters evaluated at the estimated values of the parameters (Ben- Akiva and Lerman, 1985). In Online Supplement C, we present the second derivatives of the incomplete  Table 3. For example, the ASE of μ is 0.0142 and the ASE of v A1 is 0.5080. Due to the asymptotic property for the large sample data, the ASE can be reduced as the number of data increases (i.e., more period data). The estimated substitute demands by EM_MNL and by EM_NMNL are given in Table 4. Two interesting points are observed by comparing the estimations obtained from the two algorithms. First, stockouts of product A1 in periods 6, 7, and 8 tend to be substituted more often by products in brand A than products in brand B under EM_NMNL than under EM_MNL. This is expected because all of the products' purchase probabilities increase of the same percentage under the MNL model, whereas the purchase probabilities of products in the same product group increase with a higher percentage under the NMNL model after the stockout (see Lemma 2). Second, the estimated value of lost sales is much smaller under the NMNL model than under the MNL model. The estimatedμ = 0.25, which is far lower than one. This means a higher substitution within brands after stockouts (less lost sales). The primary demand estimations obtained using EM_MNL and EM_NMNL are given in Table 5. Note that the primary demands of A2 and A3 are smaller in periods 6, 7, and 8 under the NMNL model than under the MNL model. It is also noteworthy that the primary demands of brand B are larger in periods 6, 7, and 8 under the NMNL model than under the MNL model. The MNL model does not consider group-wise substitution. It thus overestimates the primary demand if its brand is the same as that of the unavailable product and underestimates the primary demand otherwise.
The average of the arrival rate estimations obtained using EM_MNL is 57.6 with 15.2% bias from the true value of 50, and that by EM_NMNL is 45.05 with a bias of −9.91%. The estimation bias is also reduced in this case. The percentage of lost sales was estimated as 157.9 597.9 = 26.4% under the MNL model, and it was estimated as Pr(lostsales) = 27.7 467.7 = 5.92% Table . Substitute demand estimations,Ŷ k j , for simulated data in Table  Periods under the NMNL model. The total substitute rate is estimated by under the MNL model, and it was estimated as 157.2/467.7 = 33.62% under the NMNL model. As mentioned earlier, a higher substitution within a product group results in the reduction of lost sales and a higher substitution rate. We used the likelihood ratio index to compare the estimation results of EM_NMNL and that of EM_MNL. The null hypothesis of the likelihood ratio test was H 0 : μ = 1, and the alternative hypothesis was H 1 : μ ࣔ 1. Let log L(EM_MNL) be the log-likelihood by EM_MNL and log L(EM_NMNL) be the log-likelihood by EM_NMNL. Note that the MNL model was a NMNL model with a constraint of μ = 1. The likelihood ratio index given by −2(log L(EM_MNL) − log L(EM NMNL)) had the chi-square distribution with a degree of freedom equal to the number of constraints, which is one in this case (see Train, 2009). As log L(EM_MNL) = −140.5106 and log L(EM_NMNL) = −130.5046, the likelihood ratio index was 20.0120. This value is greater than the 5% quantile of the chi-squared value with one degree of freedom, 3.841. Therefore, we rejected the hypothesis that the consumer choice model was MNL (i.e., μ = 1).
We also considered in and out-of-sample statistics for the model selection criteria. First, we computed the Root Mean Squared Errors (RMSEs) of the two algorithms. We estimated the number of purchases of product j from group k asλ t × P k j (S t ;v,μ). The RMSE of EM_MNL was 20.2714 and that of EM_NMNL was 14.4218. Next, we computed the Root Mean Forecast Squared Errors (RMFSEs). We simulated additional sales date from period 16 to period 20 and forecasted the number of purchases of each product obtained using the two algorithms. The RMFSE of EM_MNL was 14.4108 and that of EM_NMNL was 11.5745. The RMSE and RMFSE of EM_NMNL were smaller than those of EM_MNL, respectively. Hence, the parameters of EM_NMNL fit the data better than those of EM_MNL.
Lastly, we considered other model selection criteria: the Akaike Information Criterion (AIC) and Bayes Information Criterion (BIC). Note that BIC penalizes extra parameters more heavily than does the AIC. Both the AIC and BIC have two terms: a negative log-likelihood and a penalty term penalizing a complex model (i.e., more parameters), which are Table . Primary demand estimations,X k j , for simulated data in Table  Periods  defined by where d is the number of parameters and |data| is the number of observations. The number of observations was T t=1 K k=1 j∈B k z k jt = 699. The number of parameters was 21 for the MNL model and 22 for the NMNL model. The AIC of EM_MNL was 323.0212 and that of EM_NMNL was 305.0092. The BIC of EM_MNL was 418.5639 and that of EM_NMNL was 405.1015. The values of AIC and BIC for the EM_NMNL were smaller than those obtained for the EM_MNL. Accordingly, one parameter increase of the NMNL model (i.e., μ) is justified.

Computational time and μ
We compared the computational time of EM_NMNL with that of EM_MNL. Both algorithms were implemented using Python (version 2.7), and we used Microsoft Windows 7 on a CPU with an Intel Core i5 processor and 4 GB of RAM. The computational time difference as the two algorithms depends on the true value of μ because EM_NMNL uses μ = 1 as an initial value (see Section 4.3) and decreases the value by δ = 0.05 in each iteration. If the true value of μ is close to one (i.e., closer to the EM_MNL), EM_NMNL terminates after a lower number of iterations. For this reason, we changed the value of μ from 0.1 to 1 (in increments of 0.1) and simulated the data for each μ. We used the , v B3 )) = ((1, 0.5, 0.1), (0.9, 0.4, 0.05)), λ = 50, and T = 15 (note that the market share changes with μ). Figure 2 shows the computational time (in seconds) of the two algorithms with respect to μ. The square marks represent the computational time of EM_MNL. The cross marks represent the computational time of EM_NMNL. For example, when μ = 0.1, the computational time is 2.35 seconds for EM_NMNL and 0.14 seconds for EM_MNL. When μ = 1, the computational time is 0.24 seconds for EM_NMNL and 0.12 seconds for EM_MNL. The line overlaid on the cross marks is a regression line with a slope of −2.29. When μ = 1, the computational time of EM_NMNL is twice that of EM_MNL. In this case, the stopping criterion of Step 3 in Section 4.3 is met after the decrease of μ from one (i.e., the initial value) to 0.95. As the slope of the regression line is −2.29, the computational time increases approximately 0.23 seconds for each increase of μ by 0.1 in this example.

Substitution patterns in NMNL model
The algorithm of EM_NMNL predicts the choice hierarchy based on the group-wise substitution rather than the stockout pattern. To see this, assume that there are two brands (brand A and brand B) and two types (type 1 and type 2). The preference vector is given by (v A1 , v A2 .v B1 , v B2 ) = (1, 0.5, 1, 0.5). We considered a homogeneous arrival rate λ = 50 over T = 15. We set exogenously that type 1 is more unavailable (i.e., one product type stocks out more frequently than others): product A1 is unavailable from period 10 to period 12 and product B1 is unavailable from period 13 to period 15. We simulated two data sets: one was from the brand-primary model with μ = 0.3 and the other was from the type-primary model with μ = 0.3. Table 6 shows the two simulated data sets.
Different substitution patterns were observed in the two data sets depending on the underlying chosen hierarchy. We observed the strong substitution pattern within the same brand for the data set from the brand-primary process, whereas the strong substitution pattern within the same type for the data set from the type-primary process.
We applied EM_NMNL to those simulated data sets, assuming that the chosen hierarchy was unknown. We first discuss the result of the data set from the brand-primary process in the upper part of Table 6. The incomplete data log-likelihood value is −118.5928 under the brand-primary model and −134.5083 under the type-primary model. Hence, the algorithm predicts that the data set follows the brand-primary process. Next, we discuss the result of the data set from the type-primary process in the lower part of Table 6. The incomplete data loglikelihood value is −131.4395 under the brand-primary model and −124.8578 under the type-primary model. Thus, the algorithm predicts that the data set is from the type-primary process.

Product assortment and primary demand estimation
Better accuracy in primary demand estimation would lead to a revenue increase through optimal assortment decisions and inventory management. In this section, we use the revenue model by Van Ryzin and Mahajan (1999) to illustrate the optimal assortment decision making.
We denote the number of purchases (or demand) of product j from group k as Z k j (S) under the assortment of S. We assume that Z k j (S) is normally distributed with mean λ × P k j (S; v, μ) and standard deviation λ × P k j (S; v, μ) (i.e., the Poisson distribution is approximated by a normal distribution).
Let p be the unit revenue for each sale and c be the unit cost. For computational simplicity for inventory decision making, we assumed that c = p/2. We denote the number of units stocked as x kj (inventory level). Then, the expected profit under assortment S is When c = p/2, the optimal procurement level is simply x * k j = λ × P k j (S; v, μ) (see Van Ryzin and Mahajan (1999)). Applying the optimal procurement level, we have The structure of the optimal assortment for the MNL model was studied in Van Ryzin and Mahajan (1999), and that for the NMNL model was performed in Kök and Xu (2011). For the MNL model, the optimal assortment is to simply choose the best i alternatives in terms of the preference weights. Similarly, the optimal assortment for the NMNL model is to choose the best i alternatives for each product group. For the numerical study, we used the simulated data set in Table 2. We assumed that p = $1. The optimal assortment under the MNL model is (A1, A2, B1, B2) with the procurement level of (x A1 , x A2 , x B1 , x B2 ) = (13.7, 7.7, 11.4, 6.3). The optimal assortment under the NMNL model is (A1, B1) with the procurement level of (x A1 , x B1 ) = (15.5, 14.6).
Using those procurement plans, we computed the expected profits under the true model parameter of (1, 0.5, 0.1), (0.9, 0.4, 0.05)), μ = 0.3, and λ = 50. In Table 7, we summarize the assortment plans (i.e., the expected number of purchases), the expected sales, and the expected remaining inventory after sales using EM_MNL , lower number of available products) by using the strong substitution within the product group. The total expected revenue decreases with lower diversity, but the inventory cost decreases more (i.e., the cost saving outweighs the revenue loss). The expected profit is $12.0138 under the MNL model and $13.1005 under the NMNL model. The store has a 9% profit increase by using the NMNL model rather than the MNL model.

Market share estimation
The demand estimation algorithm proposed in this article requires knowledge of the market share. In this section, we show the effect of market share on the estimation of the primary demand and discuss how to estimate the market share when it is unknown. In Section 3.2, we showed that if (v * , μ * , λ * ) is a maximizer of the incomplete data likelihood function of Equation (3), we can generate another maximizer, (v 0 , μ * , λ 0 ), where v 0 = αv * and for any α > 0. The market share is Thus, for any α > 0, we define which is the market share for the parameter set of (v 0 , μ * , λ 0 ). We let the parameter estimators in Table 3 be (v * , μ * , λ * ) and depict the market share in terms of log 10 α in Fig. 3. When α = 1, the market share is s(1) = 0.6919. When α = 2, the market share is s(2) = 0.7276. Thus, the effect of the market share can be equivalently analyzed by using α.
The primary demand estimation depends on the market share, and hence the value of α. The parameter set of (v 0 , μ * , λ 0 ) has the primary demand estimation The primary demand of no-purchase is given bŷ We now analyze Equations (9) and (10) under the full assortment and under the partial assortment. Under the full assortment (i.e., S t = B), we haveX k jt (α) = X k jt (1). This meansX k j1 = z k j1 and implies that demand estimation does not depend on α and, thus, the market share. However, the no-purchase estimation ofX 0t (α), as can be seen in Equation (10), depends on the market share. We depict the estimated values of no-purchase in period 1 in terms of the market shares in Fig. 4(a). It shows that the estimation indeed depends on the market share. If no-purchase in period 1 is observed by some means (e.g., CCTV, hired observers), Fig. 4(a) can be used to estimate the market share.
The demand estimation, unlike the full assortment, depends on the market share under the partial assortment. To show this, we estimated the primary demand of product B2 in period 15 and drew show a graph of it with respect to s as Fig. 4(b). For example, its estimate is 5.8 when s = 0.2854 and it is 4.2 when s = 0.9266. Although the market share changes from 0.2854 to 0.9266 (an approximately 325% increase), the demand estimate for B2 decreases from 5.8 to 4.2 (approximately a 27% decrease). In this example, the estimation changes depending on the market share, but the sensitivity does not seem high.
If no-purchase under the full assortment is not available to estimate the market share (as discussed earlier), another way to estimate it is using the sales data in periods having the full assortment (i.e., no stockout). First, we estimate the number of purchases using the estimated parameters for each market share, usingẑ k jt =λ t × P k j (B;v,μ). We then select the market share s at which the sum of squared errors is minimized. The sum of squared errors is 5 t=1 K k=1 j∈B k (z k jt −ẑ k jt ) 2 .
For example, the sales data in periods 1, 2, 3, 4, 5 of Table 2 were used to estimate s. In Fig. 4(c), the sum of squared errors is depicted with respect to s. That value is minimized when s = 0.99 (the true market share is 0.6919). The graph shows that any market share greater than 0.4 does not make much difference in estimating the primary demands under the full assortment.

Extensions and further discussions
We have assumed that all consumers use the same hierarchical purchase decisions: either the brand-primary process or the type-primary process. This can be extended to the mixture of the two models. An arriving consumer follows the brand-primary hierarchy with probability q B and the type-primary hierarchy with probability q T = 1 − q B . The primary demand approach can also be applied in this mixture model. The model description and the EM algorithm are in Online Supplement A. We present an interesting numerical example where a simple MNL model is better than the mixture model, although the data set is from the mixture model.
The primary demand estimation method for the NMNL model with two-stage choice hierarchy can be extended to the NMNL model with more than two stages. The model description and the EM algorithm for the NMNL model with three stages are in Online Supplement B. We present a numerical example with the simulated data set from the choice hierarchy with three stages.
The choice model based on utility U kj = u kj + ϵ k + ϵ kj in Section 3.1 results in the closed-form solution in the M-step. If there are factors influencing the utility, the utility can be written as u k jt = β T x k jt . For example, vector x k jt can contain p kjt (the price of product j from group k in period t), an individual consumer indicator variable, individual consumer characteristics, and product characteristics. The preference weight is given by v k jt = e u k jt . The primary demand approach can still be applied, but the nonlinear optimization algorithm needs to be applied to find optimal coefficients of β kjt in the M-step. If all of the factors are stationary (i.e., not dynamic so that u k j = β T x k j ), then we can apply the two-step approach proposed by Vulcano et al. (2012): In step 1, find the optimal preference weight vector ofv in the M-step. In step 2, use the regression method to find β solving the equation logv k j = β T x. This approach is suitable when the system of equations are over-determined.

Conclusions
We studied the primary demand-estimation problem for a heterogeneous-group product category that is sold in a department store setting. For such products, the hierarchical consumer choice model better represents consumers' purchasing behaviors. We considered two versions of hierarchical choice model: the brand-primary choice model and the type-primary choice model. For demand estimation, the NMNL choice model of hierarchical choice is combined with the non-homogeneous Poisson arrivals over multiple periods. The NMNL model better represents the different substitution rates among product groups, which cannot be incorporated into the MNL model. Our approach addresses how to estimate the NMNL model parameters, which are the inter-product-group similarity and preference vector along with the arrival rates. In the estimation, we assumed that sales data along with product availability and market share information are available. We applied the EM algorithm to estimate the demand parameters. In the M-step, there is no closed-form solution maximizing the conditional expected log-likelihood function developed in the E-step. To overcome this difficulty, we proposed a method whereby the EM algorithm is considered as a sub-module that is called repeatedly by an optimization problem in terms of the inter-product-group similarity. In this way, the revenue management system is relieved of the computational burden of using a nonlinear optimization package. In the present study, we compared the estimation result with that of the MNL model by Vulcano et al. (2012) for simulated data sets reflecting a high level of inter-product-group heterogeneity. With these simulated data, the NMNL model more correctly estimated the demand parameters. By contrast, in the primary demand estimation, the MNL model overestimated the demand for a product in the group of an unavailable product and underestimated the demand for a product in the other groups. This happened because the MNL model, unlike the NMNL model, does not consider the inter-product-group heterogeneity that leads to different substitution rates of product groups. The proposed approach also can be extended to the general NMNL model in Train (2009), where different levels of product-group scale parameters are used instead of the single scale parameter. Our preliminary analysis, not included in this article, showed that the optimal preference vector maximizing the conditional expected log-likelihood in the M-step can be obtained for a given set of scale parameters. However, a multivariate optimization problem thereby is incurred. The resolution of the multivariate optimization problem will be the focus of future research.