Parametric demand learning with limited price explorations in a backlog stochastic inventory system

Abstract We study a multi-period stochastic inventory system with backlogs. Demand in each period is random and price sensitive, but the firm has little or no prior knowledge about the demand distribution or how customers respond to the selling price, so the firm has to learn the demand process while making periodic pricing and inventory replenishment decisions to maximize its expected total profit. We consider the scenario where the firm faces a business constraint that prevents it from conducting extensive price exploration, and develop parametric data-driven algorithms for pricing and inventory decisions. We measure the performance of the algorithms by regret, the profit loss compared with a clairvoyant who has complete information about the demand distribution. We analyze the cases where the number of price changes is restricted either to a given number or to a small number relative to the planning horizon, and show that the regrets of the corresponding learning algorithms converge at the best possible rates in the sense that they reach the theoretical lower bounds. Numerical results indicate that these algorithms perform very well empirically. Supplementary materials are available for this article. Go to the publisher's online edition of IISE Transactions for datasets, additional tables, detailed proofs, etc.


Introduction
Information about potential demand and customers' responses to price changes is critical for making pricing and inventory decisions. However, in many applications, especially for new products, such information is not known a priori, and the firm needs to learn it on the fly through, for example, price experimentation. In practice, however, firms may not want to conduct very frequent price experimentation, because frequent price changes can confuse customers, hurt brand reputation, and incur operational costs (e.g., the cost of changing price labels in brick-and-mortar stores). See Cheung et al. (2017) for more discussion of this issue.
In this article we consider a dynamic joint pricing and inventory control problem over a finite planning horizon, where the firm has limited knowledge about the potential demand distribution and customers' responses to selling prices. The firm dynamically determines its inventory replenishment and pricing decisions in each period, subject to a constraint on the number of price changes, and the objective is to maximize its total expected profit. We consider the setting in which the number of potential customers in each period follows a discrete, but unknown, distribution, and each customer's response (i.e., the probability of purchasing) to an offered price is drawn from a family of distributions with unknown parameters. We develop data-driven algorithms to compute pricing and inventory replenishment decisions when there are constraints on the number of price changes, and evaluate the performance of the learning algorithms by regret, which is the total profit loss compared with a clairvoyant who has complete information about the potential demand distribution and the customer response to the selling price.
Specifically, we study the scenarios where the number of price changes is limited either to no more than a positive constant, or to a small number compared with the length of the planning horizon (on the order of log T when the length of the planning horizon is T), and develop a learning algorithm for each case. We derive the regret of each learning algorithm and determine its dependence on the number of price changes allowed. For each scenario, we show that the regret of the proposed algorithm converges at the best possible rate, in the sense that its regret rate matches the lower bound for any learning algorithm on the respective problem. We also conduct numerical studies of these algorithms and show that they perform very well empirically.

Comparisons with the literature
This article is related to the research literature dealing with limited demand information in stochastic inventory control, revenue management, and joint pricing and inventory control. Our work belongs to the parametric category of joint optimization of pricing and inventory control. The works most closely related to ours are Besbes and Zeevi (2009), Broder (2011), Broder and Rusmevichientong (2012), and Cheung et al. (2017). Broder (2011) and Broder and Rusmevichientong (2012) consider dynamic pricing problems in which there is a single arrival in each period, and Broder (2011) further considers limited price experimentation. In our model, the number of arrivals per period is random, and its distribution is not known and must be learned from data. Both Broder (2011) and Broder and Rusmevichientong (2012) assume infinite initial inventory and zero holding cost; hence there are no inventory replenishment decisions. In our model, the firm makes joint decisions on pricing and inventory control. As the inventories are non-perishable, an inventory level target prescribed by a learning algorithm cannot be achieved if it is lower than the carry-over inventory from the previous period. Therefore, the learning problem we study has a constraint due to carry-over inventories. See Huh and Rusmevichientong (2009) for more discussion of the impact of carry-over inventory on the performance of learning algorithms. Cheung et al. (2017) study a dynamic pricing problem with demand learning that also considers limited price changes. As in the aforementioned literature, they also assume that initial inventory is infinite and that there is exactly one customer in each period; therefore, they have no inventory replenishment decision. In their model the firm does not know the demand function, but knows that it belongs to a known finite set of functions. In contrast, we consider a demand function drawn from a parametric class of functions with unknown continuous parameters of dimension k.
To learn the value of the parameters, we are faced with a set that has an infinite and uncountable number of elements (as opposed to the finite set in Cheung et al. (2017)). In terms of methodology, Cheung et al. (2017) apply first-order estimation (using the sample average to estimate the expectation), whereas we employ Maximum Likelihood Estimation (MLE). The regret convergence rate for their model is O(log^(m) T), the m-times iterated logarithm of T, if m price changes are allowed; for our model, the best possible convergence rate (which we achieve) is O(T^{1/(m+1)}). Besbes and Zeevi (2009) study the revenue management problem with a fixed initial inventory using both non-parametric and parametric approaches, so there are no inventory decisions. For the parametric approach, they prove a lower bound of Ω(T^{1/2}) on the regret. In their k-unknown-parameter case (which is similar to our k-identifiable case), they propose an algorithm with regret O(T^{2/3}(log T)^{1/2}); in their one-unknown-parameter case (which is similar to our well-separated case), they obtain a regret of O(T^{1/2}(log T)^{1/2}(log log T)).

Structure of this article
In the next section we formulate the joint pricing and inventory replenishment problem. In Section 3 we present learning algorithms for the well-separated case and the general case, as well as their regret rates. In Section 4 we conduct a numerical study and report the numerical results. We conclude in Section 5. The proofs of Theorems 1 to 4 are provided in the online Appendix.

Model formulation
We consider a periodic-review stochastic inventory system over a planning horizon of T periods. At the beginning of each period t, the firm sets a selling price p_t ∈ [p_l, p_h] and determines a replenishment decision, or order-up-to level, y_t ∈ Y = {y_l, y_l + 1, ..., y_h}, t = 1, ..., T. During period t, a random number of potential customers D_t arrive, and each potential customer purchases a product with probability r(p_t, z), where z ∈ Z is a parameter vector, Z is a compact and convex set, and r(p, z) is non-increasing in p for each z ∈ Z. See Broder and Rusmevichientong (2012) for a justification of the purchasing probability model using a customer's willingness-to-pay. For convenience we refer to r(p, z) as the customer response probability (to price). The numbers of potential customers over the T-period planning horizon, D_1, D_2, ..., D_T, are independent and identically distributed. Customers willing to purchase in a period are satisfied as much as possible by on-hand inventory, and unsatisfied demands are backlogged. Let h be the unit holding cost per period and b the unit backlog cost per period; the inventory ordering cost is normalized to zero. The inventory replenishment lead time is assumed to be zero. The objective of the firm is to dynamically determine its pricing and inventory replenishment decisions in each period to maximize its total expected profit.
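The one-period demand mechanism above can be sketched as follows. This is a minimal illustration, not the paper's code: the logit response is just one of the example families discussed later, and the uniform potential-demand distribution is our own placeholder for the unknown distribution w.

```python
import math
import random

def logit_response(p, z):
    """Purchase probability r(p, z) under a logit response curve (one example
    family from the article; the model only requires r non-increasing in p)."""
    return math.exp(-z * p) / (1.0 + math.exp(-z * p))

def simulate_period(p, z, d_low, d_high, rng):
    """Draw potential demand D_t (here uniform on {d_low, ..., d_high}, a
    stand-in for the unknown distribution w) and Bernoulli purchase decisions,
    so realized sales are B(D_t, r(p, z))."""
    d_t = rng.randint(d_low, d_high)                  # potential customers
    r = logit_response(p, z)
    purchases = [1 if rng.random() < r else 0 for _ in range(d_t)]
    return d_t, sum(purchases)                        # (arrivals, sales)

rng = random.Random(0)
d, s = simulate_period(p=1.0, z=1.5, d_low=5, d_high=20, rng=rng)
```

Both the arrivals d and the sales s are observable in the model, which is what makes the estimation in the next section possible.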
Let the potential demand distribution be denoted by P{D_t = n} = w_n, n = d_l, d_l + 1, ..., d_h, where d_l and d_h are the lower and upper bounds, respectively, of the potential demand. For convenience we denote w = (w_{d_l}, ..., w_{d_h}). If the firm knows the distribution of D_t as well as each customer's response probability r(p, z) (or, more specifically, the true value of z), then this is a standard dynamic joint pricing and inventory control problem that has been studied extensively in the literature. However, in our setting, the firm knows neither the distribution of D_t nor the customer response probability parameter z a priori.
In addition, the firm faces a business constraint that prevents it from conducting extensive price exploration. Thus, the firm is subject to constraints on the number of times it can change its selling price. In such scenarios, the firm has to develop a mechanism that learns the potential demand distribution and the customer response probability parameter while satisfying the constraints on the number of price changes, and exploit the extracted information to maximize its total profit.

To formulate the optimization problem, we let x_t denote the inventory level at the beginning of period t before the replenishment decision, and assume x_1 = 0. Clearly, given p_t = p and conditioning on D_t = n, the number of customers who purchase the product in period t is a binomial random variable with parameters n and r(p, z). Thus, if we let B(n, r) denote a generic binomial random variable with parameters n and r, then the total number of customers who purchase in period t can be written as B(D_t, r(p_t, z)). Given a pricing and inventory policy φ = ((p_1, y_1), (p_2, y_2), ..., (p_T, y_T)) with y_t ≥ x_t, the total expected profit over the planning horizon is

  V_φ(T) = E[ Σ_{t=1}^T ( p_t B(D_t, r(p_t, z)) − h (y_t − B(D_t, r(p_t, z)))^+ − b (B(D_t, r(p_t, z)) − y_t)^+ ) ],   (1)

where x^+ = max{x, 0}. If the firm knows the distribution of D_t and the parameter z a priori, then dynamic programming can be used to find the optimal pricing and inventory replenishment decisions that maximize problem (1). In that case, it is known (see, e.g., Sobel (1981)) that a myopic policy is optimal for problem (1). Therefore, to maximize the T-period profit in model (1), the firm only needs to solve a single-period optimization problem that maximizes G(p, y, z, w), where

  G(p, y, z, w) = E[ p B(D, r(p, z)) − h (y − B(D, r(p, z)))^+ − b (B(D, r(p, z)) − y)^+ ],   (2)

and D is the generic potential random demand. Assume G(p, y, z, w) admits a unique maximizer (p*, y*) on P × Y, with p* ∈ (p_l, p_h). Then the optimal policy φ* for the T-period problem (1) under complete information, without a limit on the number of price changes, is to implement (p*, y*) in every period.
Since the initial inventory is x_1 = 0, the level y* can be achieved in every period. As φ* implements the same price p* in every period, no price change is required; therefore φ* is also the optimal policy under limited price changes. Let d_t and s_t denote the realized potential demand and sales in period t. The firm needs to develop an adaptive policy φ that determines the selling price p_t and replenishment level y_t ≥ x_t based on the historical information ((p_1, y_1, d_1, s_1), ..., (p_{t−1}, y_{t−1}, d_{t−1}, s_{t−1})), subject to the constraints on the number of price changes and the carry-over inventories. We use regret to measure the performance of a policy φ, defined as the total profit loss of policy φ compared with that of the optimal policy φ* when complete information is available and there is no constraint on the number of price changes. That is,

  R_φ(T) = T · G(p*, y*, z, w) − E[ Σ_{t=1}^T ( p_t B(D_t, r(p_t, z)) − h (y_t − B(D_t, r(p_t, z)))^+ − b (B(D_t, r(p_t, z)) − y_t)^+ ) ],   (3)

where the expectation is taken under policy φ. For any policy φ, it holds that R_φ(T) ≥ 0, and the better the policy φ performs, the smaller the regret.
We remark that in this model, both potential demand and sales data are observable. This assumption is appropriate in online settings, in which online retailers can track total customer arrivals as well as realized purchasing decisions (and hence the probability of purchasing). However, it may not hold in traditional retail stores, where the firm observes only realized sales.
In the following section we present learning algorithms for the dynamic joint optimization of pricing and inventory control when the firm does not have information about potential demand distribution and parameters of customer response probability a priori, and it is subject to constraints on the number of price changes.

Learning algorithms and their regrets
To optimize profit, the firm needs to learn both the distribution of potential customer demand and the customer response to selling price (or parameter z). To estimate the true value of z, we employ the MLE method.
Recall that when the selling price in period t is p_t, each potential customer purchases a product with probability r(p_t, z). Let d_t be the realized number of customer arrivals during period t and (u_{t1}, ..., u_{t d_t}) the realized purchasing decisions of these customers, i.e., u_{tl} = 1 if customer l in period t purchases a product and u_{tl} = 0 otherwise, 1 ≤ l ≤ d_t. The likelihood function for the customer purchasing realizations between periods t_1 and t_2 is

  L(z) = Π_{t=t_1}^{t_2} Π_{l=1}^{d_t} r(p_t, z)^{u_{tl}} (1 − r(p_t, z))^{1 − u_{tl}},

and the maximum likelihood estimator of z, denoted by ẑ, is

  ẑ = argmax_{z ∈ Z} L(z).

In what follows we first study a so-called well-separated case, in which the firm is either constrained by a given number of price changes, or the number of price changes is restrained to be infrequent (to be specified). Then we consider a more general case. For each case, we provide the learning algorithm and the convergence rate of its regret.
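When all observations come from a single price p, the Bernoulli likelihood above is maximized by the empirical purchase frequency, and ẑ follows by inverting the response curve. The sketch below assumes the logit family r(p, z) = e^{−zp}/(1 + e^{−zp}) from the examples in this section; the function name and the clamping constant are ours.

```python
import math

def mle_logit_z(p, outcomes, eps=1e-6):
    """MLE of z from Bernoulli purchase outcomes observed at a single price p,
    assuming the logit response r(p, z) = exp(-z*p) / (1 + exp(-z*p)).
    At one price the likelihood is maximized by r_hat = purchases / customers;
    z is then recovered by inverting the response curve."""
    r_hat = sum(outcomes) / len(outcomes)
    r_hat = min(max(r_hat, eps), 1.0 - eps)   # keep the logit invertible
    # r = e^{-zp}/(1+e^{-zp})  =>  z = -ln(r/(1-r)) / p
    return -math.log(r_hat / (1.0 - r_hat)) / p
```

With multiple prices or other response families there is no closed form, and ẑ is found by a one-dimensional numerical search over Z, which is why assumption (i) below (a Fisher information bounded away from zero) matters for the estimator's accuracy.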

Well-separated customer response
We first consider a well-separated case, defined as follows. For p ∈ [p_l, p_h] and z ∈ Z ⊂ R^1, the probability distribution of the customer purchasing decision u ∈ {0, 1} is

  Q_{p,z}(u) = r(p, z)^u (1 − r(p, z))^{1−u}.

The family of distributions {Q_{p,z} : z ∈ Z} is called well-separated if for any p ∈ [p_l, p_h], {Q_{p,z} : z ∈ Z} is identifiable, i.e., Q_{p,z_1}(·) ≠ Q_{p,z_2}(·) for z_1 ≠ z_2. We introduce the following two assumptions for the well-separated case. For any p ∈ [p_l, p_h] and any z ∈ Z, (i) there exists a constant c_f > 0 such that the Fisher information I(p, z), given by

  I(p, z) = E[ (∂/∂z log Q_{p,z}(U))^2 ],

satisfies I(p, z) ≥ c_f, where U denotes a Bernoulli random variable with distribution Q_{p,z}; and (ii) there exist constants r̲ and r̄ such that 0 < r̲ ≤ r(p, z) ≤ r̄ < 1. Examples that satisfy these assumptions include: (i) P = [1/2, 2], Z = [1, 2] and the logit customer response probability r(p, z) = e^{−zp}/(1 + e^{−zp}); (ii) P = [1/3, 1/2], Z = [3/4, 1] and the linear customer response probability r(p, z) = 2/3 − zp; and (iii) P = [1/2, 1], Z = [1, 2] and the exponential customer response probability r(p, z) = e^{−zp}. See Broder and Rusmevichientong (2012) and Chen, Jasin and Duenyas (2018) for more discussion of these conditions.

First we consider the case where the number of price changes is limited to a given number m ≥ 1. To develop a learning algorithm for this case, we divide the planning horizon T into m + 1 stages, of which the ith stage consists of I_i = ⌈T^{i/(m+1)}⌉ periods, i = 1, ..., m, and the (m+1)th stage contains the last T − Σ_{i=1}^m I_i periods, where ⌈x⌉ denotes the smallest integer greater than or equal to x. During stage i ≥ 2, the algorithm implements a solution constructed from data collected in the previous stage i − 1. The algorithm then uses the realized potential demand data and realized customer purchasing data during stage i to estimate the potential demand distribution and the customer response parameter z, and solves a data-driven version of the optimization problem in Equation (2).
The resulting solution is implemented in the next stage i þ 1.
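The exponentially growing stage lengths I_i = ⌈T^{i/(m+1)}⌉ and the stage boundaries can be computed as in the following sketch (the function name is ours):

```python
import math

def stage_plan(T, m):
    """Stage lengths I_i = ceil(T^{i/(m+1)}) for i = 1..m, with the (m+1)th
    stage absorbing the remaining periods, as in Algorithm I. Also returns
    the boundaries t_i (the last period of stage i-1), so stage i covers
    periods t_i + 1, ..., t_{i+1}."""
    lengths = [math.ceil(T ** (i / (m + 1))) for i in range(1, m + 1)]
    lengths.append(T - sum(lengths))          # last stage gets the remainder
    boundaries = [0]
    for L in lengths:
        boundaries.append(boundaries[-1] + L)
    return lengths, boundaries
```

For example, with T = 5000 and m = 2 the first stage is short (on the order of T^{1/3} periods), the second is longer (order T^{2/3}), and the final stage covers most of the horizon, reflecting that later, better-informed decisions are used for longer.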
To get the algorithm started, we need some input p̂_1 ∈ [p_l, p_h] and ŷ_1 ∈ Y as the initial pricing and inventory decisions for the first stage.

Algorithm I
Step 1: Setting pricing and replenishment decisions. For stage i = 1, ..., m + 1, set the selling price and inventory level to

  p_t = p̂_i,  y_t = max{x_t, ŷ_i},  t = t_i + 1, ..., t_{i+1},

where t_i denotes the last period of stage i − 1, i.e., t_i = Σ_{j=1}^{i−1} I_j for i = 2, ..., m + 2, with t_1 = 0.

Step 2: Estimation. For stage i = 1, ..., m, estimate z and the potential demand distribution w using the realized data in stage i, namely the potential demands (d_t, t = t_i + 1, ..., t_{i+1}) and the purchasing decisions (u_{tl}, t = t_i + 1, ..., t_{i+1}, 1 ≤ l ≤ d_t), as follows:

  ẑ_i = argmax_{z ∈ Z} Π_{t=t_i+1}^{t_{i+1}} Π_{l=1}^{d_t} r(p_t, z)^{u_{tl}} (1 − r(p_t, z))^{1−u_{tl}},

and ŵ_i = (ŵ_{i,d_l}, ..., ŵ_{i,d_h}), where

  ŵ_{i,n} = (1/I_i) Σ_{t=t_i+1}^{t_{i+1}} 1{d_t = n},  n = d_l, ..., d_h.

Step 3: Data-driven optimization. For stage i = 1, ..., m, solve the data-driven optimization problem

  max_{(p,y) ∈ [p_l,p_h] × Y} G(p, y, ẑ_i, ŵ_i).   (4)

Denote its optimal solution by (p̂_{i+1}, ŷ_{i+1}), and go to Step 1.

The learning algorithm above is explained as follows. Since the price cannot be changed more than m times, the planning horizon is divided into m + 1 learning stages. These stages increase exponentially in length; the reasoning is that, as more data are collected and more accurate estimates are obtained for the potential demand distribution and the customer response probability, the resulting decisions should be used for longer periods. The algorithm consists of m + 1 iterations, with iteration i implementing the decisions for stage i and computing the decisions for stage i + 1. That is, in each iteration, the algorithm implements the pricing and inventory decision obtained from the previous iteration, and then uses the collected data to estimate the parameter z and the probability distribution of the potential demand. In particular, the maximum likelihood method is used to estimate z, and the empirical distribution is computed for the potential demand. This information is then used to construct the data-driven optimization problem (4), whose optimal solution gives the pricing and inventory decisions implemented in the next iteration.
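The implement-estimate-reoptimize loop of Algorithm I can be sketched as a skeleton. This is our own minimal rendering, not the paper's implementation: `simulate_stage`, `estimate`, and `optimize` are hypothetical plug-ins standing in for the market response, the MLE plus empirical-distribution estimation of Step 2, and the data-driven optimization of Step 3.

```python
import math

def algorithm_one(T, m, p1, y1, simulate_stage, estimate, optimize):
    """Skeleton of Algorithm I with at most m price changes.
    Hypothetical plug-in interfaces (ours, not from the article):
      simulate_stage(price, order_up_to, n_periods) -> collected stage data
      estimate(data) -> (z_hat, w_hat)
      optimize(z_hat, w_hat) -> (price, order_up_to)  # maximizes G(p, y, z, w)
    Returns the (price, order-up-to) pair used in each of the m+1 stages."""
    lengths = [math.ceil(T ** (i / (m + 1))) for i in range(1, m + 1)]
    lengths.append(T - sum(lengths))
    price, level = p1, y1
    history = []
    for i, n_periods in enumerate(lengths):
        data = simulate_stage(price, level, n_periods)   # Step 1: implement
        history.append((price, level))
        if i < m:                                        # no re-solve after last stage
            z_hat, w_hat = estimate(data)                # Step 2: estimation
            price, level = optimize(z_hat, w_hat)        # Step 3: re-optimize
    return history
```

Note that the price changes at most m times: once between each pair of consecutive stages, and never within a stage.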
Theorem 1. There exists a constant c_1 > 0 such that for any problem instance, the regret of learning Algorithm I with at most m price changes is bounded by R(T) ≤ c_1 T^{1/(m+1)}.

An important question is whether there exists a learning algorithm with m or fewer price changes that achieves a lower regret rate than Algorithm I. The following result shows that this is not possible, at least when the times of the price changes must be determined at the start.
Theorem 2. There exist problem instances such that the regret of any learning algorithm that changes prices no more than m times according to a predetermined schedule is lower bounded by Ω(T^{1/(m+1)}). That is, there exists a constant c_2 > 0 such that for any such learning algorithm φ, R_φ(T) ≥ c_2 T^{1/(m+1)}.

The two theorems above show that our algorithm achieves the best possible convergence rate for the well-separated case with a fixed number of price changes on a predetermined schedule.
It is worth pointing out that, when there is no constraint on the number of price changes, another algorithm can be developed for our problem (see Theorem 3 below) that has regret O(log T). Therefore, restricting the problem to at most m price changes increases the regret significantly, from O(log T) to O(T^{1/(m+1)}).

In the previous analysis we considered the case where the number of price changes is limited to a fixed number. In practice the limit on the number of price changes may not be so stringent, and more price changes may be allowed for longer planning horizons. In the following, we consider the scenario where the firm is allowed to change the price O(log T) times, and we refer to this constraint as infrequent price changes. We develop a learning algorithm for the joint pricing and inventory control problem with infrequent price changes, and show that the regret improves significantly, from O(T^{1/(m+1)}) to O(log T), due to the less stringent limitation on price exploration.
For the case with infrequent price changes, we design a learning algorithm that changes the price O(log T) times. More specifically, given input parameters I_0 > 0 and v > 1, the algorithm changes the price N = O(log T) times. Similar to Algorithm I, the learning algorithm for this case divides the time horizon into stages of exponentially increasing lengths and charges the same price within each stage. However, the lengths of the learning stages are different, with the length of stage i given by

  I_i = ⌈I_0 v^i⌉,  i = 1, 2, ..., N,   (5)

where N is the largest integer such that Σ_{i=1}^N I_i ≤ T, and the last stage, N + 1, has I_{N+1} = T − Σ_{i=1}^N I_i periods. As for Algorithm I, define t_i to be the last period of stage i − 1, i.e., t_i = Σ_{j=1}^{i−1} I_j, i = 2, ..., N + 2, with t_1 = 0. Thus, stage i starts in period t_i + 1 and ends in period t_{i+1}. The algorithm runs exactly as in Steps 1 to 3 of Algorithm I, but with the stage lengths given in Equation (5), and it consists of N + 1 = O(log T) iterations, a number that increases with T.
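Since the stage lengths grow geometrically, the number of stages, and hence the number of price changes, grows only logarithmically in T. The sketch below assumes stage lengths of the form I_i = ⌈I_0 v^i⌉, which is one natural reading of the construction; the exact constants and the function name are ours.

```python
import math

def geometric_stage_plan(T, I0, v):
    """Stage lengths growing geometrically, I_i = ceil(I0 * v**i), until the
    horizon is nearly exhausted; the final stage absorbs the leftover periods.
    The number of price changes N = len(plan) - 1 grows like log T."""
    lengths = []
    i, used = 1, 0
    while used + math.ceil(I0 * v ** i) < T:
        L = math.ceil(I0 * v ** i)
        lengths.append(L)
        used += L
        i += 1
    lengths.append(T - used)          # stage N + 1 takes the remainder
    return lengths

plan = geometric_stage_plan(10_000, I0=2.0, v=2.0)
N = len(plan) - 1                     # number of price changes, O(log T)
```

Doubling T adds only a constant number of extra stages, which is exactly why the regret guarantee can improve from O(T^{1/(m+1)}) to O(log T) in this regime.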
Theorem 3. There exists a constant c_3 > 0 such that the regret of the learning algorithm with O(log T) price changes is bounded by R(T) ≤ c_3 log T.

We argue that Ω(log T) is the lower bound for the regret of any algorithm for our problem with at most O(log T) (or any number of) price changes. Broder and Rusmevichientong (2012) establish such a lower bound for a special case of our problem, i.e., the dynamic pricing problem with infinite initial inventory (thus with no inventory replenishment decision) and no constraint on the number of price changes, and they show that the regret of any algorithm for their problem is lower bounded by Ω(log T). As our problem is more general than theirs, the regret of our problem is also lower bounded by Ω(log T). Therefore, Theorem 3 indicates that our algorithm achieves the best possible regret rate for the problem at hand.

General customer response
We next consider a more general case where the parameter of the customer response probability r(p, z) is a k-dimensional vector, i.e., z = (z_1, ..., z_k) ∈ Z ⊂ R^k for some integer k ≥ 1. To estimate z, we need at least k distinct prices for experimentation. For a set of prices p̄ = (p̄_1, ..., p̄_k), with a slight abuse of notation we again define the probability of the customer purchasing decisions u = (u_1, ..., u_k) ∈ {0, 1}^k by

  Q_{p̄,z}(u) = Π_{j=1}^k r(p̄_j, z)^{u_j} (1 − r(p̄_j, z))^{1−u_j}.

The family of distributions {Q_{p̄,z} : z ∈ Z} is said to belong to the general case if there exist k price points p̄ = (p̄_1, ..., p̄_k) ∈ [p_l, p_h]^k such that the family {Q_{p̄,z} : z ∈ Z} is identifiable, i.e., Q_{p̄,z_1}(·) ≠ Q_{p̄,z_2}(·) for any z_1 ≠ z_2 in Z. To emphasize the dependence on k, in this case we shall also call the family of distributions k-identifiable, and we call p̄ the exploration prices. We assume the following conditions for the general case. For any z ∈ Z, (i) there exists a constant c_f > 0 such that λ_min{I(p̄, z)} ≥ c_f, where I(p̄, z) denotes the Fisher information matrix given by

  I(p̄, z) = E[ ∇_z log Q_{p̄,z}(U) (∇_z log Q_{p̄,z}(U))^T ],

λ_min{I(p̄, z)} is the smallest eigenvalue of I(p̄, z), and U is a vector of k independent Bernoulli random variables with joint distribution Q_{p̄,z}; and (ii) there exist constants r̲ and r̄ such that 0 < r̲ ≤ r(p̄_j, z) ≤ r̄ < 1 for 1 ≤ j ≤ k. These conditions are discussed in Broder and Rusmevichientong (2012) and Chen, Jasin and Duenyas (2018), and they are satisfied by a number of families of demand curves, such as: (i) P = [1/2, 2], Z = [1, 2] × [−1, 1] and the logit customer response probability r(p, z) = e^{−z_1 p − z_2}/(1 + e^{−z_1 p − z_2}); (ii) P = [1/3, 1/2], Z = [2/3, 3/4] × [3/4, 1] and the linear customer response probability r(p, z) = z_1 − z_2 p; and (iii) P = [1/2, 1], Z = [1, 2] × [0, 1] and the exponential customer response probability r(p, z) = e^{−z_1 p − z_2}. These conditions ensure that we can estimate the customer response parameter from purchase observations at the prices p̄ by the MLE method.
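Condition (i) can be checked numerically. For the two-parameter logit example with independent Bernoulli purchases, a standard computation gives each price's contribution to the Fisher information matrix as r_j(1 − r_j) · [[p_j², p_j], [p_j, 1]]; the sketch below sums these contributions and returns the smallest eigenvalue using the closed form for a symmetric 2×2 matrix. All function names are ours.

```python
import math

def logit_r(p, z1, z2):
    """Two-parameter logit response r(p, z) = e^{-z1 p - z2}/(1 + e^{-z1 p - z2})."""
    a = -(z1 * p + z2)
    return math.exp(a) / (1.0 + math.exp(a))

def fisher_min_eig(prices, z1, z2):
    """Smallest eigenvalue of the Fisher information matrix for k = 2 under
    the logit response: each price p_j contributes
    r_j (1 - r_j) * [[p_j^2, p_j], [p_j, 1]] to the matrix [[a, b], [b, c]]."""
    a = b = c = 0.0
    for p in prices:
        r = logit_r(p, z1, z2)
        w = r * (1.0 - r)
        a += w * p * p
        b += w * p
        c += w
    tr, det = a + c, a * c - b * b
    # closed-form min eigenvalue of a symmetric 2x2 matrix
    return tr / 2.0 - math.sqrt((tr / 2.0) ** 2 - det)
```

With two distinct exploration prices the matrix is positive definite, so λ_min > 0; with a single (repeated) price it is singular, which is exactly why at least k distinct prices are needed to identify z.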
See Besbes and Zeevi (2009) and Chen, Jasin and Duenyas (2018) for more discussions on these conditions and their implications.
To learn the k-dimensional customer response parameter z, we use k price changes in the learning algorithm. The algorithm, described below, divides the planning horizon T into two phases. In the first, or exploration, phase, we experiment with the k exploration prices p̄_1, p̄_2, ..., p̄_k, where p̄ = (p̄_1, ..., p̄_k) is such that Q_{p̄,z} is identifiable, together with an initial order-up-to level ȳ. The exploration phase consists of kI periods, where I = ⌈T^{1/2}/k⌉, and it is divided into k stages, with each stage having I periods. Thus, each of the k prices p̄_1, ..., p̄_k is experimented with for I periods. At the end of the exploration phase, we use the collected data to compute an empirical probability mass function for the potential customer demand and to estimate the parameter z. These are then used to construct an empirical objective function that is optimized to find the pricing and inventory control decisions to be implemented in the second, or exploitation, phase.
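The exploration schedule, k blocks of I = ⌈T^{1/2}/k⌉ periods each, can be laid out as in the following sketch (the function name is ours):

```python
import math

def exploration_schedule(T, exploration_prices):
    """Price schedule for the exploration phase of Algorithm II: each of the
    k exploration prices is held for I = ceil(sqrt(T) / k) periods, and the
    remaining T - k*I periods form the exploitation phase."""
    k = len(exploration_prices)
    I = math.ceil(math.sqrt(T) / k)
    schedule = []
    for p in exploration_prices:
        schedule.extend([p] * I)        # one stage of I periods per price
    return schedule, T - k * I          # (exploration prices, exploitation length)
```

Holding each price for about T^{1/2}/k periods balances the profit sacrificed during exploration against the estimation error carried into exploitation, which is what produces the O(T^{1/2}) regret below.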

Algorithm II
Step 1: Exploration. Set the pricing and order-up-to levels, for i = 1, ..., k, to

  p_t = p̄_i,  y_t = ȳ,  t = (i−1)I + 1, ..., iI.

Step 2: Parameter estimation. Estimate the potential demand distribution and compute the maximum likelihood estimator of z from the realized potential customer demand data d_{[1,kI]} = (d_t, t = 1, 2, ..., kI) and customer purchasing decisions u_{[1,kI]} = (u_{tl}, t = 1, 2, ..., kI, l = 1, 2, ..., d_t) at prices p_{[1,kI]} = (p_t, t = 1, 2, ..., kI), as follows:

  ẑ = argmax_{z ∈ Z} Π_{t=1}^{kI} Π_{l=1}^{d_t} r(p_t, z)^{u_{tl}} (1 − r(p_t, z))^{1−u_{tl}},

and ŵ = (ŵ_{d_l}, ..., ŵ_{d_h}), where

  ŵ_n = (1/(kI)) Σ_{t=1}^{kI} 1{d_t = n},  n = d_l, ..., d_h.

Step 3: Data-driven optimization and exploitation. Solve the data-driven optimization problem

  max_{(p,y) ∈ [p_l,p_h] × Y} G(p, y, ẑ, ŵ),

and let the optimal solution be (p̂, ŷ). For periods t = kI + 1, ..., T, set the pricing and inventory level to

  p_t = p̂,  y_t = max{x_t, ŷ}.

The following theorem presents the regret rate of the learning algorithm relative to the true maximum total profit when complete demand information is available to the firm.

Theorem 4. There exists a constant c_4 > 0 such that for any problem instance, the regret of learning Algorithm II is bounded by R(T) ≤ c_4 T^{1/2}.

Note that in the learning algorithm described above, the price is changed k times in order to adequately learn the k unknowns in the parameter z. If fewer than k price changes are allowed, then we can never learn all k elements of z, and the regret will be linear in T. If, on the other hand, k or more price changes are allowed, then our algorithm yields a regret of O(T^{1/2}), which matches the lower bound rate Ω(T^{1/2}) for learning algorithms for this class of problems. Indeed, we can show that even if both the pricing and inventory decisions are allowed to change in each and every period (i.e., there is no constraint on the number of changes in pricing decisions), the regret of any learning algorithm is lower bounded by Ω(T^{1/2}).
This lower bound is established in Broder and Rusmevichientong (2012) for a dynamic pricing problem with infinite initial inventory, but it also holds in our more general setting with joint pricing and inventory replenishment decisions.
Numerical study

The feasible region for the order-up-to level is Y = {0, 1, 2, 3, 4}. The demand response function r(p, z) and the feasible region P for the selling price p are described below.
We use the following metric to evaluate the performance of the algorithms: the percentage profit loss per period compared with the optimal profit of the complete-information problem, which is

  R(T) / (T · G(p*, y*, z, w)) × 100%.

We compute the percentage profit loss per period over 500 rounds, then report the average value in Table 1.
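This metric is a direct computation (the function name and sample values are ours):

```python
def pct_profit_loss_per_period(regret, optimal_per_period_profit, T):
    """Average percentage profit loss per period relative to the
    complete-information optimum: R(T) / (G(p*, y*, z, w) * T) * 100."""
    return regret / (optimal_per_period_profit * T) * 100.0

# e.g., a total regret of 50 over T = 1000 periods, with an optimal
# per-period profit of 2, is a 2.5% average loss per period
loss = pct_profit_loss_per_period(50.0, 2.0, 1000)
```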
From Table 1, it is seen that when T = 10^3, most percentage profit losses are below 10%, with three exceptions, and performance improves to around 1% when T = 10^5. For the well-separated case, within each column, the improvement exhibits diminishing returns: when only very few price changes are allowed, adding one more price change yields a significant improvement, but the marginal improvement shrinks as more and more price changes are allowed.

Conclusion
We consider a dynamic joint pricing and inventory control problem in which the firm has little or no prior knowledge about the distribution of potential customer demand in each period or about customer responses to its selling price, and, in addition, the firm is subject to business constraints that prevent extensive price exploration. We consider several scenarios and develop learning algorithms that satisfy the constraints on the number of price experiments. We derive the regrets of these learning algorithms and show that they achieve the best possible convergence rates, in the sense that they reach the lower bounds on the regret of any algorithm for the respective classes of problems. Numerical results show that the algorithms perform very well.
In most real-world applications it is unlikely that the firm has complete information about the distribution of customer demand; hence learning is an important part of the firm's decision-making process. In this study we consider the scenario where customer responses are drawn from a parametric class of distributions. An interesting direction for future work is to explore the impact of the number of price changes in a non-parametric model.
The inventory setting considered in this article is a backlog model, i.e., demand (from customers who made purchases) that cannot be satisfied by on-hand inventory is backlogged. One important assumption made in this article is that all potential customers are observed, regardless of whether or not they purchase products. Although this assumption may be plausible in some applications (e.g., online retailing), it is conceivable that in some settings the retailer is unable to observe or identify customers who did not make a purchase because of a high selling price. Another direction for future research is to consider lost sales, i.e., demand that cannot be satisfied immediately is lost; in this case there are again two scenarios, depending on whether or not the retailer can observe the lost customers, with the latter leading to censored demand data. These all remain interesting potential directions for future research.