From Amazon to Apple: Modeling Online Retail Sales, Purchase Incidence and Visit Behavior

In this study we propose a multivariate stochastic model for website visit duration, page views, purchase incidence and the sale amount for online retailers. The model is constructed by composition from carefully selected distributions, and involves copula components. It allows for the strong nonlinear relationships between the sales and visit variables to be explored in detail, and can be used to construct sales predictions. The model is readily estimated using maximum likelihood, making it an attractive choice in practice given the large sample sizes that are commonplace in online retail studies. We examine a number of top-ranked U.S. online retailers, and find that the visit duration and the number of pages viewed are both related to sales, but in very different ways for different products. Using Bayesian methodology we show how the model can be extended to a finite mixture model to account for consumer heterogeneity via latent household segmentation. The model can also be adjusted to accommodate a more accurate analysis of online retailers like apple.com that sell products at a very limited number of price points. In a validation study across a range of different websites, we find that the purchase incidence and sales amount are both forecast more accurately using our model, when compared to regression, probit regression, a popular data-mining method and a survival model employed previously in an online retail study.


Introduction
Sales conversion rates for online retailers are notoriously low (Moe and Fader 2004; Venkatesh and Agarwal 2006), with Lin et al. (2010) estimating the rate at 2.3%. This contrasts markedly with offline retailers, who have much higher conversion rates (Moe and Fader 2004). While an online nonsale incurs little monetary cost, many businesses sell exclusively online (e.g., amazon.com, expedia.com and orbitz.com) and would be frustrated by low online conversion rates (Lin et al. 2010). Moreover, even retailers that have both an offline and online presence can realize advantages from increasing their online sales. At the heart of increasing online sales is developing a website that engages the consumer. A simple and often-used measure of engagement with a website is the duration of a visit, sometimes referred to as "stickiness," which has been linked to online retail profits (Bucklin and Sismeiro 2003; Johnson et al. 2003; Venkatesh and Agarwal 2006). Another related stickiness measure is the number of page views (Danaher et al. 2006), which Manchanda et al. (2006) show is positively related to higher repeat purchase rates by online consumers.
Previous studies have linked duration with purchase incidence (Moe and Fader 2004; Montgomery et al. 2004; Van den Poel and Buckinx 2005), page views with purchase incidence (Manchanda et al. 2006; Van den Poel and Buckinx 2005), both duration and page views to purchase incidence (Lin et al. 2010) and duration to sales amount (Danaher and Smith 2011). However, what is missing from the literature is a simultaneous analysis of purchase incidence, sales amount, visit duration and number of pages viewed. We address this here by developing a quad-variate stochastic model. The marginal distributions of these variables are very different, and the dependence structure highly nonlinear, so that there is no known suitable multivariate distribution to employ. Therefore, we construct one by composition, where the component distributions are flexible and shown to fit well empirically.
One component is the bivariate distribution of sales and duration, conditional on page views and purchase incidence, which is captured by a bivariate Gaussian copula model. While copula models are used widely in multivariate modeling (Nelsen 2006; McNeil, Frey and Embrechts 2005), they have only recently been employed in marketing models; see Danaher and Smith (2011), Glady, Lemmens and Croux (2010) and Park and Gupta (2012) for examples. An advantage of our model is that it can be estimated rapidly using maximum likelihood, even for the very large sample sizes that can occur in online retail studies. In our study we employ data from a panel of approximately 100,000 internet users in the United States, whose internet activity was observed continuously over 2007. We fit our model to datasets from nine of the largest US retail sites, and use this to address a number of research questions.
First, we determine the impact of the website stickiness measures on sales. To do this we derive analytically from our stochastic model the expected sales and probability of purchase, conditional on one or both of visit duration and page views. These conditional expectations are difficult to estimate directly using regression-style models, both because they are highly nonlinear, and because the four variables are determined simultaneously. We also show how our model can be adjusted to account for online retailers with products at a very limited number of price points, such as apple.com, where the primary sale item is a $0.99 song.
Second, in a validation study we investigate the extent to which our model can be used to forecast both purchase incidence and sales amount, given the website visit variables.
For all nine websites examined, the forecasts from our model prove to be more accurate using both stickiness measures together, compared to either alone, and a naïve benchmark that uses neither. We also compare the accuracy of forecasts from our model to those from a linear regression and probit model using the stickiness measures as covariates, a popular data-mining method and an adaptation of the survival model developed by Manchanda et al. (2006) for purchase incidence. Throughout, we find our method dominates these alternatives in a statistically significant fashion.
Third, we demonstrate the versatility of our model by using it to investigate empirically the relationships between the visit and sales variables, both within and across different product categories. We examine the relationship for three websites in each of three product categories: books and digital media, travel services and apparel. Our model reveals that the relationship between sales and website visit duration and page views is complex, nonlinear and occurs in a form that is not readily decomposed in an additive fashion. A number of strong similarities in the relationships for retailers with the same product categories are uncovered. For example, even though amazon.com's sales are consistently higher than barnesandnoble.com's because of higher basket totals, the purchase probability as a function of visit duration and page views is nearly identical at these two websites. However, there can be strong differences across product categories. For example, there is evidence that consumers research products online first, before buying online later, when buying apparel, but not books and digital media. We also find that apple.com is the only website among those in our study where expected sales amounts for purchases decrease when the visit duration is longer than 4 minutes; the relationship is monotonically increasing for all other websites. This reflects the unique goal-directed behavior of customers who visit the apple.com site.
Last, we extend our model to a latent class finite mixture model, where the mixture segments are quad-variate distributions of the type developed here. Estimation is via Bayesian Markov chain Monte Carlo methodology and the number of latent segments is selected using a variant of the Deviance Information Criterion that is robust to label-switching. We find that latent segmentation at the household level improves forecast accuracy for two websites in the validation study, but either does not affect or decreases accuracy for six other sites. However, using the example of oldnavy.com, we show that there is strong evidence of consumer heterogeneity in purchase patterns. We identify three market segments that exhibit features that are consistent with the taxonomy of online consumer behavior discussed by Moe (2003) and Moe and Fader (2004).

Modeling Website Visits and Online Sales
In this section we first introduce the data used in the empirical analysis. We then develop and motivate our proposed stochastic model, highlighting some of its advantages over alternative approaches. We show how to estimate the model, both in the case where sales amounts are considered continuously distributed, as well as for the case where retailers, such as apple.com, offer products mostly at only a few price points. From the model we derive the expected sales amount and the probability of a purchase, conditional on pages viewed and duration of a visit. Last, we use these to derive predictions in a validation study, which we later show dominate forecasts from a number of alternative methodologies.

ComScore Data
The data used in this study were collected during 2007 by comScore and made available by subscription via the Wharton Research Data Service (WRDS). The database comprises a subsample of approximately 100,000 members of a panel of over two million internet users from across the United States. Members of the panel have software installed on individual machines in a household which passively captures all website visitation and online transaction activity. The smaller subsample is formed in a proprietary manner based on demographic variables so that it is representative of the wider population of US internet users. The domain names of websites visited are recorded, along with the total number of page views (P) at each domain visited and the total duration (D) of the visit to a particular website.
An indicator denotes whether each visit results in a purchase (B = 1), or not (B = 0). If one or more items are purchased during the visit, the total sale amount (S) for the basket is also recorded. The data are collected at machine level and present an opportunity to investigate how the relationship between online transactions and visit behavior varies within and across online retailers.
Table 1 gives the top twenty online retail websites during 2007, ranked both by total sale amounts and by the total number of purchases. Unsurprisingly, amazon.com is the top-ranked website for total sales, yet is ranked second behind apple.com for the total number of purchases. However, as we see later, a substantial number of purchases at apple.com correspond to purchases of a single song at the relatively low amount of $0.99, and so the website is only ranked twentieth by total sales. The table reveals that online retail activity during 2007 was dominated by sales in apparel, print and digital media, travel services, shipping services, photo processing, computing and electronic equipment, homeware and health and beauty products. We examine the first three of these product categories in detail in Section 3, and select the three major websites in each category. These are listed in Table 4, along with the sample sizes for each site, which are large; for example, there are 407,805 visits to amazon.com, and 268,437 to apple.com. However, this represents less than 5% of the activity at these sites by the full comScore panel, and less than 0.1% of activity at these sites by all US internet users.

Stochastic Model
We model the distribution of the two website visit variables, duration time, $D > 0$, and number of page views, $P \in \{1, 2, 3, \ldots\}$, jointly with the purchase indicator, $B \in \{0, 1\}$, and the sale amount $S \geq 0$. At first glance it might appear that each margin can be modeled separately, with dependence captured by a four-dimensional copula function. Copula modeling is a popular method for constructing multivariate distributions in statistical (Nelsen 2006), econometric (Trivedi and Zimmer 2005) and financial (McNeil, Frey and Embrechts 2005) analysis, although copulas have only recently been employed in the marketing literature (Danaher and Smith 2011; Park and Gupta 2012). They permit the combination of univariate marginal distributions that need not come from the same distributional family, yet still allow for dependence among variables. However, in this situation direct use of a copula model has limitations. First, the sale amount is zero when no purchase is made in a visit, so that S is degenerate at 0 when B = 0, and this cannot be accounted for using existing four-dimensional parametric copula functions. Second, the popular Gaussian and t copulas (McNeil, Frey and Embrechts 2005, p. 191) only have 6 and 7 parameters, respectively, while most Archimedean copulas have only a single dependence parameter. This proves insufficient to capture the dependence structure between these four variables, which our empirical work shows is more nuanced.
Instead, we construct the joint distribution via composition as
$$F(S, B, D, P) = F_1(S, D \mid B, P)\, F_2(B \mid P)\, F_3(P), \quad (2.1)$$
from the component distributions $F_1$, $F_2$ and $F_3$. This has a number of advantages. First, with an appropriate choice of component distributions, nonlinearities and other complexities in the dependence between the sales and website visit variables can be uncovered. We derive the purchase probability and expected sales amount, conditional on the visit variables, to provide insight into these relationships. Second, the degeneracy of S at 0 is easily accounted for by modeling $F_1$ differently when B = 1 or B = 0. Third, we identify parametric distributions for each component in equation (2.1) that fit our website data well, distributions that have been used to model similar variables previously. Furthermore, some of the components can be modeled semiparametrically, a property that will be exploited to handle retailers that make most sales at a small number of discrete price points. Fourth, estimation of the parameters of the distribution in equation (2.1) is straightforward using maximum likelihood.
Last, because equation (2.1) is a fully-specified joint distribution, it can be readily extended to account for market segmentation, as we show in Section 4.
In building our model we first select an appropriate distribution for the total number of page views per visit, denoted $F_3$. A popular model for page views is the negative binomial distribution (NBD), which has been used previously by Danaher (2007) and Huang and Lin (2006). The distribution can be derived as a Gamma-Poisson mixture and is robust to observation-level heterogeneity. In our database, each observation corresponds to a visit to a retailer's website, where at least one page is viewed, so $P \geq 1$. To account for this we truncate the NBD by removing the zero case from the probability mass function.
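As a concrete illustration, the zero-truncated NBD can be sketched in a few lines. The parameterisation below (shape r and success probability q, so the untruncated NBD has mass q**r at zero, which is removed and the remainder renormalised) is an assumption of this example; the paper defers the exact form to Table 2.

```python
import math

def trunc_nbd_pmf(p, r, q):
    """Probability mass function of the zero-truncated NBD for p >= 1.

    Assumed parameterisation: r > 0 is the shape parameter and
    0 < q < 1 the success probability, so Pr(P = 0) = q ** r in the
    untruncated NBD; that zero mass is removed and the rest renormalised.
    """
    if p < 1:
        return 0.0
    # log NBD pmf: log C(p + r - 1, p) + r log q + p log(1 - q)
    log_pmf = (math.lgamma(p + r) - math.lgamma(r) - math.lgamma(p + 1)
               + r * math.log(q) + p * math.log(1.0 - q))
    return math.exp(log_pmf) / (1.0 - q ** r)
```

Summing the pmf over p >= 1 returns one, confirming the renormalisation.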
The remaining two distributions $F_1$ and $F_2$ in the decomposition of equation (2.1) are both conditional on the number of page views. We account for this by making the parameters of the two distributions functions of P. In particular, we partition P into contiguous intervals $\mathcal{P}_1, \mathcal{P}_2, \ldots, \mathcal{P}_K$ that cover the range of P, and allow the parameters of $F_1$ and $F_2$ to differ in each partition, so they are step functions with respect to P. We select the partition cut points for each website so that there are approximately an equal number of visits that result in a purchase within each partition. We select the number of partitions using 5-fold cross-validation as discussed in Section 2.3. In our empirical work we show that this greatly enhances the quality of fit, as well as substantially improving prediction accuracy. Conditional on the page view partition, we model the purchase indicator, $F_2(B \mid P \in \mathcal{P}_k)$, with a Bernoulli distribution.
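The partitioning rule just described, with cut points chosen so that each interval holds roughly the same number of purchase visits, amounts to taking quantiles of the page view counts of purchase visits. A minimal sketch (the function name and interface are ours, not the paper's):

```python
def partition_cutpoints(purchase_pageviews, K):
    """Cut points on page views P so that the K contiguous intervals
    hold roughly equal numbers of purchase visits."""
    pv = sorted(purchase_pageviews)
    n = len(pv)
    cuts = []
    for k in range(1, K):
        c = pv[(k * n) // K]          # empirical k/K quantile
        if not cuts or c > cuts[-1]:  # keep cut points strictly increasing
            cuts.append(c)
    return cuts                       # interval k runs from one cut to the next
```

Ties in the discrete page view counts can leave fewer than K usable intervals, which is why duplicate cut points are dropped.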
The bivariate distribution of sale amount and duration of visit, denoted $F_1$, differs depending on whether or not a purchase occurs during the visit. When there is no purchase (B = 0), the bivariate distribution is degenerate at S = 0, so that $F_1(S = 0, D \mid B = 0, P \in \mathcal{P}_k) = F_{1D}(D \mid B = 0, P \in \mathcal{P}_k)$. The distribution $F_{1D}(D \mid B = 0, P \in \mathcal{P}_k)$ is univariate, relates only to the website duration, and is well modeled as an Inverse Gaussian. This distribution was identified as optimal using the AIC and BIC from a list of alternative distributions that also included the Gamma, Weibull, Log-Logistic and Log-Normal. It has also been used previously to model duration over heterogeneous populations (Hougaard 1984; Johnson, Kotz and Balakrishnan 1994, p. 291), which is precisely the situation here.
However, when a purchase does occur, so that S > 0, then $F_1$ is a bivariate distribution which we model using a copula model, as now detailed. We label the two univariate distributions as $F_{1S}(S \mid B = 1, P \in \mathcal{P}_k)$ and $F_{1D}(D \mid B = 1, P \in \mathcal{P}_k)$, which makes explicit the conditioning on purchase incidence and page views. We "couple" these together using a bivariate copula function C with dependence parameter $\theta_k$ that varies over partitions, so that
$$F_1(S, D \mid B = 1, P \in \mathcal{P}_k) = C\big(F_{1S}(S \mid B = 1, P \in \mathcal{P}_k),\, F_{1D}(D \mid B = 1, P \in \mathcal{P}_k);\, \theta_k\big). \quad (2.2)$$
We employ the bivariate Gaussian copula
$$C(u, v; \theta) = \Phi_2\big(\Phi^{-1}(u), \Phi^{-1}(v); \theta\big),$$
where $\Phi^{-1}$ is the inverse standard normal distribution function and $\Phi_2$ is the distribution function of a bivariate normal distribution with zero means, unit marginal variances and correlation $-1 < \theta < 1$. The copula accounts only for the dependence between the two variables, with positive dependence when $\theta > 0$, negative dependence when $\theta < 0$ and independence when $\theta = 0$; see Song (2000) for a discussion of the Gaussian copula. Other bivariate copulas C can also be employed easily here, with comprehensive lists given by Nelsen (2006), McNeil et al. (2005) and Trivedi and Zimmer (2005). The Gaussian copula has symmetric tail dependence (McNeil et al. 2005, pp. 208-222), which proves an unrealistic assumption in risk management applications. To investigate whether this is also unrealistic here, we considered a copula constructed as a linear combination of Clayton, Gumbel and Gaussian copulas. The Clayton copula allows for lower tail dependence, and the Gumbel for upper tail dependence.
When estimated using maximum likelihood on the data from amazon.com, we found minor or no meaningful tail dependence for any page view partition, and therefore retain the Gaussian copula for its robustness and simplicity.
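For reference, the bivariate Gaussian copula density (the form listed in Table 2 is not reproduced here, so the standard closed form is used) can be sketched with only the standard library:

```python
import math
from statistics import NormalDist

_N = NormalDist()  # standard normal: cdf and inverse cdf

def gaussian_copula_density(u, v, theta):
    """Density of the bivariate Gaussian copula C(u, v; theta),
    -1 < theta < 1 (standard closed form, e.g. Song 2000)."""
    x, y = _N.inv_cdf(u), _N.inv_cdf(v)
    t2 = theta * theta
    return (1.0 / math.sqrt(1.0 - t2)
            * math.exp(-(t2 * (x * x + y * y) - 2.0 * theta * x * y)
                       / (2.0 * (1.0 - t2))))
```

At theta = 0 the density is identically 1, recovering independence of the two margins.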
For the univariate distribution of visit duration we again employ an Inverse Gaussian, but for sales amount we employ a Log-Logistic distribution. These were identified as optimal, using the AIC and BIC, from alternatives that included the Gamma, Inverse Gaussian, Weibull, Log-Logistic and Log-Normal distributions. We note that the Log-Logistic has been used to model sales amount previously by Oyer (2000).
For apple.com, the 77,398 recorded sales occur at only 151 non-zero price points, with 87.33% of all purchases being for exactly $0.99, which corresponds to the sale of a single song from the iTunes store. The next most popular price point, representing 4.76% of total purchases, is $9.99, and corresponds to the purchase of an album. Clearly, the sales amount S does not follow a Log-Logistic distribution, or any other well-known parametric distribution.
For this retailer, we therefore employ the empirical distribution function (EDF) for S in each page view partition, giving an estimated distribution function $\hat F_{1S}(s \mid B = 1, P \in \mathcal{P}_k)$ for each k. This is a nonparametric estimator for each ordinal-valued distribution, with values at all the unique observed price points. The ability to combine parametric copula functions with one or more nonparametric marginal distributions is widely considered a strength of the copula approach to constructing bivariate distributions (Shih and Louis 1995).
Table 2 lists the component distributions in the model, including their probability density or mass functions and unknown parameters. For the websites other than apple.com, there are 8 parameters for each page view partition and 2 for the modified NBD for the number of page views itself, resulting in 8K + 2 parameters in total.

Estimation
When parametric distributions are used for the components in equation (2.1), each component can be estimated using maximum likelihood, and the resulting point estimates also maximize the joint likelihood. The parameters of the distributions $F_1$ and $F_2$ are estimated separately for each partition, and $F_1$ also for the two values of B. When B = 1, estimation of the bivariate copula model in equation (2.2) is also straightforward using maximum likelihood, as outlined in Cherubini et al. (2004, pp. 154-156). The ease with which copula models can be estimated is one reason for their popularity.
In the case of discrete pricing, estimation of the bivariate distribution $F_1$ differs because $F_{1S}(S \mid B = 1, P \in \mathcal{P}_k)$ is modeled as a nonparametric discrete distribution using the EDF.
The parameters of $F_{1D}(D \mid B = 1, P \in \mathcal{P}_k)$ are estimated using maximum likelihood, as is each copula parameter $\theta_k$, but conditional on the two estimated margins as follows. For each page view partition, given the EDF for S and the estimated Inverse Gaussian distribution for duration D, the likelihood can be calculated as follows. Let $(s_i, d_i)$ be the ith observation of the pair $(S, D)$, let $u_i = \hat F_{1S}(s_i)$ and let $u_i^-$ be the left-hand limit of the step function $\hat F_{1S}$ at $s_i$, with $v_i = F_{1D}(d_i \mid B = 1, P \in \mathcal{P}_k)$. Then the mixed density of $(S, D)$ is obtained by differencing with respect to the discrete-valued S, and differentiation with respect to D. The conditional likelihood for partition k is
$$L_k(\theta_k) = \prod_i \big[ C(u_i \mid v_i; \theta_k) - C(u_i^- \mid v_i; \theta_k) \big]\, f_{1D}(d_i \mid B = 1, P \in \mathcal{P}_k),$$
where $C(u_1 \mid u_2; \theta) = \frac{\partial}{\partial u_2} C(u_1, u_2; \theta)$, $f_{1D}$ is an Inverse Gaussian density, and the product is taken only over observations i where sales are made (B = 1) and with page views in partition k. This likelihood is easily maximized with respect to $-1 < \theta_k < 1$ by grid search. This approach to estimating a bivariate copula model is widely called 'inference for margins', which Joe (2005) shows provides consistent estimates with only a very minor reduction in efficiency compared to full maximum likelihood when the marginal distributions are known.
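The grid search over the copula parameter is straightforward to sketch. For the Gaussian copula the conditional copula has the closed form C(u | v; theta) = Phi((Phi^-1(u) - theta * Phi^-1(v)) / sqrt(1 - theta^2)); the tuple interface for the observations and the grid step below are choices of this example, not the paper's.

```python
import math
from statistics import NormalDist

_N = NormalDist()

def cond_copula(u, v, theta):
    """Conditional Gaussian copula C(u | v; theta) = dC(u, v; theta)/dv."""
    if u <= 0.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    x, y = _N.inv_cdf(u), _N.inv_cdf(v)
    return _N.cdf((x - theta * y) / math.sqrt(1.0 - theta * theta))

def fit_theta_by_grid(obs, step=0.01):
    """Maximise the conditional likelihood over a grid of theta in (-1, 1).

    Each element of `obs` is (u, u_minus, v): the sales EDF at s_i,
    its left-hand limit, and the fitted duration CDF at d_i.
    """
    best_theta, best_ll = 0.0, -math.inf
    theta = -1.0 + step
    while theta < 1.0 - step / 2.0:
        ll = 0.0
        for u, u_minus, v in obs:
            mass = cond_copula(u, v, theta) - cond_copula(u_minus, v, theta)
            if mass <= 0.0:          # numerically empty interval: reject theta
                ll = -math.inf
                break
            ll += math.log(mass)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
        theta += step
    return best_theta
```

With synthetic observations in which large sales quantiles co-occur with large duration quantiles, the search settles on a positive theta, as expected.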
Last, we select the number of partitions K for each website using 5-fold cross-validation (CV). For each fold j, the metric is the log score
$$CV_j = \sum_i \log f(s_i, b_i \mid p_i, d_i),$$
where the summation is over the observations in fold j, and the mixed density is from the model fitted using the other four folds; the overall score is $CV = \sum_{j=1}^{5} CV_j$. The mixed density is computed as $f(S, B \mid P, D) = f(S \mid B, P, D)\, f(B \mid P, D)$, where $f(S \mid B, P, D)$ is derived from the copula model for $F_1$, and $f(B \mid P, D)$ is computed as in equation (2.3) below. The last column of Table 4 gives the optimal number of partitions in the range 1 to 30 identified by this approach for all nine websites examined in our study.
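The K-selection step can be expressed generically: hold out each fold, refit on the remainder, and accumulate the log score on the held-out observations. A sketch, where the fit/logdensity callback interface is an assumption of this example:

```python
import random

def cv_score(data, fit, logdensity, folds=5, seed=0):
    """Cross-validated log score of the kind used above to choose K.

    `fit` maps a training list to a fitted model object; `logdensity`
    maps (model, observation) to the log mixed density of that
    observation.  Larger scores indicate better out-of-sample fit.
    """
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    score = 0.0
    for j in range(folds):
        test = set(idx[j::folds])                       # fold j
        train = [data[i] for i in idx if i not in test]  # other folds
        model = fit(train)
        score += sum(logdensity(model, data[i]) for i in test)
    return score
```

In practice one would call this once per candidate K (here folded into the `fit` callback) and keep the K with the largest score.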

Putting the Model to Use
Our primary aim is to understand the impact of the visit variables on both sales and purchase incidence. A simple, yet useful, summary measure is the overall (or "marginal") level of pairwise dependence between all six pairs of variables. Spearman's rho is an appropriate measure of dependence even when the margins are far from Gaussian, and can be computed for the distribution in equation (2.1) by first simulating many iterates from this distribution, and then computing the correlation coefficient of the ranked iterates. To simulate an iterate, first simulate $P \sim F_3$, then $B \sim F_2$ and then $(S, D) \sim F_1$. For the latter, if B = 0 then S = 0 and only D needs generating; while if B = 1 then the pair are generated from a bivariate Gaussian copula as outlined in Cherubini et al. (2004, p. 181). The pairwise dependence between S and D, conditional on a purchase being made (B = 1) and page views P, can also be computed for the bivariate Gaussian copula, with Spearman's rho $\rho_k^C = (6/\pi)\arcsin(\theta_k/2)$.
We label this with a superscript "C" to distinguish it from the marginal Spearman's rho.
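Both Spearman measures above are easy to compute: the marginal version from the ranks of simulated iterates, and the copula version from the closed form rho = (6/pi) * arcsin(theta/2). A standard-library sketch:

```python
import math

def spearman_rho(xs, ys):
    """Spearman's rho of two equal-length samples via average ranks."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        i = 0
        while i < len(vs):
            j = i
            while j + 1 < len(vs) and vs[order[j + 1]] == vs[order[i]]:
                j += 1                      # extend over tied values
            avg = (i + j) / 2.0 + 1.0       # average rank for the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

def copula_spearman(theta):
    """Spearman's rho implied by a Gaussian copula with parameter theta."""
    return (6.0 / math.pi) * math.asin(theta / 2.0)
```

The closed form maps theta = 0 to rho = 0 and theta = 1 to rho = 1, so the copula rho is always slightly smaller in magnitude than theta itself in between.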
The impact of page views and duration can be assessed by evaluating the expected sales amount and probability of purchase, conditional on the visit variables, as follows. By Bayes' theorem, the probability of a purchase is
$$\Pr(B = 1 \mid D, P) = \frac{f_{1D}(D \mid B = 1, P)\Pr(B = 1 \mid P)}{f_{1D}(D \mid B = 1, P)\Pr(B = 1 \mid P) + f_{1D}(D \mid B = 0, P)\Pr(B = 0 \mid P)}. \quad (2.3)$$
Here, $f_{1D}(D \mid B = 1, P)$ and $f_{1D}(D \mid B = 0, P)$ are the two Inverse Gaussian densities computed at the point D, and $\Pr(B \mid P)$ is the Bernoulli purchase probability.
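This purchase probability combines the two fitted Inverse Gaussian duration densities with the Bernoulli probability via Bayes' rule. A sketch (the (mu, lambda) parameter values used in the examples are illustrative only, not estimates from the paper):

```python
import math

def inv_gaussian_pdf(d, mu, lam):
    """Inverse Gaussian density with mean mu > 0 and shape lam > 0."""
    return (math.sqrt(lam / (2.0 * math.pi * d ** 3))
            * math.exp(-lam * (d - mu) ** 2 / (2.0 * mu ** 2 * d)))

def purchase_prob(d, p_buy, pars_buy, pars_nobuy):
    """Pr(B = 1 | D = d, P) by Bayes' rule from the fitted components.

    p_buy is the Bernoulli purchase probability for the page view
    partition; pars_buy / pars_nobuy are (mu, lambda) pairs for the
    duration densities given B = 1 and B = 0.
    """
    f1 = inv_gaussian_pdf(d, *pars_buy) * p_buy
    f0 = inv_gaussian_pdf(d, *pars_nobuy) * (1.0 - p_buy)
    return f1 / (f1 + f0)
```

When the two duration densities coincide, duration carries no information about purchase and the function returns the prior p_buy unchanged.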
The expected sales $E(S \mid D, P) = \int s\, f(s \mid D, P)\, ds$ is obtained via univariate numerical integration, except in the case of apple.com, where the discrete domain of sales is summed over instead. The density (or mass function for apple.com) $f(s \mid D, P)$ can be derived analytically from the components as follows. First, note that
$$f(s \mid D, P) = f(s \mid B = 0, D, P)\Pr(B = 0 \mid D, P) + f(s \mid B = 1, D, P)\Pr(B = 1 \mid D, P),$$
where $f(s \mid B = 0, D, P)$ is a point mass of 1 at s = 0 and $\Pr(B = 1 \mid D, P)$ is given at equation (2.3). For continuously distributed sales amounts,
$$f(s \mid B = 1, D, P) = c\big(F_{1S}(s \mid B = 1, P), F_{1D}(D \mid B = 1, P); \theta\big)\, f_{1S}(s \mid B = 1, P),$$
where $f_1$, $f_{1D}$ and $f_{1S}$ are the density functions of the distributions $F_1$, $F_{1D}$ and $F_{1S}$, and $c(u, v; \theta) = \frac{\partial^2}{\partial u \partial v} C(u, v; \theta)$ is the copula density. The copula density for a bivariate Gaussian copula is given in Table 2. For sales amounts with a discrete distribution,
$$\Pr(S = s \mid B = 1, D, P) = C(b \mid v; \theta) - C(a \mid v; \theta),$$
where $b = F_{1S}(s \mid B = 1, P)$, $a = F_{1S}(s^- \mid B = 1, P)$ is the left-hand limit of $F_{1S}$ at s, and $v = F_{1D}(D \mid B = 1, P)$.
An advantage of computing the conditional expectation E(S|D, P) and probability Pr(B = 1|D, P) from equation (2.1), rather than modeling them explicitly, is that they capture the contemporaneous and nonlinear dependence between all four variables. We use the conditional expectation and purchase probability derived above in our empirical work in Section 3.
We also use them to make predictions in a validation study to further motivate our model, as discussed below.

Model Validation
We demonstrate that our model improves prediction of sales, conditional on both visit variables, compared to some alternative approaches. These include a naïve forecast as a benchmark, linear modeling, a popular data-mining method and a survival model from the marketing literature. We also show that forecasts constructed using both visit variables are more accurate than those that use only one.
We apply these approaches to the nine websites examined in detail in Section 3. The holdout sample size is $n_f = 5{,}000$ for seven sites, and $n_f = 10{,}000$ for the two sites with the largest number of observations, amazon.com and apple.com. The holdout sample is stratified with respect to sales amount S, so that it is more representative than a simple random sample, although the conclusions are the same in either case. We fit our model to the data for each website and compute the probability of a purchase $\hat b_i$, and expected sales $\hat s_i$, for each observation i in the holdout sample using the expressions in Section 2.4. These are used to predict purchase incidence and sales amount. We label our method 'SM', and compare it to alternative methods which construct forecasts as follows:
• Naïve (N): The historical purchase incidence ($\hat b_i = \bar B$), and the average sales amount $\hat s_i = \bar S$.
• Linear Model (LM): A probit model for purchase incidence, and a regression model for the logarithm of the sales amount of transactions, both with D and P as linear covariates. The probability of a sale $\hat b_i$ is computed from the probit model, and expected sales as $\hat s_i = E(S \mid B = 1, D_i, P_i) \times \hat b_i$, where $E(S \mid B = 1, D_i, P_i)$ is the expected sales for a transaction obtained from the regression model.
• CART: We employ the popular "Classification and Regression Tree" data-mining method with D and P as input variables; once for purchase incidence, and a second time for the logarithm of sales amounts of transactions. Forecasts for sales amount for all visits are then computed by multiplication in the same manner as for the linear model.
• Survival Model (M06): This is an adaptation of the basic survival model proposed by Manchanda et al. (2006) for online purchase incidence. These authors employ a proportional hazards model with an advertising exposure measure as a covariate, a step function for the baseline hazard, and observe only purchases. In comparison, we employ page views as a covariate, an empirical distribution function for the baseline hazard (i.e., a Cox proportional hazards model) and employ both purchase and nonpurchase visits.
• SMP: This is a trivariate variant of our model excluding duration, so that forecasts are constructed conditioning only on page views.
• SMD: This is a trivariate variant of our model excluding page views, so that forecasts are constructed conditioning only on duration.
For each method we compute the root mean square error (RMSE), both for predictions of purchase incidence and sales amount, except for M06, which only provides predictions of purchase incidence. A number of conclusions can be drawn from the results. First, our proposed model (SM) provides forecasts that are more accurate than the naïve benchmark, N, and the competing methods LM, M06 and CART. The differences are significant for both forecasts of purchase incidence and sales amount, except for CART forecasts for apple.com. Second, it is clear that duration of a visit and the number of page views are useful in forecasting sales incidence and amount, with N dominated, particularly for retailers of apparel. Third, including both D and P as predictors improves forecast accuracy over and above using either separately. Fourth, Table 3 also includes validation results for a mixture model, but we discuss these in Section 4 where it is introduced. Last, we note that small improvements in forecast accuracy for each visit have the potential to make a substantial impact because of the very large number of visits. For example, our panel of approximately 100,000 users made over 407,000 visits to amazon.com in 2007, so the total number of visits by U.S. internet-enabled households to this website alone would number in the hundreds of millions during that year.
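The comparison metric is ordinary RMSE over the holdout observations, applied once to the purchase incidence predictions and once to the sales amounts:

```python
import math

def rmse(preds, actuals):
    """Root mean square error between predictions and holdout values."""
    n = len(preds)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(preds, actuals)) / n)
```

For incidence, `preds` holds the predicted purchase probabilities and `actuals` the observed 0/1 indicators; for sales, the predicted and observed amounts.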

Empirical Analysis
We now demonstrate the usefulness of our stochastic model for investigating the relationship between the website visit variables, purchase incidence and sales for major online retailers in three product categories: books and digital media, travel services and apparel. We showcase apple.com separately, given its unique online retail product assortment and its highly discrete pricing structure.

Books and Digital Media
The first website we examine is amazon.com, which was the world's largest online retailer by total sales in 2007. Amazon.com sells products in a wide variety of classes, but has a particular focus on books (47% of sales) and digital media products, such as DVDs and CDs of movies and music (28% of sales). There are 407,805 visits by comScore panelists to amazon.com in our data, with 31,851 (7.81%) of these visits resulting in a purchase. Table 5 contains the page view partitions, and the number of observations in each page view range.
Table 6 reports the parameter estimates and 95% confidence intervals for the component distributions in equation (2.1), and it can be seen that all the parameters vary substantially over the page view ranges. From the estimates for the distribution $F_2$, the probability of a purchase being made in the lowest page view range is small, at $\Pr(B = 1 \mid 1 \le P \le 10) = 0.6\%$. As one would expect, this increases monotonically with the number of page views, so that for visits with between 76 and 99 page views the probability of a purchase rises to $\Pr(B = 1 \mid 76 \le P \le 99) = 30.9\%$. For each bivariate Gaussian copula we also compute Spearman's rho for the dependence between sale amount and visit duration, assuming a purchase occurs. This is also reported in Table 6, and is positive for all but one page view partition, with the largest value being 0.106.
Thus, it might initially appear that dependence between the duration of a visit and the sale amount is quite low. However, Table 7 reports the matrix of pairwise marginal Spearman correlations, and that between duration and sale amount is $\rho_{S,D} = 0.259$. To illustrate the usefulness of our model, Table 7 also contains the marginal Pearson sample correlations, which differ from the Spearman correlations because they do not take into account the highly non-Gaussian distribution of the variables. The Pearson correlations substantially understate the relationship between both visit variables (D and P) and the sales amount (S).
We also compute expected sales, conditional on both duration and page views. Figure 1 plots the results as a three-dimensional surface "sliced" at the mid-point of each page view partition. As the number of page views increases there is a corresponding increase in the expected sale amount. However, the same is not true for duration: there is a visit duration that results in a maximum level of expected sales for each page view level. The relationship of the two visit variables with expected sales cannot be represented in an additive fashion, and is instead a bivariate nonlinear interaction.
Figure 1 shows that the link between the two visit variables and expected sales is very different for each variable. We examine this further by computing expected sales, marginalizing, respectively, over page views and duration. These expectations can be computed in closed form, as detailed in the Appendix. In Section 4 we show how our model can be extended to incorporate household-level heterogeneity via latent segmentation.
Barnes & Noble is the second-largest online book retailer in the U.S., so we also examine its website bn.com for comparison. The site offers a product range based primarily around books, in contrast to Amazon's broader offering. In 2007 Barnes & Noble had approximately 9% of the traffic and 6% of the total sales of the much larger Amazon.
Nevertheless, Table 5 shows that the partitions for the page view deciles for these two retailers are broadly similar.
Figure 2 also plots expected sale amount and probability of purchase for visits to bn.com, conditional on the website visit variables for the fitted model. In comparison to amazon.com, expected sale amount peaks for slightly shorter visits of duration 52 minutes, but at a much lower value of $10.08. Clearly, amazon.com proves more successful at converting each individual visit into a higher sales amount. Moreover, if a purchase does occur, Figure 2(c) shows that expected sale amount does not increase as quickly with duration as for amazon.com. However, there appears to be little difference in the underlying purchase rates for these two websites. Figures 2(d) and 2(e) reveal that the purchase probabilities conditional on, respectively, duration and page views are similar for both websites. Hence, the difference in expected sale amount appears to result from higher basket totals at amazon.com. Interestingly, visits with very similar durations of 42.4 and 41.2 minutes have the maximum purchase probability of 0.23 for amazon.com and bn.com, respectively.

Apparel and Travel Services
Table 1 shows that the retailers jcpenney.com, victoriassecret.com and oldnavy.com are the fifth, sixth and seventh largest retailers as measured by total number of purchases. All three are major apparel retailers, although jcpenney.com has the most diversified product lineup, victoriassecret.com is a niche retailer and oldnavy.com sells family fashion and accessories.
In addition, Table 1 shows that the sites expedia.com, orbitz.com and travelocity.com are the third, fifth and seventh largest retailers as ranked by total value of sales. They all provide travel services and have a product portfolio that is more homogeneous than that of the three apparel retailers. We estimate our stochastic model for each of these six sites and present some of the key relationships between sales and visits in Figure 3.
Figures 3(a) and 3(c) show that victoriassecret.com and jcpenney.com both derive higher sales from visits than oldnavy.com, presumably due to their differing product lineups. More interestingly, it appears that victoriassecret.com is particularly successful in translating visits with longer durations into higher sales amounts when a purchase is made. Moreover, all three apparel retailers appear to convert longer-duration visits into higher sales more effectively than either the two book retailers or the three travel service providers.
Figure 3(b) shows that the three travel service providers have differing degrees of success at converting visits of longer duration into sales. Travelocity.com is the most successful, with the highest expected sale amount of $141.80 for visits of duration 112 minutes. The differences between the three sites appear driven entirely by differing abilities to convert longer-duration visits into purchase events. Once the expected sale amount is computed conditional on a purchase being made, there is minimal difference between the three travel service providers, as depicted in Figure 3(d), which likely reflects the similarity of the products offered at these travel websites.
Figures 3(e) and 3(f) depict the probability of a sale being made against the number of page views for the six websites. Generally, higher page views correspond to a higher probability of purchase. Interestingly, the sites that are least successful at converting longer-duration visits into sales are not necessarily poor at converting more page views into more purchases.
For example, oldnavy.com has the highest purchase probability among all retailers as the number of page views increases.

Research Online and Buy Online Later
A final observation that applies just to the apparel retailers concerns Figure 3(c). A subtle feature of this plot is a small dip in sales between 5 and 10 minutes. This is due to some short-duration visits (less than 10 minutes) where the expected sales are relatively high. We conjecture that this is due to recent prior visits to the website that are strictly browsing, without a sale being made. During this time a shopper likely peruses the merchandise, possibly checking out competing websites. Eventually, when the decision to purchase is made, the transaction time is relatively quick, because product research has been completed prior to the actual purchase visit. Such behavior is very plausible because online research prior to bricks-and-mortar purchase is commonplace (Krillion 2008; Mendelsohn et al. 2006). All we are suggesting here is that the eventual purchase is made online rather than in-store.
We test this conjecture by dividing purchase visits into those that are fast (≤ 10 minutes) and slow (> 10 minutes), and then calculating the proportion of households in these two groups that visited the same website within the 48 hours prior to the eventual purchase visit, but did not purchase anything during those prior visits. Table 8 reports these proportions. Fast purchasers of apparel products research online before making the purchase much more often than slow purchasers (between 27.5% more often for jcpenney.com and 62.7% more often for oldnavy.com), consistent with our conjecture. This behavior is replicated by purchasers of travel services, but to a lesser extent. However, there is very little difference between the online product research undertaken by fast and slow purchasers of books or digital media products, where the purchase risk is lower.

Discrete Sales Categories for Apple.com
Apple.com has a high rate of conversion of visits into purchases, at 29.1% in 2007. Moreover, the estimated relationship between the website visit and sales variables is very different from that of the other retailers in our study, where purchasing visits have a higher expected sale amount as duration increases. For visits with a small number of page views, the expected sale amount at apple.com is close to $0.99. However, this increases markedly for visits with 10 or more page views.
We also compute the Spearman correlations between the variables for the fitted stochastic model. There is lower dependence between the visit and sales variables when compared with amazon.com in Table 7. This is particularly true for the purchase indicator, with ρB,D = 0.223 and ρB,P = 0.183. This suggests that visitors to apple.com are more goal-directed than those at amazon.com, as might be anticipated for a website that is tailored primarily towards transactions rather than browsing.

Bayesian Model
While both the NBD and Inverse Gaussian are popular model choices for data that arise from heterogeneous populations, it is likely the data exhibit further heterogeneity. For example, Moe and Fader (2004) propose a taxonomy of four groups of online shoppers, based on visit behavior and purchase incidence. Therefore, to further account for household-level consumer heterogeneity we employ a finite mixture model with latent segmentation. This approach is well-established in marketing (Kamakura and Russell 1989; Allenby and Rossi 1998), although usually using a mixture of normals, whereas we consider a mixture of the quad-variate stochastic models in equation (2.1).
Consider a mixture with M latent segments, and probability π_l that a household is a member of segment l. The distribution of the visit and sales variables is then

f(S, B, D, P) = Σ_{l=1}^{M} π_l f^l(S, B, D, P),     (4.1)

where segments are denoted with a superscript.
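The mixture density in equation (4.1) is simply a weighted sum of segment densities. A sketch, with univariate Gaussian components standing in for the paper's quad-variate segment distributions f^l:

```python
import numpy as np

def normal_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def mixture_pdf(y, pi, mus, sigmas):
    """f(y) = sum_l pi_l * f^l(y), with Gaussian stand-ins for f^l."""
    return sum(p * normal_pdf(y, m, s) for p, m, s in zip(pi, mus, sigmas))
```

Provided the weights pi sum to one, the mixture integrates to one like any density, whatever the segment components.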
Bayesian estimation of mixture models is popular, with posterior inference computed using Markov chain Monte Carlo (MCMC) methods; for example, see Diebolt and Robert (1994).
This includes the ability to profile the segments, which can be difficult using other likelihood-based methods (Wedel and DeSarbo 2002). Latent multinomial variables are introduced to specify segment membership for each household h, where L_h = l if household h is a member of segment l. We denote the set of latent variables for all H households as L = {L_1, . . ., L_H}.
Conditional on L, the likelihood of the mixture model is simplified with respect to the parameters of each segment l. MCMC samplers that generate the latent variables explicitly, and the parameters conditional upon these, are popular; for example, see Diebolt and Robert (1994) and Lenk and DeSarbo (2000). In the Supplementary Material we outline such a sampling scheme to generate J iterates from the posterior distribution of the mixture model parameters, augmented with the household latent variables.
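One sweep of such a data-augmentation sampler alternates two conditional draws: segment labels given the parameters, then mixture weights given the labels. A hedged sketch, again with Gaussian components standing in for the quad-variate segment models:

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_sweep(y, pi, mus, sigmas, alpha):
    """One sweep: draw L_h for every household, then the weights pi."""
    M = len(pi)
    # Pr(L_h = l | ...) is proportional to pi_l * f^l(y_h)
    dens = np.stack([pi[l] * np.exp(-0.5 * ((y - mus[l]) / sigmas[l]) ** 2) / sigmas[l]
                     for l in range(M)])
    probs = dens / dens.sum(axis=0)
    labels = np.array([rng.choice(M, p=probs[:, h]) for h in range(len(y))])
    counts = np.bincount(labels, minlength=M)
    new_pi = rng.dirichlet(alpha + counts)  # conjugate Dirichlet update for pi
    return labels, new_pi
```

Iterating this sweep J times, after burn-in, yields draws from the joint posterior of the labels and weights.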
In our Bayesian analysis we adopt a Dirichlet prior π = (π_1, . . ., π_M) ∼ Dirichlet(α), a common choice because it is conjugate to the multinomial. To make the mixture model more flexible, we treat α as a hyperparameter with a uniform hyperprior on [0, 2]^M, so that each element of α has prior mean 1, the value at which the prior on π is flat. The priors on the parameters of the component distributions F^l_1, F^l_2, F^l_3 are the same across segments and proper, which is important to facilitate model selection, but noninformative, so that the posterior distribution of the parameters is dominated by the likelihood. We select the number of segments using a deviance information criterion (DIC) suggested by Celeux et al. (2006), which is more appropriate for model choice in mixture models than that introduced by Spiegelhalter et al. (2002). Adopting the notation of Frühwirth-Schnatter and Pyne (2010), the criterion DIC_4a(M) is a function of the data y, the parameters Φ of all M segments, the posterior mode of Φ, the likelihood L and the entropy EN, with expectations taken with respect to the posterior distribution. This criterion is more robust to label-switching; see Lenk and DeSarbo (2000) and Stephens (2000) for discussions of this problem. The likelihood L is that in equation (4.1), and the criterion can be computed efficiently using the output of the sampling scheme, as discussed in Frühwirth-Schnatter and Pyne (2010, Sec. 3.3). We compute DIC_4a(M) for M = 1, 2, 3, 4, and select the value of M that maximizes the criterion. We found that M = 4 was optimal for all websites, except for oldnavy.com, which had M = 3 segments.

Predictions
Parameter estimates and other inference can be readily computed using the output from the sampling scheme. We estimate the mixture model using the calibration data y, and then forecast purchase incidence and sale amount using the Bayesian predictive mass and expectation, Pr(B_i = 1|D_i, P_i, y) and E(S_i|D_i, P_i, y), for each visit i in the holdout sample of the validation study. These differ depending on whether or not the household h associated with visit i also had visits recorded in the calibration data. Nevertheless, they can be computed using the MCMC output in both cases, as outlined in the Supplementary Material. Table 3 includes the performance of these forecasts in the MIX column. For amazon.com and victoriassecret.com there is a significant increase in forecast accuracy. For the remaining sites forecasts are either not meaningfully improved or, for sales amounts at the three travel sites, slightly less accurate.

Segmentation for Oldnavy.com
Also of interest are the profiles of the market segments. In Bayesian estimation these are computed with the parameters and latent variables integrated out with respect to the posterior distribution, rather than conditional on their point estimates. For example, if Φ^l denotes the parameters of the lth segment, then the mean of the lth fitted segment is estimated by the Monte Carlo average (1/J) Σ_{j=1}^{J} (S, B, D, P)^{[j]}.
Here, each iterate (S, B, D, P)^{[j]} ∼ f(S, B, D, P | L_h = l, Φ^{l,[j]}) is generated from segment l with parameter values Φ^{l,[j]} ∼ f(Φ^l | y) obtained at sweep j of the MCMC sampling scheme.
Estimates of other moments or distributional summaries for each segment can also be computed in a similar fashion.
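The segment mean above is just a Monte Carlo average of draws simulated from the fitted segment at each retained sweep. A schematic, with hypothetical stand-in samplers since the paper's segment model is the quad-variate distribution:

```python
import numpy as np

rng = np.random.default_rng(2)

def segment_mean(draw_params, simulate, J=2000):
    """Average one simulated observation per retained MCMC sweep:
    (1/J) * sum_j (S, B, D, P)^[j], with user-supplied stand-ins."""
    sims = np.array([simulate(draw_params()) for _ in range(J)])
    return sims.mean(axis=0)

# hypothetical posterior and segment samplers, for illustration only
draw_params = lambda: rng.normal(5.0, 0.1)  # Phi^[j] ~ f(Phi | y)
simulate = lambda mu: rng.normal(mu, 1.0)   # one draw from the segment
```

Averaging simulated observations in this way automatically integrates the parameters out with respect to their posterior, as the text describes.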
To illustrate, Table 9 provides profiles for each of the three segments of oldnavy.com obtained using the calibration data. The top portion of the table reports the expectations of the four variables, and also the two ratios P/D and S/P. The first ratio is a measure of how fast a visitor progresses through the site (i.e., search velocity), while the second is a measure of a visitor's expected spend in response to page exposure. Households in segment 1 have low purchase incidence, low expected spend and progress through the pages with the highest search velocity. In comparison, households in segment 3 have the highest purchase incidence, highest expected spend and highest expected spend per page. Moe (2003) and Moe and Fader (2004) characterise online shoppers into four groups: directed buyers, deliberation visitors, hedonic browsers and knowledge-building visitors. For our oldnavy.com segmentation, segment 1 has characteristics consistent with knowledge-building behavior, while households in segment 3 exhibit goal-directed behavior that is more consistent with directed buyers and deliberation visitors. Segment 2 households have characteristics consistent with hedonic browsers.
The posterior probability that a specific household h is in segment l is estimated as

ω̂^l_h = (1/J) Σ_{j=1}^{J} Pr(L_h = l | π^{[j]}, Φ^{[j]}, y).     (4.2)

Here, {π^{[j]}, Φ^{[j]}}_{j=1}^{J} are the Monte Carlo iterates output from the MCMC scheme, and the Supplementary Material shows how to compute the conditional probability in equation (4.2).
The estimates ω̂^l_h differ for each household in the sample and should not be confused with an estimate of the probability π_l in equation (4.1), which is not household specific.
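Operationally, the household-level membership estimate is just the average over MCMC sweeps of the per-sweep conditional membership probabilities. A minimal sketch:

```python
import numpy as np

def membership_prob(resp_sweeps):
    """resp_sweeps[j, l] holds Pr(L_h = l | pi^[j], Phi^[j], y) for one
    household at sweep j; averaging over sweeps gives the posterior
    membership probability estimate for that household."""
    return np.asarray(resp_sweeps, dtype=float).mean(axis=0)
```

Because each row is a probability vector, the averaged estimate is one as well.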
To see if the behavior of households in each segment extends beyond their activity at oldnavy.com, we also construct three general internet activity variables for each household using the comScore transaction and session data. These are the total online spend, the total number of online transactions and the total number of sessions at the top 100 websites across the entire year of 2007. Using the oldnavy.com data, for each household h we also compute the posterior probability of membership of each segment, ω̂^l_h, and allocate each household to the segment with the highest probability. The bottom row of Table 9 reports the number of households allocated to each segment, and the middle portion reports the sample means of the three general internet activity variables. These show that the consumer behavior identified in our latent segmentation using the oldnavy.com data extends into similar general internet activity at the household level.

Conclusion
In this research we develop a stochastic model for website visit duration, page views, purchase incidence and sales amount. Previous work has modeled the bivariate distributions of visit duration and purchase incidence (Lin et al. 2010; Moe and Fader 2004; Montgomery et al. 2004; Van den Poel and Buckinx 2005), and visit duration and sales (Danaher and Smith 2011). However, ours is the first study to simultaneously handle all four of these key elements of online browsing and purchasing.
From a managerial perspective, we show that the two stickiness measures are important indicators of whether a sale will occur, and of the amount of the sale. This is consistent with an earlier empirical result by Montgomery et al. (2004), who were able to predict eventual purchase incidence with 40% accuracy using information from just the first 6 pages of a website visit. Interestingly, for books, digital media and travel service websites there is a value of duration that maximizes expected sales, while expected sales generally increase monotonically with page views. Managers will also be interested to learn that while much attention has been devoted to research online, buy offline (e.g., Thackston 2009), there is a parallel phenomenon of researching online prior to purchasing (also) online. Such situations are flagged by prior visits to a website, with the eventual purchase made comparatively quickly in a subsequent visit. Therefore, online retailers should not necessarily be discouraged by the high proportion of non-sale visits (Moe and Fader 2004; Venkatesh and Agarwal 2006).
We found prior online research to be especially prevalent for apparel and travel products, no doubt because such categories entail more involved purchases and higher monetary amounts than books, DVDs and songs. We find that although websites within the same product category have different expected sales as a function of duration and page views, the relationship of these two visit variables with purchase incidence is similar. Therefore, it seems most likely that the difference in sales amounts is due to the product offerings.
Our latent class segmentation for oldnavy.com identified three distinct market segments.
The smallest segment consists of more engaged customers who exhibit greater goal-directed behavior. Members of the largest segment exhibit browsing behavior, while those of the remaining segment are very low spending customers. Such segmentation is consistent with the classification discussed by Moe and Fader (2004), while profiling of the wider online activities of these households suggests that this behavior extends beyond their visits to oldnavy.com.
On the methodological front, we propose a quad-variate distribution for the website visit and sales variables that is constructed by composition from carefully selected components that fit well empirically. The framework has a number of practical advantages. First, estimation is fast, so that the approach is practical given the very large datasets that arise in studies of online retail behavior. Second, expectations of the sales variables, conditional on either visit variable separately or on both together, can be computed from the stochastic model without resorting to bivariate and trivariate numerical integration. Third, computing these expectations from the quad-variate distribution, rather than modeling them directly in a regression-style framework, accounts for the simultaneous determination of all four variables. Fourth, the model can be extended to a finite mixture of the quad-variate distributions, so that consumer heterogeneity can be accounted for at the household level. Last, we adapt the model to cope with the discrete pricing used by retailers such as apple.com. To achieve this we exploit the flexibility of the bivariate copula component to model a combination of continuous and discrete marginals; something that would otherwise be difficult.
Our validation exercise shows that the proposed model outperforms a number of alternative approaches for the websites examined. This indicates that the nonlinear dependence between the variables is better captured by the model, and that the two website stickiness measures provide valuable information when predicting purchase incidence and sales. Importantly, our model is parametric and easy to extend in a variety of directions. Future work could include adding another layer to the model to incorporate website-level covariates, as used by Bucklin and Sismeiro (2003) and Danaher et al. (2006). This is straightforward by making the parameters of the component distributions functions of the covariates, with estimation undertaken using maximum likelihood in the same manner as outlined. By using time-based covariates, this also allows for the modeling of dynamic behaviour similar to that examined by Moe and Fader (2004). In addition, following Manchanda et al. (2006), a possible further extension is a hierarchical model that incorporates household-level heterogeneity.
This would provide an alternative to latent class segmentation.

Appendix: Sales Summaries
In this appendix we show how to use the stochastic model in Section 2.2 to compute several summary measures of sales. These include the probability of a purchase and the expected spend, each conditional on only one visit variable, with the other marginalized out. For example, page views can be marginalized over as follows: Pr(B = 1|D) = (1/f(D)) Σ_{P=1,2,...} f(P, B = 1, D).
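As a toy numerical illustration of the marginalization (the table values are invented, not model output): given joint values f(P, B, D) on a small page-view grid at one fixed duration, the conditional purchase probability is the purchase column summed over page views, normalized by the duration marginal:

```python
import numpy as np

# joint[p, b] = f(P = p + 1, B = b, D = d) at one fixed duration d
joint = np.array([[0.30, 0.01],
                  [0.25, 0.04],
                  [0.20, 0.20]])

f_d = joint.sum()                    # f(D = d): sum out both P and B
prob_buy = joint[:, 1].sum() / f_d   # Pr(B = 1 | D = d)
```

In the paper's model the inner values come from the fitted component distributions, but the marginalization arithmetic is the same.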
Table 2: Component distributions in the joint distribution of (S, B, D, P) in Section 2. Apart from the truncated NBD, the parameters vary over the K page view partitions.
The predictive purchase probability can be approximated as Pr(B_i = 1|D_i, P_i, y) ≈ (1/J) Σ_{j=1}^{J} g(Φ^{[j]}, L_h^{[j]}), using the Monte Carlo iterates {Φ^{[j]}, π^{[j]}, L_h^{[j]}}_{j=1}^{J}. However, if no visits by household h are recorded in the calibration data, then L_h is not generated as part of the sampling scheme discussed above. In this case, L_h^{[j]} is generated from the multinomial distribution with parameters π^{[j]} = (π_1^{[j]}, . . ., π_M^{[j]}) from sweep j of the sampling scheme.
The expected sale amount of the visit, E(S_i | D_i, P_i, y), can be approximated using the Monte Carlo sample as the average over sweeps of E(S_i | D_i, P_i, Φ^{[j]}, L_h^{[j]}). The expectation inside this average can be computed using numerical integration, as discussed in Section 2.4. However, to repeat this J times is computationally demanding. Therefore, we employ the histogram estimate E(S_i | D_i, P_i, y) ≈ (1/J) Σ_{j=1}^{J} S^{[j]}, where S^{[j]} ∼ f(S | D_i, P_i, Φ^{[j]}, L_h^{[j]}). Let L_h^{[j]} = l; then to generate S^{[j]} we first draw B^{[j]} ∼ f(B | D_i, P_i, Φ^{l,[j]}, L_h^{[j]} = l), which is given in Section 2.4, and then S^{[j]} ∼ f(S | B^{[j]}, D_i, P_i, Φ^{l,[j]}, L_h^{[j]} = l), which can be derived simply from the bivariate copula model. As before, if no visits by household h are recorded in the calibration data, then L_h^{[j]} is generated from the multinomial distribution with parameters π^{[j]} from sweep j of the sampling scheme.
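The histogram estimate amounts to composition sampling: at each sweep draw the purchase indicator, then the sale amount given purchase, and average the draws. A schematic with an invented two-step sampler in place of the model's conditionals:

```python
import numpy as np

rng = np.random.default_rng(3)

def draw_S():
    # invented stand-in for the two-step draw: B first, then S | B
    buy = rng.random() < 0.2                           # B ~ Bernoulli(0.2)
    return float(np.exp(rng.normal(3.0, 0.5))) if buy else 0.0

def expected_sale(draw, J=20000):
    """Histogram estimate: (1/J) * sum_j S^[j]."""
    return float(np.mean([draw() for _ in range(J)]))
```

A single cheap draw per sweep replaces the J numerical integrations, at the cost of some Monte Carlo noise that shrinks with J.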
These expectations are available in closed form, as detailed in the Appendix. Starting with the expected sale amount and duration relationship (marginalizing over page views), denoted E[S|D], Figure 2(a) shows that expected sales peak at $14.50 for visits of duration 59 minutes. In contrast, a plot of expected sales against page views (E[S|P]) in Figure 2(b) shows that sales simply increase monotonically as page views increase.

Figure 2
Figure 2(c) graphs the expected sales conditional on duration when a purchase is made (i.e., E[S|D, B = 1]), showing that expected sales increase monotonically as a function of duration among just those who eventually make a purchase. Clearly, purchase incidence has a role to play. Figures 2(d) and 2(e) give the purchase probability conditional on, respectively, duration and page views. These are computed by marginalizing out the other variable, as outlined in the Appendix. Figure 2(d) reveals that, for amazon.com, purchase incidence as a function of duration (Pr(B = 1|D)) increases then declines, while Figure 2(e) shows that purchase incidence always increases as a function of page views. Hence, the reason for the differences between Figures 2(a) and 2(b) is that purchase incidence rises then declines as duration increases, but always increases as more pages are viewed. A likely reason for the effect observed in Figure 2(d) is that there are one or more segments of buyers who are goal-directed and therefore time-efficient in their purchase behavior. By contrast, members of other segments are simply browsing a website and are eventually tempted to purchase after a lengthy visit (see also Bucklin and Sismeiro 2003 and Danaher and Mullarkey 2003). Hence, Figure 2(d) is likely due to a mixing of these broad segments.
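The mixing explanation can be made concrete. With a stylized bouncer/goal-directed/browser mix (all numbers invented for illustration), the implied Pr(B = 1|D) rises and then falls even though each segment's purchase rate is constant:

```python
import numpy as np

def normal_pdf(d, mu, s):
    return np.exp(-0.5 * ((d - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

d = np.linspace(0.5, 180.0, 400)
dens = np.stack([np.exp(-d / 5.0) / 5.0,       # "bouncers": very short visits
                 normal_pdf(d, 42.0, 10.0),    # goal-directed buyers
                 normal_pdf(d, 100.0, 30.0)])  # leisurely browsers
pi = np.array([0.5, 0.2, 0.3])                 # segment shares
p_buy = np.array([0.005, 0.30, 0.05])          # per-segment purchase rates

mix = pi[:, None] * dens
prob = (p_buy @ mix) / mix.sum(axis=0)         # Pr(B = 1 | D): rises then falls
```

Short visits are dominated by low-purchase bouncers and long visits by low-purchase browsers, so the conditional purchase probability peaks at an intermediate duration, exactly the shape in Figure 2(d).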
Figure 4 shows the probability of purchase against duration in Figure 4(a) and against the number of page views in Figure 4(b). Apple.com visitors have far fewer page views on average than visitors to the other online retailers, but these convert much more rapidly into higher purchase probabilities than for the other retailers in our study. The expected sale amount is low because sales are predominantly for a single song. For visits where a purchase is made, Figures 4(c) and 4(d) plot the expected sale amount against duration and page views, respectively. The expected sale amount peaks strongly at visits of duration 4 minutes. Compare this to the relationship for retailers of books, apparel and travel services (Figures 2(c), 3(c) and 3(d)).

Figure 1 :
Figure 1: Expected sale amount at amazon.com as a function of duration of visit and number of page views (on the logarithmic scale) resulting from the parametric model. Note that the impact of duration and page views on expected sales is not additive in form.

Figure 2 :
Figure 2: Expected sale amount and purchase probabilities per visit at amazon.com and bn.com for the stochastic model. Panel (a) plots expected sale amount conditional on visit duration; Panel (b) plots expected sale amount conditional on the number of page views; Panel (c) plots expected sale amount conditional on visit duration for situations where a purchase is made (i.e., B = 1); Panel (d) is the purchase probability against the visit duration; Panel (e) is the purchase probability against the number of page views. Ninety percent confidence intervals, calculated using the bootstrap, are plotted as light shaded intervals in each panel and for both websites.

Figure 3 :
Figure 3: Relationships between sales and visitation variables for online apparel retailers in panels (a), (c) and (e), and online travel service providers in panels (b), (d) and (f). Panels (a) and (b) present the expected spend conditional on duration of visit, E(S|D). Panels (c) and (d) present the expected spend conditional on duration and on a purchase being made, E(S|B = 1, D). Panels (e) and (f) depict purchase probability conditional on the number of page views, Pr(B = 1|P).

Figure 4 :
Figure 4: Estimated relationships for apple.com. Panels (a) and (b) plot the purchase probability against visit duration and the number of page views. Panels (c) and (d) plot the expected spend against duration and number of page views for visits which result in a purchase.
Table 3 summarises the predictive accuracy of all methods over the holdout samples for all nine websites. The RMSE value is reported for SM, and percentage deviations from this value are reported for the other methods. Positive deviations correspond to increases in RMSE, and therefore lower accuracy. If the squared errors are not significantly different using a matched-pairs test at the 95% confidence level, then the percentage difference figure is given in italics.

Table 1 :
Top-ranking online retailers in the 2007 comScore panel. The retailers are ranked both by total sales (in $1000s) over the panel and by total number of purchases, expressed as a percentage of the total number of purchases observed from the panel.
with Φ the standard normal distribution function.Note that the Log-Logistic distribution is replaced by the Empirical Distribution Function in the case of apple.com to account for highly discrete pricing.

Table 3 :
Predictive performance of the different methods for all nine websites in the validation study, including the mixture model in Section 4 (MIX). The root mean square error (RMSE) of forecasts for purchase incidence B, and also sales amount S, is reported for method SM (our proposed stochastic model). For the other methods, the percentage difference between the RMSE value for each method and that for SM is reported. Positive percentages correspond to an increase in RMSE compared to method SM, and therefore lower predictive accuracy. When these are not significantly different from zero, the percentage figure is given in italics. A mixture model was not considered for the website apple.com, so no value is reported for this case.

Table 4 :
Sample sizes for the nine websites in our study.Each observation corresponds to a visit, and the number of visits that ultimately result in a purchase are given, along with those that do not.The fifth column reports the number of households from which these visits originated.The last column contains the optimal number of page view partitions identified by cross-validation for each website, as discussed in Section 2.3.

Table 5 :
Optimal page view partitions for the two book retailers, presented as two adjacent partitions per row.The sample sizes in the paired partitions are reported, broken down by observations where no sale was made, and those visits where one or more items were purchased.

Table 6 :
Estimates of the parameters of the stochastic model for amazon.com, with 95% confidence intervals given below in parentheses. To save space, only estimates for the 13 even-numbered partitions are given.

Table 7 :
Marginal pairwise Spearman dependence measures from the fitted model for amazon.com, and the Pearson sample correlations. The high Spearman dependence between S and B is due to the very strong zero-inflation of the sales data, which is commonplace in online retail.

Table 8 :
Proportion of households who visit a site within the 48 hours prior to ultimately making a purchase. The values are broken down into two groups: those where the purchase is made quickly, within duration D ≤ 10 minutes, and those made slowly, in D > 10 minutes. The final column reports the percentage difference between these two proportions.