Bandwidth selection for kernel density estimation of fat-tailed and skewed distributions

Applied researchers using kernel density estimation have worked with optimal bandwidth rules that invariably assumed that the reference density is Normal (optimal only if the true underlying density is Normal). We offer four new optimal bandwidth rules-of-thumb based on other infinitely supported distributions: Logistic, Laplace, Student's t and Asymmetric Laplace. Additionally, we propose a pseudo rule-of-thumb (ROT) bandwidth based on a Gram-Charlier expansion of the unknown reference density that is linked to the empirical skewness and kurtosis of the data. The intellectual investment needed to implement these new optimal bandwidths is practically zero. We discuss how the behaviour of these bandwidths, relative to the Normal reference ROT, is linked to differences in skewness and kurtosis. We further propose model selection criteria for bandwidth choice when the true underlying density is unknown. The performance of these new ROT bandwidths is assessed in a variety of Monte Carlo simulations as well as in two empirical illustrations, the well-known data set of annual snowfall in Buffalo, New York, and a timely example on stock market trading.


Introduction
Many profest Christians are like to foolish builders, who build by guess, and by rule-of-thumb (as we use to speak), and not by Square and Rule. James Durham (1685).
Three centuries later and counting, thumbs still rule, and the use of rules-of-thumb still characterizes much of human activity, perhaps because human agents need to optimize also with respect to the costs of information gathering and processing, not to mention that they have deadlines to meet. In kernel density estimation, rule-of-thumb (ROT) bandwidths are ubiquitous. They are so ubiquitous in fact that Silverman's [1] proposed ROT (which is derived specifically for a Normal kernel and a Normal reference density) is used even when other kernels are deployed (for example [2]). While it is well known and well studied that data-driven or plug-in bandwidths deliver superior asymptotic performance [3,4], the prevalence of ROT bandwidths in applied work remains. For example, in some of the most prestigious academic journals, ROT bandwidths are the norm when presenting data: Boguth et al. [5, Figure 1] for the density of the time-series standard deviations of a measure of excess value for a panel of US firms; Brauner et al. [6, Figure 4] for the density of the posterior median effectiveness across the sensitivity analysis of the estimated instantaneous reproduction number for COVID-19; Känzig [7, Figure 2] for the density of oil price shocks following OPEC announcements; and Le Quéré et al. [8, Figure 1] for the density of change in fossil CO2 emissions in the five years since the adoption of the Paris Climate Agreement across a range of countries.
Such dominance is undoubtedly linked to the fact that the most popular statistical languages deploy ROT bandwidths as the default, a practice that strongly reflects the appeal of ROT bandwidths: they are simple to construct, easy to code by hand, and portable across datasets. Moreover, data-driven methods are known to produce bandwidths in finite samples which lead to undersmoothing, producing higher variance estimates [9] which can be problematic for inference (tests of symmetry, independence, correct specification, etc.). To that end we ask, are there better, or more appealing, ROT bandwidths?
Several papers have focussed on ROT bandwidth selection when the kernel is changed (for example [10][11][12]). But to our knowledge, beyond Terrell [13], applied statisticians and econometricians have not investigated how changing the reference density impacts the corresponding ROT bandwidth and how this change might then impact density estimates which are estimated using such a ROT bandwidth. One may counter that this choice simply does not matter. We are cognizant of this argument, but believe it still deserves attention and formal clarification. Assessing how poorly one bandwidth may perform for a given setting when the reference density is incorrect is useful to study. Wand and Jones [14] demonstrate that the classic Sheather and Jones [15] plug-in estimator stabilizes once 2-3 iterations of the reference density estimation have been undertaken. This suggests that for direct plug-in methods (which are in some sense data-determined), the reference choice (in this case Normal) does not impact the bandwidth. However, when a zero-iteration ROT bandwidth is calculated, little insight exists on this point.
In this paper, we construct ROT bandwidths that use as reference densities three common symmetric fat-tailed distributions: Logistic, Student's-t, and Laplace. 1 All three of these densities lead to simple closed form, easy to calculate, ROT bandwidth rules that can be used at a minimum to compare one's results with Silverman's ROT based on the assumption of a Normal reference density. We assess the performance of both the 'standard' versions of these ROT bandwidths but also the adaptive variant, which is robust against the presence of outliers that may unduly affect the estimated variance in the data. We also construct a ROT when skewness is present in the data, using a specific version of the Asymmetric Laplace distribution as well as a general Gram-Charlier expansion that can be used to approximate the reference density using the empirical skewness and kurtosis present in the data.
A general finding that we present here is that reference densities with thicker tails (and slimmer bodies) than the Normal will produce smaller bandwidths than that from assuming a Normal density, which leads to less bias and more variance overall for the corresponding estimator. This is intuitive. The tails of a density require larger sample sizes to uncover the structure/shape of the tails, and the same holds for steeper slopes in the density graph. Lacking more data, the practitioner needs a smaller bandwidth to extract useful information from the tails, and to detect the steep slopes.
But this does not mean that there exists a monotonic relation between a measure like excess kurtosis and the optimal bandwidth. It does not (as we show later). Instead, what appears decisive for the size of the bandwidth is a measure like the interquartile range (IQR). By representing how narrow or wide the central part of a density graph is, the IQR carries information about the 'steepness' of the graph and hence the value of the 2nd derivative of the density, which is what largely determines the optimal bandwidth in terms of 2nd-order asymptotic mean integrated squared error (AMISE). This is perhaps why, for skewness, we find an unambiguous relationship: the optimal bandwidth decreases with skewness (for a given variance). This is because skewness, especially when it is pronounced, leads also to steep ascents or descents of the density graph.
The closest work to ours is Marron and Wand [16] who examine exact mean integrated squared error for a range of Normal mixture distributions. The Normal mixture framework allows easy calculation of the optimal bandwidth (in theory) and so exact error rates can be calculated. Our work here differs because we focus on the asymptotic performance of the optimal bandwidth as a generic smoothing device rather than as the correct bandwidth. 2 Finally, we also advocate for a model selection approach for bandwidth choice, rather than optimization of a statistical criterion, that leverages recent results of McCloud and Parmeter [17]. They show how to calculate the number of effective parameters that a bandwidth imparts on the corresponding kernel density estimator. This allows for metrics, such as the Akaike Information Criterion (AIC), to be developed to select a kernel density estimator across a variety of bandwidths.
Beyond the theoretical results, we provide a detailed set of simulations comparing the various ROT bandwidths to assess their finite sample performance along with the model selection results. The main takeaway is that once we move away from Normality (either due to skewness or excess kurtosis) Silverman's Normal-Normal ROT stops being optimal (in a MSE sense), and instead one of the new ROTs becomes a better choice, often Asymmetric Laplace. To further investigate how useful these ROTs are, we also conduct a detailed set of simulations using non-standard distributions. In these cases too, where we have various degrees of excess kurtosis, asymmetry, multimodality and peakedness, 3 the Asymmetric Laplace ROT appears best suited among the class of ROT bandwidths under consideration (evaluated through AMISE). Finally, we note that our ROT bandwidths, which can be implemented virtually at no cost, outperform in many cases data-driven and plug-in methods, which are not necessarily computationally cheap. So the ROT bandwidths we construct represent a clear improvement in the efficiency and the reliability of kernel density estimation. The Asymmetric Laplace ROT is perhaps the most serious contender to replace Silverman's ROT as the default ROT bandwidth for most data samples encountered in practice.
Our model selection results tell a somewhat different story. For underlying distributions that are either symmetric or thin-tailed, use of AIC suggests Silverman's traditional ROT is the most common winner. However, consistent with our AMISE findings, once considerable skewness or kurtosis enters the data, our new ROTs also appear as viable candidates from a model selection standpoint.
We complement our simulations with two empirical illustrations. The first looks at annual snowfall totals for Buffalo, New York. There is a debate in the literature as to whether the distribution contains one or three modes [18] and our model selection criteria suggest the two bandwidths which produce a unimodal density. In the second illustration, we look at daily stock returns for GameStop. This video-games company stock became the battleground of an intergenerational clash between investors in the beginning of 2021. This density exhibits both excess kurtosis and skewness, making it an interesting exemplar for our proposed methods. The criteria suggest bandwidths that capture the peakedness in the estimated density without excess variability or spurious modes.

Optimal bandwidths for the canonical kernel density estimator
Our focus here will be on the 2nd-order kernel density estimator of the density of x, f(x), defined as [4,14,19]:

$$\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} k\!\left(\frac{X_i - x}{h}\right), \qquad (1)$$

where k(·) is the kernel smoothing function and h is the smoothing parameter. 4 ROT optimal bandwidths are derived as the minimizers of AMISE, which stems from the sum of squared bias and variance of the estimator. For the derivations that follow, we will assume that the data sample X is independent and identically distributed (iid).
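As a concrete illustration, the estimator above can be coded in a few lines (a minimal sketch in pure Python with a Normal kernel; the helper name `gaussian_kde` is ours, not a library routine):

```python
import math

def gaussian_kde(data, h, grid):
    """Evaluate the 2nd-order kernel density estimator
    f_hat(x) = (1/(n h)) * sum_i k((X_i - x)/h), with k the Normal kernel."""
    n = len(data)
    norm = math.sqrt(2 * math.pi)
    return [sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
            / (n * h * norm) for x in grid]
```

Because the Normal kernel integrates to one, the resulting estimate integrates to one over a sufficiently wide grid, which is a quick sanity check on any implementation.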
The bias of the 2nd-order kernel density estimator in (1) is [11]

$$\mathrm{Bias}\left[\hat f(x)\right] \approx \frac{h^2}{2}\,\kappa_2(k)\, f^{(2)}(x), \qquad (2)$$

where $\kappa_j(k) = \int u^j k(u)\,du$ is the j-th moment of the kernel and f^{(2)}(x) is the second derivative of the unknown density. The variance is [11]

$$\mathrm{Var}\left[\hat f(x)\right] \approx \frac{R(k)\, f(x)}{nh}. \qquad (3)$$

We use the notation $R(g(x)) = \int g(x)^2\,dx$, and R(g^{(2)}(x)) denotes the roughness of a function. By combining Equations (2) and (3), the AMISE for the 2nd-order kernel density estimator is [11]:

$$\mathrm{AMISE}(h) = \frac{h^4}{4}\,\kappa_2(k)^2\, R(f^{(2)}) + \frac{R(k)}{nh}. \qquad (4)$$

Derivation of the optimal bandwidth
To derive a general form for the optimal bandwidth, we differentiate AMISE in Equation (4) with respect to h and set the derivative equal to zero:

$$h^{k-f}_{\mathrm{opt}} = \left[\frac{R(k)}{\kappa_2(k)^2}\right]^{1/5} \left[R(f^{(2)})\right]^{-1/5}\, n^{-1/5}. \qquad (5)$$

Here our use of the k−f superscript is nonstandard, but one that we believe will be useful with the various bandwidths we will be discussing later. Written as above, the expression clearly separates the components that multiplicatively determine the optimal bandwidth: the first depends on the choice of the kernel; the second, on the choice of the reference density; the third carries the effect on h^{k-f}_opt from the size of the sample. We will use throughout the Normal (Gaussian) kernel, as our focus is on considering different reference densities. But we do provide bandwidth factors also for the Epanechnikov kernel that will allow practitioners to implement a different kernel-reference density combination. For the Normal kernel we have that κ₂(k) = 1 and R(k) = (2√π)^{−1} ≈ 0.282. With these values the kernel-contributed factor in the optimal bandwidth expression Equation (5) becomes 0.776. Also, in order to make the expression representative of a parametric distribution family, we will use the roughness of a distribution with unitary variance σ² = 1, denoting it R(f_1^{(2)}). By standard rules of density transformation, differentiation and integration, we have that R(f^{(2)}) = σ^{−5} R(f_1^{(2)}). Consequently we obtain

$$h^{N-f}_{\mathrm{opt}} = 0.776\,\left[R(f_1^{(2)})\right]^{-1/5}\, \hat\sigma\, n^{-1/5}. \qquad (6)$$

The adaptive variant for finite samples that is robust with respect to the sample standard deviation is [1, p. 47]

$$h^{N-f}_{\mathrm{adapt}} = 0.776\,\left[R(f_1^{(2)})\right]^{-1/5}\, \min\!\left(\hat\sigma,\ \frac{\widehat{\mathrm{IQR}}}{\mathrm{IQR}_1(f)}\right)\, n^{-1/5}. \qquad (7)$$

The adaptive variant compares two measures of dispersion: the sample standard deviation, and the mark-up increase of the sample interquartile range (IQR) over the IQR of the assumed reference density with unitary standard deviation, IQR_1(f). This ROT uses the minimum of the two, guarding thus against outliers that may unduly impact the sample standard deviation.
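The two ROT rules just described can be collected in a small helper (an illustrative sketch under our own naming; `rough1` and `iqr1` are the roughness and IQR of the chosen unit-variance reference density):

```python
import math

# Kernel-contributed factor [R(k)/kappa_2(k)^2]^{1/5} for the Normal kernel
C_NORMAL = (1.0 / (2.0 * math.sqrt(math.pi))) ** 0.2   # ~0.776

def rot_bandwidth(sd, iqr, n, rough1, iqr1, adaptive=True):
    """ROT bandwidth 0.776 * rough1^{-1/5} * scale * n^{-1/5}, where rough1
    and iqr1 belong to the unit-variance reference density. The adaptive
    variant replaces the sample sd by min(sd, sample IQR / iqr1)."""
    scale = min(sd, iqr / iqr1) if adaptive else sd
    return C_NORMAL * rough1 ** (-0.2) * scale * n ** (-0.2)
```

With the Normal reference values (roughness 3/(8√π) and unit-variance IQR about 1.349), the helper reproduces the familiar 1.06 factor of the Normal-Normal ROT.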
In both ROT expressions, the higher the roughness of the reference density, the smaller the optimal bandwidth. Terrell [13] also noted this in his search to determine the maximal smoothness by finding the density with the smallest roughness, which turns out to be (see [13, Theorem 1]), for a 2nd-order kernel, the Beta(4, 4) distribution, leading to R(f_1^{(2)}) ≈ 0.144. In this regard, no density which has at least two derivatives can have roughness smaller than this distribution, which then implies an upper bound on the bandwidth for optimal smoothing (based on AMISE).

Roughness and excess kurtosis
The inverse monotonic relation between roughness and the optimal bandwidth obtained just above makes us ask (with our minds towards the other direction, that of increasing roughness and narrower bandwidths): are there any characteristics of a distribution that will signal higher roughness?
The answer is 'perhaps'. For symmetric densities, the existence of positive excess kurtosis signals higher roughness than the Normal density. But the relation between positive excess kurtosis and roughness is not monotonic. Instead we find an (inverse) monotonic relation between the IQR and roughness. A smaller IQR leads to higher roughness, and if the IQRs are close, then higher excess kurtosis leads to higher roughness. But the more critical characteristic is the IQR, and we will see a specific example where a distribution with high excess kurtosis has smaller roughness than a distribution that has much smaller excess kurtosis but lower IQR. For skewed distributions, it appears that irrespective of how the IQR evolves, higher skewness (in absolute terms) increases roughness.
In order to understand the relationship between excess kurtosis and roughness, consider an infinitely supported continuous random variable with density f_1(x) symmetric around zero and with unitary variance. Its excess kurtosis will equal

$$\gamma_2 = \int x^4 f_1(x)\,dx - 3.$$

By twice applying integration by parts to the integral, we are led to the following alternative expression,

$$\gamma_2 = \frac{1}{30}\int x^6 f_1^{(2)}(x)\,dx - 3,$$

while the roughness of a symmetric (around zero) distribution is

$$R(f_1^{(2)}) = 2\int_0^{\infty} \left[f_1^{(2)}(x)\right]^2 dx.$$

As we move towards the tails, the graph of a density with infinite support becomes necessarily convex, and so f_1^{(2)}(x) will be positive, but smaller and smaller as we move along extreme values. But the contribution of the tails to the excess kurtosis coefficient will be the accumulation of products x⁶ · f_1^{(2)}(x), where x⁶ increases for higher values of x. On the other hand, the contribution of the tails to the roughness will be the accumulation of the products f_1^{(2)}(x) · f_1^{(2)}(x), which fall much faster in value as x increases. Distributions with fat tails/high excess kurtosis will have relatively higher values of f_1^{(2)}(x) as we move towards the tails (because this puts the brakes on the reduction of the value of the density, hence a slower decay), and this indeed creates a tendency to have increased roughness also. But the excess kurtosis coefficient will increase much faster than roughness for the same series of f_1^{(2)}(x) values, on account of the factor x⁶. Alternatively, closer to the central region of the distribution, say in the (0, 1) interval, the factor x⁶ will disproportionately shrink the contribution of f_1^{(2)}(x) to the excess kurtosis coefficient. Now, if the density has a small IQR, this will produce a steeper initial decline to connect the central region with the tail region. But a 'steeper decline' means high values for f_1^{(2)}(x). Here, the value of f_1^{(2)}(x) may be negative (for concave parts of the density graph), in which case the contribution of the central region to γ₂ will be negative, while it will be positive for roughness.
But even if f_1^{(2)}(x) is everywhere positive (i.e. the density is everywhere convex, such as the Laplace distribution for example), the dampening effect of x⁶ in the central region will still make its contribution to excess kurtosis small.
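The link between the fourth moment and the second derivative is easy to check numerically. The sketch below does so for the unit-variance Laplace density, whose excess kurtosis is known to be 3, using the identity γ₂ + 3 = (1/30)∫x⁶ f^{(2)}(x) dx discussed above:

```python
import math

B = 1 / math.sqrt(2)   # scale of the unit-variance Laplace density

def laplace_f2(x):
    """Second derivative (away from the kink at zero) of the unit-variance
    Laplace density f(x) = exp(-|x|/b) / (2b)."""
    return math.exp(-abs(x) / B) / (2 * B ** 3)

# gamma_2 + 3 = (1/30) * integral of x^6 f''(x) dx; the kink at zero
# contributes nothing here because x^6 vanishes there.
dx = 1e-3
integral = 2 * sum(((i + 0.5) * dx) ** 6 * laplace_f2((i + 0.5) * dx) * dx
                   for i in range(40000))
excess_kurtosis = integral / 30 - 3
```

The midpoint sum over (0, 40) recovers the textbook value of 3 to high accuracy.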
The above discussion is in accord with the findings and the forceful argument of Westfall [20], that excess kurtosis is a meaningful measure for tail extremity but not for 'peakedness'. In symmetric distributions, peakedness is monotonically connected to the IQR: the higher the peakedness, the smaller the IQR, and hence the higher the f_1^{(2)}(x) values, and so the higher the roughness. So roughness relates primarily to peakedness (that proxies steepness) and then to tail fatness, while excess kurtosis relates primarily to tail fatness and then, if at all, to peakedness. A connection between the two exists, but not one that would lead to a monotonic relation.
To see this in a more tangible way, consider another distribution with density g_1 (with g_1 sharing the same characteristics with f_1 as regards its support, its mean and its variance) and take the difference of their excess kurtosis and of their roughness:

$$\gamma_2(f_1) - \gamma_2(g_1) = \frac{1}{30}\int x^6 \left[f_1^{(2)}(x) - g_1^{(2)}(x)\right] dx,$$

$$R(f_1^{(2)}) - R(g_1^{(2)}) = \int \left[f_1^{(2)}(x) + g_1^{(2)}(x)\right]\left[f_1^{(2)}(x) - g_1^{(2)}(x)\right] dx.$$

f_1 will have fatter tails than g_1, and so eventually its 2nd derivative should be positive and larger than the 2nd derivative of g_1. Does this imply that we will get the same inequality for roughness?
Not necessarily. If both distributions have convex densities, then their 2nd derivatives are everywhere positive. Closer to the origin we will necessarily have g_1^{(2)}(x) > f_1^{(2)}(x), so the difference term in the roughness expression will be negative, tending to reduce the overall value of the roughness difference. 5 Moreover, the contributions to the roughness for values of the variable near the tails will be small, and so, in principle, we could obtain a negative difference, R(f_1^{(2)}) − R(g_1^{(2)}) < 0. Considering alternative scenarios with regards to the concavity/convexity of the densities leads to analogous ambiguous results.

A Gram-Charlier approximation for roughness
The previous discussion is a warning against using, without caution, density approximations in order to assess the relationship between roughness and excess kurtosis. Case in point, the use of a 2nd-order Gram-Charlier type A series expansion of the density of a general distribution:

$$f(x) \approx \frac{1}{\sigma}\,\phi(z)\left[1 + \frac{\gamma_1}{6}\,He_3(z) + \frac{\gamma_2}{24}\,He_4(z)\right], \qquad z = \frac{x-\mu}{\sigma},$$

where φ is the standard Normal density, μ is the mean, σ is the standard deviation and γ₁ and γ₂ are the skewness and excess kurtosis coefficients respectively, with He₃(x) = x³ − 3x and He₄(x) = x⁴ − 6x² + 3 the 3rd and 4th order Hermite polynomials (as used in probability theory). Setting μ = 0, σ = 1, computing the second derivative of the above expression, squaring it and integrating, the roughness of a distribution with unitary variance is approximated by 6

$$R(f_1^{(2)}) \approx \frac{3}{8\sqrt{\pi}}\left[1 + \frac{35}{32}\,\gamma_1^2 + \frac{35}{48}\,\gamma_2 + \frac{385}{1024}\,\gamma_2^2\right].$$

The first correction term will be zero for symmetric distributions. From the above approximation, one would conclude that when excess kurtosis is positive, it has a positive monotonic relation with roughness. But this is not true, and as we have said, we will provide a specific counterexample shortly. What the above approximation is valid for is to conclude that excess kurtosis and skewness lead to higher roughness than that of the Normal distribution. 7 So, with distributions that have fatter tails than the Normal, or are skewed, the optimal bandwidth should be smaller than that produced by the Silverman Normal-Normal ROT. This motivates the construction of ROT bandwidths for fat-tailed and skewed distributions, a task to which we now turn.
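Rather than trusting any closed-form approximation, one can also evaluate the roughness implied by a Gram-Charlier expansion numerically (a sketch under our own helper names; for large γ₁, γ₂ the expansion need not remain a proper density):

```python
import math

def gc_density(z, g1, g2):
    """2nd-order Gram-Charlier type A expansion around the standard Normal
    with skewness g1 and excess kurtosis g2."""
    phi = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    return phi * (1 + g1 / 6 * (z ** 3 - 3 * z)
                    + g2 / 24 * (z ** 4 - 6 * z ** 2 + 3))

def gc_roughness(g1, g2, lo=-8.0, hi=8.0, m=4000):
    """Approximate R(f'') via central differences and the rectangle rule."""
    dz = (hi - lo) / m
    total = 0.0
    for i in range(1, m):
        z = lo + i * dz
        f2 = (gc_density(z - dz, g1, g2) - 2 * gc_density(z, g1, g2)
              + gc_density(z + dz, g1, g2)) / dz ** 2
        total += f2 * f2 * dz
    return total
```

At γ₁ = γ₂ = 0 the routine recovers the standard Normal roughness 3/(8√π) ≈ 0.2116, and small positive skewness or excess kurtosis pushes the value up, in line with the approximation above.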

Optimal bandwidths under alternative symmetric distributions
We will consider three well known symmetric distributions that have positive excess kurtosis: Logistic, Laplace, and Student's t with 5 degrees of freedom. Their excess kurtosis coefficients are, respectively, 1.2 for the Logistic, 3 for the Laplace, and 6 for t(5). We chose them to cover a range from mild to high excess kurtosis values. Our goal is to calculate h^{N-f}_opt for these distributions as reference densities, always for a 2nd-order Gaussian kernel, and to determine if any of these may be suitable for general use as a ROT bandwidth, against the benchmark ROT bandwidth that is based on the Normal distribution, for which we have roughness R(f_1^{(2)}) = 3/(8√π) ≈ 0.2116 and optimal bandwidth h^{N-N}_opt = 1.06 σ̂ n^{−1/5} (see [1, p. 45]). As has been discussed in the Introduction, we are unaware of a ROT bandwidth that attempts to account for the presence of excess kurtosis (or skewness) in the data. Silverman himself may have contributed to this by asserting that his Normal-Normal ROT optimal bandwidth, especially its 'adaptive' variant, fared well even in the presence of skewness and excess kurtosis in the data, providing only summary indicative results to support this assertion [1, p. 46-48]. While asymptotically the differences will dissipate, it is instructive to learn if, and how, finite sample differences appear. This analysis is also useful more generally as the bandwidths we derive are optimal when the true density is in fact Logistic, Laplace, or Student's t(5), respectively.

Logistic
The most widely used Logistic distribution has density

$$f(x) = \frac{e^{-x/s}}{s\left(1 + e^{-x/s}\right)^{2}}.$$

The roughness of the Logistic density is R(f^{(2)}) = 1/(42 s⁵), and its variance is σ² = s²π²/3. To standardize, we set s = √3/π. The standardized roughness is

$$R(f_1^{(2)}) = \frac{\pi^5}{42\cdot 3^{5/2}} \approx 0.4674.$$

This is more than double the roughness of the standard Normal density (which is 0.2116). For the adaptive variant of the ROT bandwidth, the interquartile range of the Logistic distribution with unitary variance is IQR = 1.211.

Laplace
The Laplace distribution has density

$$f(x) = \frac{1}{2b}\, e^{-|x|/b}.$$

The roughness of the Laplace density is R(f^{(2)}) = 1/(4b⁵). The variance of the Laplace distribution is σ² = 2b², so to standardize it we need to set b = 1/√2, and we obtain

$$R(f_1^{(2)}) = \frac{2^{5/2}}{4} = \sqrt{2} \approx 1.4142,$$

almost seven times higher than the roughness of the standard Normal, and three times higher than the roughness of the Logistic. The interquartile range for unitary variance is IQR = √2 ln 2 ≈ 0.980.

Student's t
Let p denote the (integer) degrees of freedom. The density of the Student's-t distribution can be written as

$$f(x) = \frac{\Gamma\!\left(\frac{p+1}{2}\right)}{\sqrt{p\pi}\,\Gamma\!\left(\frac{p}{2}\right)}\left(1 + \frac{x^2}{p}\right)^{-\frac{p+1}{2}},$$

and its roughness can be obtained in closed form using standard Gamma-function integrals. This basic version of the Student's-t distribution implicitly has a scale parameter that is set equal to unity, and so cannot have unitary variance: for integer degrees of freedom p > 2 its variance σ² = p/(p − 2) always exceeds 1. The appropriate scaling to obtain the roughness for unitary variance is to multiply the expression by [p/(p − 2)]^{5/2}, and for p = 5 we obtain

$$R(f_1^{(2)}) \approx 0.7300.$$

But this is smaller than the roughness of the Laplace, which has half the excess kurtosis of t(5) (for the same variance). This is the counterexample that shows that the relation between excess kurtosis and roughness is not monotonic. The IQR of t(5) (scaled to have unitary variance) is IQR ≈ 1.1257.

We collect in Table 1 our results, where we also include the value of each distribution at the mode, f_1(mode), and of course, the factors for the optimal bandwidths per case. For convenient reference and application, we have also included the optimal bandwidths when the Epanechnikov kernel is used, which has κ₂(k) = 0.2 and R(k) = 0.6, resulting in a kernel-contributed factor in the optimal bandwidth of 1.719. The last two columns of Table 1 are the alternative optimal ROT bandwidths to be used in applied work, multiplied either by σ̂ n^{−1/5}, or by the adaptive variant of this factor, see Equation (7).
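The standardized roughness values discussed in these subsections are easy to verify numerically (a sketch; the t(5) second derivative below is obtained by directly differentiating the density twice):

```python
import math

def roughness(f2, lo, hi, m=20000):
    """R(f'') = integral of f''(x)^2 dx by the midpoint rule."""
    dx = (hi - lo) / m
    return sum(f2(lo + (i + 0.5) * dx) ** 2 for i in range(m)) * dx

# Logistic, standardized with s = sqrt(3)/pi: R(f1'') = 1/(42 s^5)
s = math.sqrt(3) / math.pi
r_logistic = 1 / (42 * s ** 5)

# Laplace, standardized with b = 1/sqrt(2): R(f1'') = 1/(4 b^5) = sqrt(2)
r_laplace = 1 / (4 * (1 / math.sqrt(2)) ** 5)

# t(5): differentiate the scale-1 density twice, integrate numerically,
# then rescale to unit variance by [p/(p - 2)]^{5/2} = (5/3)^{5/2}
c = 2 / (math.sqrt(5 * math.pi) * math.gamma(2.5))   # normalizing constant
def t5_f2(x):
    u = 1 + x * x / 5
    return -(6 * c / 5) * u ** -5 * (1 - 7 * x * x / 5)
r_t5 = roughness(t5_f2, -60.0, 60.0) * (5 / 3) ** 2.5

def bw_factor(rough1):
    """Normal-kernel bandwidth factor 0.776 * rough1^{-1/5}."""
    return (1 / (2 * math.sqrt(math.pi))) ** 0.2 * rough1 ** -0.2
```

The resulting bandwidth factors reproduce the orderings reported above: the Laplace factor is the smallest, followed by t(5), the Logistic, and then the Normal benchmark.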
The table shows clearly the monotonic relation between the density value at the peak and the IQR, and of the IQR with roughness (and so with the optimal bandwidth). Note also that the roughness of the Student's t(5) is much higher than the roughness of the Logistic, even though their IQRs are relatively close; here the high excess kurtosis of t(5) amplifies the increase in roughness. But between t(5) and the Laplace, the effect of the excess kurtosis is no match for the effect of the much increased peak and reduced IQR, which makes the roughness of the Laplace almost double that of the t(5). Apart from that, we see that the optimal bandwidth narrows considerably as we move down the rows compared to the Normal-Normal ROT bandwidth: it is 15% smaller for the Logistic, 22% smaller for t(5) and 32% smaller for the Laplace distribution. 8 A way to understand why the IQR will not necessarily decrease as excess kurtosis increases is to consider a 4th-order Cornish-Fisher expansion of the quantile function to express the IQR of a distribution. There one would find, first, that as excess kurtosis increases, the IQR tends unambiguously to decrease through the expansion terms that include the excess kurtosis coefficient, but also that the IQR tends to increase as the 6th moment (which is also present in the expansion) increases. But the 6th moment increases when the 4th increases (which relates to excess kurtosis). If the structure of the distribution is such that the 6th moment increases disproportionately compared to the increase of the 4th, then we may end up seeing higher excess kurtosis but a higher IQR also, and hence lower roughness.
Next we will see that an even stronger determinant of roughness than IQR is skewness, where even though IQR increases, the roughness increases because skewness increases.

An optimal bandwidth when the distribution is skewed
In this section, we construct a ROT optimal bandwidth for data samples that exhibit skewness, which is a frequent phenomenon. Unlike excess kurtosis, skewness is usually dependent on a shape parameter of the distribution even after we standardize the variance (the shape parameter allows symmetry when it takes some specific value). This means that in order to compute the ROT optimal bandwidth, we will need not only to estimate the standard deviation of the sample, but also the sample skewness coefficient, in order to recover an estimate for the shape parameter.
We will base our ROT on a version of the Asymmetric Laplace distribution, which is generated as follows. Consider two independent Exponential random variables Z₁ and Z₂ with scale parameters σ₁ = θ/τ and σ₂ = θ/(1 − τ), respectively, for θ > 0 and τ ∈ (0, 1). Then the distribution of their difference X = Z₁ − Z₂ has density

$$f(x) = \frac{\tau(1-\tau)}{\theta}\begin{cases} e^{-\tau x/\theta}, & x \ge 0,\\[4pt] e^{(1-\tau)x/\theta}, & x < 0.\end{cases}$$

For τ = 1/2, we recover the Laplace distribution with scale parameter b = 2θ. The mean and variance of this distribution are

$$E(X) = \frac{\theta(1-2\tau)}{\tau(1-\tau)}, \qquad \sigma^2 = \theta^2\,\frac{\tau^2 + (1-\tau)^2}{\tau^2(1-\tau)^2},$$

while its skewness coefficient is

$$\gamma_1 = \frac{2\left[(1-\tau)^3 - \tau^3\right]}{\left[\tau^2 + (1-\tau)^2\right]^{3/2}}.$$

By computing the sample skewness, we can solve the above for τ, and use it to compute the roughness and the optimal bandwidth. Note that if τ ≠ 1/2, the density we will use produces a non-zero mean, but this will not affect the roughness or the optimal bandwidth. For the distribution standardized to have unitary variance, the roughness is

$$R(f_1^{(2)}) = \frac{\left[\tau^3 + (1-\tau)^3\right]\left[\tau^2 + (1-\tau)^2\right]^{5/2}}{2\,\tau^3 (1-\tau)^3}.$$

We see that τ and (1 − τ) are exchangeable in the above expression, meaning that the roughness depends only on the absolute value of the skewness. In Table 2, we present the values for skewness, τ, mode, IQR, standardized roughness, and optimal bandwidths for values of skewness in [0, 1.9]. Given that τ and (1 − τ) are exchangeable up to a sign change in the expression for the skewness (so the absolute value of skewness is symmetric around τ = 0.5), it follows that τ(−γ₁) = 1 − τ(γ₁). So we need only present the values for positive skewness. If our sample has, say, skewness γ₁ = −0.9, the corresponding value for the distribution parameter will be τ = 1 − 0.387 = 0.613. The roughness and the optimal bandwidth depend only on the absolute value of the skewness.
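In practice, a researcher recovers τ from the sample skewness by simple inversion. The sketch below codes the skewness and roughness maps of the difference-of-Exponentials construction above and inverts the former by bisection (our own helper names):

```python
import math

def al_skew(tau):
    """Skewness of the Asymmetric Laplace arising as Z1 - Z2 above."""
    a, b = 1 - tau, tau
    return 2 * (a ** 3 - b ** 3) / (a * a + b * b) ** 1.5

def al_roughness(tau):
    """R(f1'') of the unit-variance Asymmetric Laplace."""
    a, b = 1 - tau, tau
    return (a ** 3 + b ** 3) * (a * a + b * b) ** 2.5 / (2 * a ** 3 * b ** 3)

def tau_from_skew(g1, tol=1e-10):
    """Invert the skewness map by bisection (al_skew decreases in tau)."""
    lo, hi = 1e-9, 1 - 1e-9
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if al_skew(mid) > g1:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

At zero skewness the inversion returns τ = 1/2 and the symmetric Laplace roughness √2, while a sample skewness of 0.9 gives τ ≈ 0.387, matching Table 2.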
We see in Table 2 that as the strength of skewness increases, so does the peak, the IQR, but also the roughness. Intuitively, skewness leads to the long tail 'stretching' relatively more than how much the short tail contracts, leading to higher IQR. It also leads to a steeper slope near the origin that leads to increased roughness. Skewness appears to be the stronger of the various characteristics of a distribution that we have considered, as regards to its unambiguous effect on roughness, and hence on the optimal bandwidth. In the specific reference distribution, skewness adds roughness on top of the roughness of the symmetric Laplace distribution.

Efficiency of alternative parametric families
Beyond the construction of these optimal bandwidths, we can also investigate how well the estimation of the density compares across alternative densities when alternative 'optimal' bandwidths are used.

Alternative ROT efficiency
Plugging the optimal bandwidth expression Equation (5) into the AMISE expression (4), the minimized AMISE under correct specification of the reference density is

$$\mathrm{AMISE}\left(h^{k-f}_{\mathrm{opt}}\right) = \frac{5}{4}\left[\kappa_2(k)^2\, R(k)^4\, R(f^{(2)})\right]^{1/5} n^{-4/5}.$$

Using the Normal kernel values produces

$$\mathrm{AMISE}\left(h^{N-f}_{\mathrm{opt}}\right) \approx 0.454\left[R(f^{(2)})\right]^{1/5} n^{-4/5}.$$

It is clear that the higher the roughness, the higher the minimized AMISE. Since roughness exceeds the Normal roughness when we have excess kurtosis, we get that distributions with slimmer bodies and thicker tails are harder to estimate from the standpoint of mean squared error. The same holds for skewed distributions compared to symmetric ones, especially to their symmetric version in their own distribution family.
But this does not mean that going for the lower AMISE assuming correct specification, is beneficial if, in this way, we misspecify the reference density: misspecification leads to even higher AMISE, as we show next.
Looking at losses of efficiency due to misspecification of the reference density is in some sense more informative than examining similar efficiency losses predicated on kernel choice [21], since the reference density is purely at the user's discretion. The AMISE formula when an alternative optimized, yet misspecified, bandwidth is used (say the Laplace optimal bandwidth for data from a Normal density) becomes (see Technical Appendix)

$$\mathrm{AMISE}\left(h^{k-g}_{\mathrm{opt}}\right) = \frac{1}{4}\left[\kappa_2(k)^2\, R(k)^4\, R(f^{(2)})\right]^{1/5}\left(R_1^{-4/5} + 4 R_1^{1/5}\right) n^{-4/5}, \qquad (16)$$

where we have defined the relative roughness ratio R₁ = R(g_1^{(2)})/R(f_1^{(2)}), with g_1 the assumed density and f_1 the true density. We are interested in the relative efficiency

$$\mathrm{Eff} = \frac{\mathrm{AMISE}\left(h^{k-g}_{\mathrm{opt}}\right)}{\mathrm{AMISE}\left(h^{k-f}_{\mathrm{opt}}\right)} = \frac{R_1^{-4/5} + 4 R_1^{1/5}}{5}.$$

We mention here that the comparison of AMISE between different choices of the reference density is independent of the kernel, and hence whether we were to use a Normal kernel or the Epanechnikov kernel (as an example) would have no effect on our results. It is easy to determine that the relative efficiency is minimized when g_1^{(2)} = f_1^{(2)} (so that R₁ = 1), where it takes the value 1, while Eff → ∞ as R₁ → 0 or R₁ → ∞. But the effects of misspecification are not symmetric, and this can be seen in Table 3, where we tabulate the efficiency ratios of the distributions we are considering (the variance in all cases is standardized to unity).
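Because the efficiency ratio depends only on the relative roughness R₁, it can be tabulated with a one-line helper (a sketch using the standardized roughness values derived earlier):

```python
import math

def efficiency(r_assumed, r_true):
    """AMISE using the ROT of an assumed reference density, relative to the
    correctly specified optimum; depends only on R1 = R(g1'')/R(f1'')."""
    r1 = r_assumed / r_true
    return (r1 ** -0.8 + 4 * r1 ** 0.2) / 5

R_NORMAL = 3 / (8 * math.sqrt(math.pi))   # ~0.2116
R_LAPLACE = math.sqrt(2)                  # ~1.4142
```

This reproduces the asymmetry discussed in the text: using the Laplace ROT on Normal data (undersmoothing) costs about 21%, while using the Normal ROT on Laplace data (oversmoothing) costs about 46%.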
What we observe in Table 3 is that oversmoothing is worse than undersmoothing, in terms of efficiency loss. For example, if the true density is Normal, but we use the Laplace ROT bandwidth (undersmoothing), the efficiency ratio is 1.214. But when the true density is Laplace and we use the Normal ROT bandwidth (oversmoothing), the efficiency ratio climbs to 1.463. This holds for all entries in the Table. It provides formal motivation to not use the Normal ROT bandwidth when the data exhibit excess kurtosis or narrow IQR, but a ROT based on a density with higher roughness, even if we end up undersmoothing. The informal motivation for undersmoothing comes from Silverman himself [1, p. 43], who sensibly observed that it is better to undersmooth than oversmooth because the eye of the observer can smooth much more easily than 'unsmooth'. 10 In fact mentally 'unsmoothing' is practically impossible because there is no information left in the oversmoothed density to guide such 'unsmoothing'. This problem is similar to the case of a linear against a nonlinear graph: a non-linear graph guides the eye towards its linear ('smoothed') version, but the same linear graph may be the result of approximating any one of very different nonlinear graphs. The counterargument in favour of oversmoothing is, to quote Terrell [13, p. 472], 'An undersmoothed density estimate tends to display features such as asymmetries and multiple modes that could have come about by chance. By using the most smoothing that is compatible with the scale of the problem, we tend to eliminate accidental features'.

The effects of estimating the standard deviation
We close this section by noting that all its derivations and computations have been conditional on the sample standard deviation (or on the assumption that the true standard deviation is known a priori). Essentially we were elaborating on the sample-specific AMISE. A further refinement would be to use the sample standard deviation \hat\sigma in the optimal bandwidth expression (because this is how we would actually compute it), but to keep the true standard deviation \sigma alongside the roughness of the true standardized distribution in the correctly specified AMISE, the misspecified AMISE and the relative efficiency expressions, because this is what characterizes the true data generating process. In such a case, Equation (16) would become (see Technical Appendix)

AMISE = \frac{R(K)}{n h_{f_1}} \left[ \frac{\sigma}{\hat\sigma} R_1^{1/5} + \frac{1}{4}\left(\frac{\hat\sigma}{\sigma}\right)^{4} R_1^{-4/5} \right].

In this way, one would also take into account the consequences of inaccuracy in the estimation of \sigma. Minimizing this AMISE expression with respect to (\hat\sigma/\sigma), we find that it has an optimal value, (\hat\sigma/\sigma)^* = R_1^{1/5}, at which we recover the minimum AMISE under correct specification. While this is not feasible, as we do not know the true standard deviation in the first place, it gives us the following qualitative results: if in reality we are over-smoothing, R(g_1^{(2)}) < R(f_1^{(2)}), we would want the sample standard deviation to under-estimate the true value; but if we are under-smoothing, R(g_1^{(2)}) > R(f_1^{(2)}), we would want the sample standard deviation to over-estimate the true value, in order to mitigate the loss of efficiency in AMISE terms. In some cases this could be useful guidance in deciding whether to correct the sample standard deviation for bias or not.
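The optimal-(\hat\sigma/\sigma) claim can be checked numerically. Writing the bracketed misspecified-AMISE factor from the standard AMISE algebra as a(x) = R_1^{1/5}/x + x^4 R_1^{-4/5}/4 with x = \hat\sigma/\sigma (a sketch, up to the common scale R(K)/(n h_{f_1})), a grid search recovers the minimizer x^* = R_1^{1/5} and the correctly specified minimum factor 5/4:

```python
import numpy as np

def amise_factor(x, R1):
    """Misspecified AMISE as a function of x = sigma_hat / sigma,
    up to the common scale R(K) / (n * h_f)."""
    return R1 ** 0.2 / x + 0.25 * x ** 4 * R1 ** (-0.8)

for R1 in (0.5, 1.0, 4.0):
    xs = np.linspace(0.2, 3.0, 200_001)
    x_star = xs[np.argmin(amise_factor(xs, R1))]
    assert abs(x_star - R1 ** 0.2) < 1e-3          # minimizer is R1**(1/5)
    assert abs(amise_factor(R1 ** 0.2, R1) - 1.25) < 1e-9
    print(f"R1 = {R1}: optimal sigma_hat/sigma = {x_star:.4f}")
```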

Model selection across bandwidths
Rather than select h such that AMISE is minimized, we could turn our attention to a model selection approach. Typically, model selection would seek to optimize some pre-specified criterion and then penalize this based on the complexity of the model, in most cases the number of parameters. As nonparametric kernel density estimators do not have 'parameters', it is not obvious how such a correction would proceed. Recently, McCloud and Parmeter [17] have demonstrated how to calculate the effective parameters from a kernel density estimator for a given bandwidth. 11 This opens up the use of traditional model selection criteria for use with alternative bandwidths.
While there are many model selection criteria, we focus on AIC. AIC, for a given bandwidth, selects the model which has the smallest value of

AIC(h) = -2 \sum_{i=1}^{n} \log \hat f_h(x_i) + 2\,\mathrm{tr}(H),

where H is the n × n hat matrix for the kernel density estimator with (i, j) element defined as [17]

H_{ij} = \frac{K\left((x_i - x_j)/h\right)}{\sum_{k=1}^{n} K\left((x_i - x_k)/h\right)}.

H has useful geometric properties. It is symmetric with entries bounded between 0 and 1, and rows that sum to 1. The use of tr(H) to calculate the number of parameters is directly linked to the size of the bandwidth; in effect, as h → ∞ the number of parameters goes to 1, and as h → 0 the number of parameters goes to n.
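A compact illustration of the effective-parameter idea. The sketch below assumes the row-normalized Gaussian-kernel weight matrix H_ij = K((x_i − x_j)/h) / Σ_k K((x_i − x_k)/h) — our reading of the construction in [17], not necessarily its exact form — and checks the two limits of tr(H):

```python
import numpy as np

def hat_matrix(x: np.ndarray, h: float) -> np.ndarray:
    """Row-normalized Gaussian-kernel weight matrix for a KDE with
    bandwidth h (assumed form of the hat matrix; see the text)."""
    d = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * d ** 2)              # kernel constants cancel on division
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(42)
x = rng.normal(size=100)

# tr(H) = effective number of parameters: -> 1 as h -> inf, -> n as h -> 0.
for h in (1e3, 1.0, 1e-3):
    tr = np.trace(hat_matrix(x, h))
    print(f"h = {h:>8}: tr(H) = {tr:8.3f}")

H = hat_matrix(x, 1.0)
assert np.allclose(H.sum(axis=1), 1.0)     # rows sum to one
assert H.min() >= 0.0 and H.max() <= 1.0   # entries bounded in [0, 1]
```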

Simulated performance of alternative ROTs
We consider two distinct sets of simulations. First, we generate data from each of the parent distributions used to construct the optimal bandwidths and determine how well each performs when used (potentially) erroneously for a different distribution. We consider 1000 Monte Carlo simulations for sample sizes n ∈ {50, 100, 200, 400, 800}. For each simulation we estimate the kernel density estimator using each of the proposed ROT bandwidths over a grid of 100 points. Our second set of simulations mimics the first, except that we choose distributions from the suite of densities discussed in Marron and Wand [16]. Specifically, we focus attention on the Skewed, Kurtotic, Bimodal, Separated Bimodal, Asymmetric Bimodal, Claw and Asymmetric Claw densities. 12 These densities take an array of nonstandard shapes relative to the more familiar parent densities used to derive the ROT bandwidths. We use the command npudensbw() in the np package [22] in R to calculate the kernel density estimates for all simulations, along with a 2nd-order Gaussian kernel for all computations. We compare each bandwidth selection method via mean squared error (Tables 4 and 5) and via our model selection criteria (Tables 6 and 7). Table 4 presents the mean (over the 1000 simulations) of the mean squared error evaluated over the 100 grid points. There are several key takeaways from the results. First, the Normal ROT does well under symmetry and for a mildly slim IQR, up to the Student's-t(5). When the data exhibit skew or high peakedness, however, the rougher-reference ROTs perform better. We also considered adaptive versions of all of our ROT bandwidths; as the results are generally worse than the non-adaptive ones, they are not presented here (but are available upon request). The main takeaway for applied work is that it is feasible to use the Normal ROT for data which are symmetric and have a mildly slim IQR, but one should consider the use of the Laplace or Asymmetric Laplace ROT when the data have skew or high peakedness. Table 5 presents the mean MSE from these simulations.
Here we again consider n ∈ {50, 100, 200, 400, 800}. We immediately see that for all of the proposed ROT bandwidths under study, the mean MSE decreases by nearly 50% as the sample size doubles. This is consistent with the MSE consistency of the kernel density estimator in general. A noticeable outcome of these simulations is that the proposed Asymmetric Laplace ROT (aLap) has the most 'wins' relative to the other ROT proposals. As our theory of the optimal bandwidth dictated, in the presence of extreme skewness or kurtosis, a superior ROT bandwidth exists, namely the Asymmetric Laplace.
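The skeleton of this simulation design is easy to reproduce. The sketch below (in Python rather than the R/np implementation used in the text) computes the Monte Carlo mean MSE of a Gaussian-kernel estimate over a 100-point grid for unit-variance Laplace data. The rules compared are Silverman's Normal ROT and a deliberately smaller placeholder bandwidth standing in for a rougher-reference ROT (the paper's Laplace constant is not reproduced here):

```python
import numpy as np
from scipy.stats import laplace, norm

def kde(grid, x, h):
    """Gaussian-kernel density estimate evaluated on a grid."""
    u = (grid[:, None] - x[None, :]) / h
    return norm.pdf(u).mean(axis=1) / h

def mean_mse(rule, dist, n, n_sim=200, seed=0):
    """Monte Carlo mean MSE over a 100-point grid, mirroring the
    simulation design described in the text."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-4.0, 4.0, 100)
    f = dist.pdf(grid)
    out = []
    for _ in range(n_sim):
        x = dist.rvs(size=n, random_state=rng)
        out.append(np.mean((kde(grid, x, rule(x)) - f) ** 2))
    return float(np.mean(out))

# Silverman's Normal ROT; the 0.8 factor below is a placeholder for a
# rougher-reference (e.g. Laplace-based) ROT, not the paper's constant.
normal_rot = lambda x: 1.06 * x.std(ddof=1) * len(x) ** -0.2
rough_rot  = lambda x: 0.8 * normal_rot(x)

lap = laplace(scale=1 / np.sqrt(2))        # unit-variance Laplace
for rule, name in ((normal_rot, "Normal ROT"), (rough_rot, "rougher ROT")):
    print(f"{name}: mean MSE = {mean_mse(rule, lap, n=200):.5f}")
```

Doubling the number of draws in each call reproduces the MSE-consistency pattern noted above: mean MSE falls at roughly the n^{-4/5} rate.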

Versus data driven bandwidths
Although our primary interest is in comparison to existing ROT bandwidths, it is also of interest to see comparisons to data driven methods. We consider the same set of simulations run for Table 4 in Table A1 as well as for Table 5 in Tables A2 and A3. 13 Recall that the former set was for our standard distributions (e.g. Gaussian) while the latter set was for the non-standard distributions [16]. We consider the two most popular methods, least-squares cross-validation (LSCV) and direct plug-in [15, labelled SJ]. For completeness, we also report bandwidths obtained from our Gram-Charlier (GC) approximation via Equation 9 (replacing the unknown moments by their sample estimates). 14 The results from Table A1 look nearly identical to those in Table 4. Three additional columns are added, but the winner continues to be a particular ROT bandwidth. This is to be expected, as these distributions are well behaved and data driven methods often undersmooth. There are seven different distributions in Tables A2 and A3. The SJ bandwidths dominate in two of them (Kurtotic and Separated Bimodal). LSCV bandwidths dominate for larger samples in another two scenarios (Claw and Asymmetric Claw). If there is any surprising result, it is for the Kurtotic distribution: here we expected one of the ROT bandwidths to have picked up the fat tails better than a plug-in bandwidth. The Separated Bimodal is understandable, as none of the ROT bandwidths account for bimodality (especially such separated modes). Similarly, the Claw and Asymmetric Claw are highly variable densities with multiple modes, and LSCV bandwidths tend to undersmooth and hence mimic the data better.
Overall, we see that our ROT bandwidths often outperform data driven bandwidth procedures. Typically the data driven methods show improvements when the underlying density is multimodal. It may be useful to consider extending our bandwidths to account for densities that are not single peaked.

Performance via model selection criteria
The results above are useful, but these are essentially oracle results because the underlying distribution is known. In practice, we will not know the true underlying distribution. Therefore, we consider a method for selecting a reference (or data driven) bandwidth when the true underlying density is unknown. In this subsection, we look at the same sets of densities as before, but through the lens of Section 5.3. Tables 6 and 7 are analogous to Tables 4 and 5, respectively. Each value in the table represents the percentage of time the AIC criterion picks a given bandwidth for a given sample size for a given underlying density amongst the set of candidate bandwidths. 15 Table 6 presents the percentage of time (over the 1000 simulations) that the model selection criterion (AIC) picks a given bandwidth. These results are somewhat striking. Regardless of the underlying distribution, the model selection criterion picks the Normal ROT bandwidth. The Online Supplement shows this to be the case for each of the model selection criteria. The BIC criterion picks the Normal ROT nearly 100% of the time for n ≥ 100.
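In practice the selection step is a simple loop over candidate bandwidths. The sketch below assumes a likelihood-based AIC with tr(H) effective parameters, AIC(h) = −2 Σ_i log f̂_h(x_i) + 2 tr(H), together with a row-normalized Gaussian-kernel weight matrix — our reading of the construction in [17], not necessarily the exact criterion used there — and the candidate bandwidth with the smallest AIC wins:

```python
import numpy as np

def kernel_weights(x, h):
    """Row-normalized Gaussian-kernel weight matrix (assumed hat matrix)."""
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return K / K.sum(axis=1, keepdims=True)

def aic(x, h):
    """Likelihood-based AIC with tr(H) effective parameters (assumed form)."""
    n = len(x)
    fhat = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2).sum(axis=1)
    fhat /= n * h * np.sqrt(2.0 * np.pi)   # Gaussian-kernel KDE at the data
    return -2.0 * np.log(fhat).sum() + 2.0 * np.trace(kernel_weights(x, h))

rng = np.random.default_rng(7)
x = rng.standard_t(df=5, size=200)

# Candidate bandwidths: the 0.8x / 1.2x scalings stand in for the
# alternative reference constants, which are not reproduced here.
silverman = 1.06 * x.std(ddof=1) * len(x) ** -0.2
candidates = {"0.8x": 0.8 * silverman,
              "Normal ROT": silverman,
              "1.2x": 1.2 * silverman}

best = min(candidates, key=lambda k: aic(x, candidates[k]))
print("AIC-selected bandwidth:", best, f"({candidates[best]:.3f})")
```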

Non-standard distributions
For the [16] distributions,

Versus data-driven bandwidths
In practice, practitioners have access to data driven methods as well. We created tables analogous to Tables A1-A3, but for the model selection criteria. For example, Table B11 in the Online Supplement gives the percentage of time the AIC criterion picked a given bandwidth procedure over the 1000 simulations for the standard distributions, including both ROT and data driven methods (GC, LSCV and SJ). For an underlying Gaussian distribution, AIC picked LSCV for each sample size. It also picked LSCV for n = 50 for an underlying Logistic or t 5 distribution. For every other case, the Normal ROT produced the lowest value for AIC. The results were nearly identical for the remaining selection criteria. For the [16] data, there is substantial heterogeneity. For example, Table B16 in the Online Supplement gives the AIC criterion for the same set of [16] distributions. The Skewed distribution again suggests a Normal ROT. The Kurtotic distribution leans towards an Asymmetric Laplace bandwidth, while the Bimodal distribution (and the Asymmetric Bimodal distribution for n ≥ 200) leans towards the Logistic bandwidth. The Separated Bimodal distribution suggests that the SJ bandwidth would be appropriate, while the Claw and Asymmetric Claw distributions suggest LSCV for smaller sample sizes and SJ for larger sample sizes. Similar results hold for AICc, GCV and RICE. The BIC criterion (Table B20) often suggests the GC bandwidth (the only table where that bandwidth shows prominence).

What did we learn from these simulations
When attempting to minimize AMISE, using the correct bandwidth selector for the underlying density generally appears to be acceptable. However, in practice, we do not know the true underlying density. When we move to model selection criteria, if the distribution is relatively well behaved, the Normal ROT appears to do well. However, with data that are less well behaved, which we often obtain in practice, different methods may work in different scenarios. It therefore seems prudent to use model selection criteria in practice to help determine which bandwidth is best suited for a particular dataset.

Illustrations
In this section, we apply both existing and proposed bandwidth selectors to two separate datasets. Our first example is the well studied annual Buffalo snowfall data [23]. This example is interesting as there is a debate in the literature [24] on the number of modes in the density (1 vs 3). It is well known that the number of modes here is tied to the bandwidth [18]. In our second example, we look at daily stock returns. This type of data is particularly relevant as it is well known that daily stock prices are often characterized by skewness [25] and have heavy tails [26].
We note here that it is arguable that the first dataset is i.i.d., whereas the second is clearly not. Our assumptions call for i.i.d. data, but we are curious to see how they perform with time dependency.

Yearly snowfall
The annual Buffalo snowfall data is well studied in density estimation [24]. There is a debate in the literature as to whether the density is unimodal or trimodal, and that conclusion is directly tied to the bandwidth. Here we propose to use existing and our alternative bandwidth selection criteria to estimate this density. We further use our model selection criteria to attempt to reach a conclusion on the number of modes in the density.
Yearly snowfall values (to the nearest tenth of an inch) for Buffalo were recorded from 1910 to 1972. For this dataset, we have an IQR of 27.3750, a skewness of 0.0366 and an excess kurtosis of −0.5942 (i.e. the density is platykurtic). Figure 1 presents the kernel density estimate for our sample of 63 observations using the ubiquitous Normal ROT, along with each of our newly derived rules-of-thumb bandwidths as well as data-driven methods like least-squares cross-validation (LSCV), and the direct plug-in method (SJ). 16 A close look shows that the Normal and GC bandwidths produce a unimodal estimated density. The remaining bandwidths either produce a multimodal density or a density with shoulders. This figure shows why there is a debate in the literature.
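The link between bandwidth and mode count is easy to demonstrate on any sample. The sketch below (illustrative simulated data, not the Buffalo series) counts local maxima of a Gaussian-kernel estimate on a grid; for the Gaussian kernel the count is non-increasing in h, so enough smoothing always yields unimodality:

```python
import numpy as np

def n_modes(x: np.ndarray, h: float, grid_size: int = 512) -> int:
    """Count local maxima of a Gaussian-kernel density estimate."""
    g = np.linspace(x.min() - 3 * h, x.max() + 3 * h, grid_size)
    f = np.exp(-0.5 * ((g[:, None] - x[None, :]) / h) ** 2).sum(axis=1)
    mid = f[1:-1]
    return int(np.sum((mid > f[:-2]) & (mid > f[2:])))

rng = np.random.default_rng(1)
# Three nearby clusters, loosely in the spirit of the snowfall debate.
x = np.concatenate([rng.normal(25, 5, 20), rng.normal(55, 8, 25),
                    rng.normal(90, 6, 18)])

for h in (2.0, 5.0, 12.0, 30.0):
    print(f"h = {h:5.1f}: modes = {n_modes(x, h)}")
```

Small bandwidths reveal (or invent) multiple modes, while large ones merge them: exactly the unimodal-versus-trimodal tension in Figure 1.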
Given that we do not have a prior on the number of modes, we turn to our model selection criteria. Table 8 gives the values for the model selection criteria for each bandwidth for each criterion. Regardless of the criterion, the GC bandwidth is selected and the normal rule-of-thumb bandwidth is a close second. As we mentioned above, each of these bandwidths led to a unimodal density. If we believe these results, we side with authors who believe that the density of annual snowfall data is unimodal [24].

Daily stock returns
Here we look at daily returns of GameStop stock over the period 4 February 2020 to 2 February 2021. This particular stock is both statistically (for the aforementioned reasons) and economically interesting. In January 2021, GameStop became headline news as a Reddit community (r/wallstreetbets) decided to work together (many via purchases on the Robinhood app) to increase the stock price of GameStop (GME). At the time, it was known that GameStop was one of the most shorted publicly traded firms. Short sellers borrow and sell a stock when its price is high, betting that it will fall. The activity of the Reddit community resulted in a temporary squeeze on these short sellers, and some firms lost tens of billions of dollars in a very brief time period (e.g. S3 Partners). Over this period of time, the daily return hit a maximum of 135% and a minimum of roughly −60%. As expected, the bulk of the returns of the distribution are near zero, but there is a lot of activity in the tails. It seems unlikely that a Normal approximation is appropriate for this particular dataset. The data fit our setting: we have an IQR of 7.6598, a skewness of 3.8645 and an excess kurtosis of 27.4361. Figure 2 presents the kernel density estimate for our sample of 252 observations using most of the same bandwidths presented in Figure 1. We see several immediate features. First, Student-t, Asymmetric Laplace and SJ suggest a pronounced peakedness that is completely missing in the Normal ROT setup. Second, Logistic, Laplace and LSCV offer a compromise between the three aforementioned methods and the Normal ROT, picking up some peakedness, but not to the extent of Student-t/Asymmetric Laplace/SJ. Third, the three bandwidths that produced the highest peak seem to suggest the possibility of a second mode around 25.
Given that we do not know the underlying distribution, we switch our attention to the model selection criteria for bandwidth selection. Table 8 gives the values for the model selection criteria for each bandwidth for each criterion. For AIC, AIC c , GCV and RICE, the Laplace bandwidth produces the smallest value amongst ROT bandwidths. Amongst all bandwidths, LSCV produces the smallest value across these four selection criteria. As for BIC, the Logistic bandwidth produces the smallest value amongst all bandwidths. We note that visually the Laplace, Logistic and LSCV bandwidths produce very similar features in Figure 2. Each of these methods appears appropriate here, as the others tend to have spurious modes. The literature suggests that the density for a single asset should be unimodal [27].

Conclusion
In this study, we have enlarged the menu of available ROT bandwidths by considering alternative reference densities that reflect excess kurtosis and/or skewness. A technical benefit of such ROTs is that they impart simple intuition on the behaviour of the ROT bandwidth: the ROT bandwidth is monotonically decreasing in the skewness (for a given variance). This feature stems from the impact that the IQR has on the overall density. The IQR carries information about the slope of the density graph, which translates to the magnitude of the 2nd derivative of the density, the main feature of the density that determines the size of the AMISE optimal bandwidth.
Given the series of reference bandwidths, we also proposed a novel density (model) selection approach based on common model selection criterion. This model selection approach offers a new way to think about bandwidth selection amongst a set of reference densities, similar to the work of Hansen [28] for kernel order selection.
Simulations indicated that the proposed new ROTs outperform the Normal ROT that is in widespread use, especially with data whose distribution has a slim body or exhibits skewness. They also often outperform data-driven and plug-in methods that appear to continue to be plagued by unfulfilled asymptotic promises when they face finite data samples. We also demonstrated that once skewness or excess kurtosis becomes prevalent in the data, the common Normal ROT is less likely to be deemed the winner for model selection. That honour fell to the Asymmetric Laplace ROT.
Finally, we applied our bandwidth procedures to two separate empirical datasets. For each dataset, different bandwidths told a different story. Our model selection criteria picked bandwidths which aligned with results in the literature.

Notes
1. We use the term 'fat-tailed' in its most general sense, to refer to distributions that have higher than Normal kurtosis but whose moments exist. We acknowledge that in certain scientific corners, the term is used to refer only to distributions that have no moments at all. As evidence that these densities are common, consider that Katsiampa [29] uses the Student's-t distribution to study the volatility of Bitcoin and Ether cryptocurrencies; Rudy et al. [30] study a Lorenz equation with measurement noise drawn from a Student's-t distribution; Tiwari et al. [31] find that a skewed Student's-t distribution is the best fit for the residuals of the model they estimate; and Sun et al. [32] use an Asymmetric Laplace mixture model to study grayscale image segmentation. Applied examples of using these distributions as a basis for the construction of the statistical model abound.
2. Moreover, Marron and Wand's [16] approach would not yield usable ROT bandwidths, as it requires estimation of a large number of parameters of the Normal mixture, which for small samples may be estimated quite imprecisely.
3. Peakedness captures the probability of a given deviation around a point and is commonly associated with the height of the mode of a distribution. Here we define peakedness as in Birnbaum [33]: let X_1 and X_2 be real random variables and a_1 and a_2 real constants; X_1 is more peaked about a_1 than X_2 about a_2 if P(|X_1 − a_1| ≥ t) ≤ P(|X_2 − a_2| ≥ t) for all t ≥ 0.
4. See Rosenblatt [34] and Parzen [35] for the original exposition of the kernel density estimator.
5. The necessity of the reversal of the inequality comes from the fact that the integral of the 2nd derivative of a density over the support is zero for symmetric densities with infinite support, so it is not possible that f_1^{(2)}(x) > g_1^{(2)}(x) over the whole range.
6. We note here that Dharmani [36] independently derived (his Equation (7) on page 6) our series skewness/kurtosis formula. Our derivation, together with the mathematical derivations in Sections 3-5, is included in a Technical Appendix that is available from the authors upon request.
7. We note that transforming the Gram-Charlier type A expansion into an Edgeworth one, and including even four additional higher-order terms as prioritized by the latter, would not affect the result for symmetric distributions, because these additional terms are multiplied by the 3rd moment, which is zero.
8. The roughness values can be used to compute optimal bandwidths with other kernels if so desired, while the IQR values can be used to construct the adaptive variants of these ROT bandwidths.
9. This distribution has been considered by Poiraud-Casanova and Thomas-Agnan [37] in relation to quantile regression and estimation.
10. That said, we are aware that viewers are consumers and consumers may be lazy: they would prefer to bear the risk of being misled by viewing a smoothed graph rather than make the mental effort to smooth a ragged one.
11. See also McCloud and Parmeter [38].
12. We note here that none of our ROT bandwidths consider densities with more than a single peak. In practice, it may make sense to determine whether the underlying density is multimodal (see, e.g., Hall et al. [39], Henderson et al. [40] and/or Minnotte [41]).
13. Tables A1-A3 are available in the Online Supplement.
14. In all the simulations we ran, there was only one case, for a single sample size, where the GC bandwidth performed best.