Applications of Information Measures to Assess Convergence in the Central Limit Theorem

The Central Limit Theorem (CLT) is a fundamental result in statistics and econometrics, and econometricians routinely rely on it for inference in practice. Even though different conditions apply to different kinds of data, CLT results are believed to be available for a wide range of situations. This paper illustrates the use of the Kullback-Leibler Information (KLI) measure to assess how close an approximating distribution is to a true distribution, in the context of investigating how different population distributions affect convergence in the CLT. For this purpose, three nonparametric methods for estimating the KLI are proposed and investigated. The main findings of this paper are: 1) the distribution of sample means better approximates the normal distribution as the sample size increases, as expected; 2) for any fixed sample size, the distribution of means of samples from skewed distributions converges faster to the normal distribution as the kurtosis increases; 3) at least in the range of kurtosis values considered, the distribution of means of small samples generated from symmetric distributions is well approximated by the normal distribution; and 4) among the nonparametric methods used, Vasicek's (1976) estimator appears to be the best for assessing asymptotic approximations. Based on these results, recommendations are made on the minimum sample sizes required for an accurate normal approximation to the true distribution of sample means.1

1 We would like to thank two anonymous referees and Professor Farshid Vahid for their helpful comments.


INTRODUCTION
A large part of asymptotic theory is based on the CLT. However, convergence in the CLT is not uniform in the underlying distribution. There are some distributions for which the normal approximation to the distribution can be very poor. We can improve on the normal approximation using higher-order approximations, but that does not always provide good results. When higher-order terms in the expansion involve unknown parameters, the use of estimates for these parameters can sometimes worsen the approximation error rather than improve it (Rothenberg, 1984).
From time to time, researchers point out problems associated with the CLT. In contrast to textbook advice, the rate at which a sampling distribution of means converges to a normal distribution depends not only on sample size but also on the shape of the underlying population distribution. The CLT tends to work well when sampling from distributions with little skew, light tails and no outliers (Little, 2013; Wilcox, 2003; Wu, 2002). Wu (2002), in the psychological research context, discovered that sample sizes in excess of 260 can be necessary for a distribution of sample means to resemble a normal distribution when the population distribution is non-normal and samples are likely to contain outliers. Smith and Wells (2006) conducted a simulation study to generate sampling distributions of the mean from realistic non-normal parent distributions for a range of sample sizes, in order to determine when the distribution of the sample mean is approximately normal. Their findings suggest that as the skewness and kurtosis of a distribution increase, the CLT will need sample sizes of up to 300 to provide accurate inference. Other studies revealed that standard tests such as z, t and F can suffer from very inflated rates of Type 1 error when sampling from skewed distributions, even when sample sizes are as high as 100 (Bradley, 1980; Ott and Longnecker, 2010). Wilcox (2005) observed that the quality of the normal approximation cannot be ensured for highly skewed distributions when calculating confidence intervals using normal quantiles, even in moderate-sized samples (e.g. 30 or 50). Shilane et al. (2010) established that the normal confidence interval significantly under-covers the mean at moderate sample sizes, and suggested alternative estimators based on gamma and chi-square approximations along with tail probability bounds such as Bernstein's inequality. Shilane and Bean (2013) proposed another method, namely the growth estimator, which provides improved confidence intervals for the mean of negative binomial random variables with unknown dispersion. They observed that their growth estimator produces intervals that are longer and more variable than those from the normal approximation. In the censored-data context, Hong et al. (2008) pointed out that the normal approximation for confidence interval calculations can be poor when the sample size is not large or there is heavy censoring. In the context of approximating the binomial distribution, Chang et al. (2008) made similar observations.
For convenience, we use the Lindeberg-Levy CLT, which is the simplest version and applies to independent and identically distributed observations. Only one-dimensional variables are considered, again for convenience. The estimated KLI numbers are used to study how 'large' the sample size should be to obtain an accurate normal approximation. We also use this concept to identify which distributions give poor normal approximations for a particular fixed sample size.
In summary, this paper investigates four important issues: 1) how large the sample size should be for the normal distribution to provide a good approximation to the distribution of the sample mean; 2) which distribution, from a class of distributions, causes the slowest convergence in the CLT; 3) which distributions give poor normal approximations for particular fixed sample sizes; and 4) of the nonparametric methods used, which seems best for assessing asymptotic approximations.
The rest of the paper is planned as follows. Section 2 outlines the theory and the details of estimating the KLI. The design of the Monte Carlo experiments, including the data generation process, is discussed in Section 3. Section 4 reports the Monte Carlo results. Some concluding remarks are made in Section 5.

Generating observations from Tukey's Lambda (λ) distribution
Our simulation experiments used random drawings from a generalisation of Tukey's λ distribution proposed by Ramberg and Schmeiser (1972, 1974). The distribution is defined by the percentile function (the inverse of the distribution function)

R(p) = λ1 + [p^λ3 − (1 − p)^λ4] / λ2,   (1)

where p is the percentile value, λ1 is a location parameter, λ2 is a scale parameter, and λ3 and λ4 are shape parameters. It has the advantage that random drawings from this distribution can be made using (1), where p is now a random drawing from the uniform distribution on the unit interval. The density function corresponding to (1) is given by

f(z) = λ2 / [λ3 p^(λ3−1) + λ4 (1 − p)^(λ4−1)],   (2)

and can be plotted by substituting values of p in (1) to get z = R(p) and then substituting the same values of p in (2) to get the corresponding f(z) values. Ramberg et al. (1979) discuss this distribution and its potential use in some detail. They also give tables that allow one to choose λ1, λ2, λ3 and λ4 values that correspond to particular skewness and kurtosis values when the mean is zero and the variance is one.2 Therefore, by an appropriate choice of skewness and kurtosis values, a number of distributions can be approximated by a member of this family that has the same first four moments. These include the uniform, normal, Weibull, beta, gamma, log-normal and Student's t distributions. For examples of the use of this distribution in econometric simulation studies, see Evans (1992), Brooks and King (1994) and King and Harris (1995).
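As an illustration, drawing from (1) amounts to evaluating the percentile function at uniform random numbers. The Python sketch below assumes the Ramberg-Schmeiser parameterisation given in (1); the particular λ values shown are the widely quoted member of the family that approximates the standard normal, not values taken from the paper's own tables.

```python
import numpy as np

def gld_draws(lam1, lam2, lam3, lam4, size, seed=None):
    """Draw from the Ramberg-Schmeiser generalised lambda distribution by
    evaluating R(p) = lam1 + (p**lam3 - (1-p)**lam4)/lam2 at uniform p."""
    rng = np.random.default_rng(seed)
    p = rng.uniform(size=size)                      # p ~ U(0, 1)
    return lam1 + (p**lam3 - (1 - p)**lam4) / lam2

# Illustrative lambda values: the well-known approximately standard-normal
# member (mean 0, variance 1, skewness 0, kurtosis about 3). These are an
# assumption for this sketch, NOT the paper's tabulated values.
x = gld_draws(0.0, 0.1975, 0.1349, 0.1349, size=20000, seed=1)
```

Changing λ3 and λ4 to the tabulated values for a given skewness-kurtosis pair produces draws from the corresponding distribution in the same way.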

Estimation of KLI
In order to evaluate the quality of an approximating distribution, we need a convenient way to measure divergence between distributions. One such tool is the KLI measure, introduced by Kullback and Leibler (1951). Let g(x) be the true density function of a q x 1 random vector x and f_a(x) be an approximating density for x. The KLI measure is defined as

I(g; f_a) = ∫ g(x) log[g(x)/f_a(x)] dx.   (3)

Its usefulness as a measure of the quality of approximation comes from the property that I(g; f_a) ≥ 0 for all g and f_a. As observed by Renyi (1961, 1970), the KLI measure can be interpreted as the surprise experienced on average when we believe f_a(x) is the true underlying distribution and we are told it is in fact g(x). The smaller the value of I(g; f_a), the less the surprise, and the closer we consider the approximating distribution f_a(x) to be to the true distribution g(x). Also note that I(g; f_a) is the expected value of the log of the likelihood ratio which, according to the Neyman-Pearson Lemma, provides the best test of H0: the density of x is f_a(x), against H1: the density of x is g(x). Let x_1, x_2, ..., x_m be a simulated iid random sample in which x_i, i = 1, ..., m, is an n x 1 vector from either H0 or H1; then the most powerful test can be based on large values of

(1/m) Σ_{i=1}^{m} log[g(x_i)/f_a(x_i)],   (5)

which is the standard estimate of (3) from a simple random sample of size m. In this sense we feel confident in using I(g; f_a) as a measure of distance between g(x) and f_a(x). For further discussion of the KLI measure, see Kullback (1959), Renyi (1961, 1970), Vuong (1989), Maasoumi (1993) and White (1982, 1994).
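To make the sample-average estimate concrete, the following Python sketch computes the KLI between two known normal densities by averaging the log likelihood ratio over draws from g. For two unit-variance normals the true value is (μ_g − μ_a)²/2; the particular densities and sample size are illustrative choices, not those used in the paper.

```python
import numpy as np

def normal_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def kli_sample_average(x, log_g, log_fa):
    """Estimate I(g; f_a) = E_g[log g(x) - log f_a(x)] by the sample average
    of the log likelihood ratio over draws x from g; also return its SE."""
    ratios = log_g(x) - log_fa(x)
    return ratios.mean(), ratios.std(ddof=1) / np.sqrt(len(x))

# Draws from g = N(0, 1); approximating density f_a = N(0.5, 1).
# For two unit-variance normals the true KLI is (0.5 - 0)^2 / 2 = 0.125.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100000)
est, se = kli_sample_average(x,
                             lambda z: normal_logpdf(z, 0.0, 1.0),
                             lambda z: normal_logpdf(z, 0.5, 1.0))
```

With 100,000 draws the estimate lands very close to 0.125, illustrating why the sample average of the log likelihood ratio is a natural estimator of (3).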
Our aim is to estimate (3) when g(x) is not known but a simple random sample of observations from g can be taken. The negative of the first term of (3),

H(g) = −∫ g(x) log g(x) dx,   (6)

is the continuous version of the entropy of the probability density function g(x).3 When the distribution g(x) is known, the KLI measure can easily be estimated via the entropy of that known distribution. But when the true distribution g(x) is unknown, nonparametric methods are needed to estimate the unknown true distribution or its entropy. A number of nonparametric techniques are available for estimating the entropy of the true distribution; we use Vasicek's (1976) estimator because of its reliability (Atukorala, 1999; Guo et al., 2010), the kernel estimator because of its popularity and simplicity, and the Maximum Entropy (ME) principle because of its popularity.

The use of kernel density estimation (hereafter referred to as M1)
The kernel estimator is the most commonly used density estimator. Even though this method is not the best to use in all circumstances, it is widely used, particularly in the univariate case. We use this method to estimate the true density function g in equation (3). A nonparametric estimator of the Shannon entropy defined in (6), for an absolutely continuous distribution g, is given by

Ĥ(g) = −(1/m) Σ_{i=1}^{m} log ĝ(x_i),

where x_1, x_2, ..., x_m is a random sample generated from g and ĝ(x) is the kernel estimate of g (Rosenblatt, 1956; Parzen, 1962; Ahmad and Lin, 1976; Rao, 1973). Accordingly, an estimator of the KLI can be constructed in which ĝ(x) is calculated as

ĝ(x) = (1/mh) Σ_{j=1}^{m} k[(x − x_j)/h].

3 The entropy measure is nonparametric since it need not assume that the probability distribution is of any parametric form.
Thus the estimation amounts to drawing a simple random sample, estimating ĝ(x) using this sample, and then taking a second sample to calculate the KLI estimate. The kernel function k(·) and the smoothing parameter h have to be chosen appropriately. The choice of the kernel does not seem very important to the statistical performance of the estimation method; that is, the shape of the kernel does not significantly influence the final shape of the estimated density, because it determines only the local behaviour (Bolance et al., 2012). Therefore, in our study, we use the standard Gaussian density for k(·). For the normal kernel, our choice of the smoothing parameter is4

h = 1.06 σ m^(−1/5),

where σ is the standard error of the observed data and m is the number of observations in the data set. Then the KLI can be estimated as

Î = (1/m) Σ_{i=1}^{m} log[ĝ(x̄_i)/f_N(x̄_i)].   (11)

In (11), ĝ is the estimated density function of the true distribution of the means x̄_1, ..., x̄_m, where n is the size of the samples generated from Tukey's λ distribution for calculating the means, as explained in Section 2.1, and f_N is the normal density function with zero mean and variance 1/n.
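A minimal Python sketch of M1 follows: ĝ is fitted by a Gaussian kernel on one sample of means, and the KLI estimate is the average log ratio over a second sample, as in (11). The bandwidth is the standard normal-reference rule h = 1.06σm^(−1/5), which we assume here since the paper's formula is partly illegible; the benchmark case of means of n = 5 standard-normal draws (true KLI of zero) is used purely for illustration.

```python
import numpy as np

def kernel_density(data, h):
    """Return a Gaussian-kernel density estimate built from `data`."""
    def g_hat(x):
        u = (np.asarray(x)[:, None] - data[None, :]) / h
        return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))
    return g_hat

def kli_m1(sample1, sample2, log_f_n):
    """M1: fit g-hat on sample1, then average log(g-hat / f_N) over sample2."""
    h = 1.06 * sample1.std(ddof=1) * len(sample1) ** (-1 / 5)  # normal-reference rule (assumed)
    g_hat = kernel_density(sample1, h)
    return np.mean(np.log(g_hat(sample2)) - log_f_n(sample2))

# Benchmark case: means of n = 5 standard-normal draws are exactly N(0, 1/n),
# so the true KLI is zero and the estimate should be close to zero.
rng = np.random.default_rng(2)
n, m = 5, 2000
means1 = rng.normal(size=(m, n)).mean(axis=1)   # sample used to fit g-hat
means2 = rng.normal(size=(m, n)).mean(axis=1)   # second sample used in the average
log_f_n = lambda x: 0.5 * np.log(n / (2 * np.pi)) - 0.5 * n * x**2  # log N(0, 1/n)
est = kli_m1(means1, means2, log_f_n)
```

Replacing the standard-normal draws with draws from a skewed lambda distribution reproduces the kind of experiment reported in Section 4.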
We also calculated standard errors for the estimated KLI as the square root of the usual estimator of var(Î), that is, the sample variance of the terms log[ĝ(x̄_i)/f_N(x̄_i)] divided by m.

The use of the Maximum Entropy (ME) distribution (hereafter referred to as M2)
Suppose we have a simple random sample of observations from an unknown continuous distribution with range (−∞, ∞). In the ME approach, the objective is to exploit the knowledge that the parent distribution is continuous in constructing an estimated density function, written h(·). This function is derived by maximising its entropy subject to certain constraints, which reflect the knowledge of the parent distribution provided by the sample.
Calculating the univariate ME distribution amounts to ordering the sample observations as order statistics. As given by Theil and Fiebig (1984), two constraints, (i) the mass-preserving constraint and (ii) the mean-preserving constraint, have to be imposed in order to calculate the univariate ME distribution.
Then intermediate points between successive order statistics need to be defined as ξ_i = ξ(x_(i), x_(i+1)), where ξ(·, ·) is a symmetric differentiable function of its two arguments whose values are not outside the range defined by those arguments. The ME density function (Theil and Fiebig, 1984) places probability mass 1/m on each interval between successive intermediate points, with exponential tails beyond the extreme points. The ME distribution is obtained by maximising the entropy, and the value of that maximum is called the maximum entropy (18). The first term of (18), (2/m)(1 − log 2), which is called an end-term correction, results from the exponential tails. In this paper, we use (18) to estimate the entropy of the true density function involved in (3).5 This amounts to

Î = −Ĥ_ME − (1/m) Σ_{i=1}^{m} log f_N(x̄_i),   (19)

where Ĥ_ME is the maximum entropy value and f_N is the normal density function with mean zero and variance 1/n.

The use of Vasicek's entropy estimator (hereafter referred to as M3)
When our sample observations are rearranged as order statistics x_(1) < x_(2) < ... < x_(m), the entropy estimate introduced by Vasicek (1976) can be written as

H_v(m, m1) = (1/m) Σ_{i=1}^{m} log{ [m/(2 m1)] (x_(i+m1) − x_(i−m1)) },   (20)

where m1 is a positive integer smaller than m/2, and where x_(i−m1) = x_(1) for i ≤ m1 and x_(i+m1) = x_(m) for i ≥ m − m1. If the variance of the underlying distribution is finite, this estimator converges in probability to the entropy of the distribution as m → ∞, m1 → ∞ and m1/m → 0 (Vasicek, 1976). When we use Vasicek's method for estimating the KLI, we replace the first integral of (3) with minus the estimate H_v(m, m1) given by (20). The estimate of the KLI is then

Î = −H_v(m, m1) − (1/m) Σ_{i=1}^{m} log f_N(x̄_i).   (21)

In (21), an appropriate value for m1 has to be chosen. Our approach to choosing m1 is explained in the next section.
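Vasicek's spacing estimator (20) is straightforward to implement. The Python sketch below clamps the order statistics at the sample extremes, as in the definition, and checks the result on standard-normal draws, whose true entropy is 0.5 log(2πe) ≈ 1.4189; the sample size here is illustrative, while m1 = 85 mirrors the choice discussed later in the paper.

```python
import numpy as np

def vasicek_entropy(x, m1):
    """Vasicek's (1976) spacing estimator of entropy:
    H_v = (1/m) * sum_i log( m/(2*m1) * (x_(i+m1) - x_(i-m1)) ),
    with order statistics clamped at the sample minimum and maximum."""
    xs = np.sort(np.asarray(x))
    m = len(xs)
    i = np.arange(m)
    upper = xs[np.minimum(i + m1, m - 1)]   # x_(i+m1), clamped at x_(m)
    lower = xs[np.maximum(i - m1, 0)]       # x_(i-m1), clamped at x_(1)
    return np.mean(np.log(m / (2 * m1) * (upper - lower)))

# Sanity check against a known value: the entropy of N(0, 1) is
# 0.5 * log(2 * pi * e), approximately 1.4189.
rng = np.random.default_rng(3)
h_est = vasicek_entropy(rng.normal(size=20000), m1=85)
```

Subtracting the average log normal density of the simulated means, as in (21), then yields the M3 estimate of the KLI.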
Because the theoretical variances of the estimators given in (19) and (21) are complicated and difficult to derive, we use the nonparametric bootstrap method to estimate the standard errors in these cases (Efron, 1979).6 250 bootstrap samples were used in our experiments.
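The bootstrap recipe is generic: resample the data with replacement, recompute the estimator, and take the standard deviation across replications. The Python sketch below uses the sample mean as a stand-in estimator so the answer can be checked against the textbook value s/√m; the paper applies the same recipe to the estimators in (19) and (21).

```python
import numpy as np

def bootstrap_se(estimator, sample, n_boot=250, seed=None):
    """Nonparametric bootstrap standard error (Efron, 1979): resample with
    replacement, re-estimate, and take the SD across replications."""
    rng = np.random.default_rng(seed)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        reps[b] = estimator(rng.choice(sample, size=len(sample), replace=True))
    return reps.std(ddof=1)

# Illustration with the sample mean of 500 standard-normal draws, whose
# bootstrap SE should be close to s / sqrt(m).
rng = np.random.default_rng(4)
x = rng.normal(size=500)
se = bootstrap_se(np.mean, x, n_boot=250, seed=5)
```

The same call with `estimator` set to a KLI estimator reproduces the standard errors reported in the tables, at the cost of 250 re-estimations per cell.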

[Table: the grid of skewness and kurtosis values used in the experiments]
The grid of skewness and kurtosis values given above is sufficient for our purposes, because those combinations of skewness, kurtosis and sample size cover a wide range of values.
By generating random numbers for each combination of these parameters, we calculate m sample means. To determine the value of m, we computed our estimator given in (11) for a range of values of m until the estimates were stable. The values of m ranged from 2,000 to 22,000 in steps of 2,000. In the case of M1, the density functions of the true distributions of means were estimated using these m values under the different combinations of parameters given above.
Estimates of the KLI decline as the value of m increases for all methods used in the experiments. For values of m between 18,000 and 22,000, there was not much difference between the KLI estimates. For the M1 estimator, this range also seems reasonably good from the point of view of density estimation, because it gives smooth density functions for most of the parameter combinations. Considering both the density functions and the KLI estimates, we found m = 20,000 to be a reasonably good number to use in the Monte Carlo experiments. For M3, in addition to the selection of m, an appropriate value for m1 in equation (21) has to be chosen.

6 Researchers find that bootstrap standard errors perform better than conventional asymptotic standard errors in the linear regression context (Goncalves and White, 2005).
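The stability criterion for choosing m can be sketched as a simple rule: walk an increasing grid of m values and stop when consecutive estimates agree within a tolerance. The estimator and tolerance in this Python sketch are illustrative stand-ins, since the paper's own criterion is informal.

```python
import numpy as np

def stable_m(estimate, grid, tol):
    """Walk an increasing grid of m values and return the first m at which
    consecutive estimates differ by less than tol; fall back to the largest
    grid value if stability is never reached."""
    prev = None
    for m in grid:
        cur = estimate(m)
        if prev is not None and abs(cur - prev) < tol:
            return m
        prev = cur
    return grid[-1]

# Illustrative stand-in estimator: the sample variance of m standard-normal
# draws, which settles down as m grows (the paper applies the idea to the
# KLI estimate instead).
rng = np.random.default_rng(6)
m_star = stable_m(lambda m: rng.normal(size=m).var(),
                  range(2000, 22001, 2000), tol=0.01)
```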
We know that the true distribution of means of independent random samples taken from the standard normal distribution is normal, so there is no approximation error and the true value of the KLI is zero.
Therefore, estimates of the KLI in this case, with respect to the approximating normal distribution, should be near zero and typically insignificantly different from zero. Thus, the case of generating sample means from an underlying population distribution with the same first four moments as the standard normal distribution can be considered as a benchmark for comparing results and choosing an optimal m1 value. Table 1 lists some of the estimated KLI results for different values of m1 when sample sizes are 3, 5 and 10. Most of the estimates for high m1 values, such as 70 to 100, are within two standard errors of zero. Based on these results, we selected m1 = 85 as the best value to use in our experiments for estimating the KLI.

MONTE CARLO RESULTS
Selected estimates of the KLI obtained using the three methods, namely M1, M2 and M3, for different sample sizes (n) and different skewness and kurtosis values are given in Tables 2 to 5. First, we consider the results for random observations generated from Tukey's Lambda distribution with the same first four moments as the normal distribution, as they provide a benchmark for comparing the Monte Carlo results across methods. The corresponding results are given in the first column of Table 2. The KLI estimates obtained using M1 and M3 are very small and, for all sample sizes, not significantly different from zero. Compared to M2, they also have low standard errors. This implies that the normal distribution approximates the true distribution of the sample means extremely well, as expected.
M2 gives estimates that are much higher and much more variable than the other two methods. Almost all these estimates are significantly different from zero, even when the underlying population distribution's skewness is zero and its kurtosis is three, which is the case for data generation from the normal distribution (see Table 2). These results would imply that the divergence between the true distributions of sample means and the asymptotic distribution is high, even when the underlying population distribution is symmetric with the same fourth moment as the standard normal distribution. Even at the largest sample size for these moment values, 30 in our experiments, the results behave in a similar manner. M2 clearly produces biased and very variable estimates of the KLI. Thus, it is not appropriate for assessing asymptotic approximations in our setting. One reason for these biased and variable estimates is that the maximum entropy principle provides an extreme entropy estimate. It seems that the formula for the value of the maximum entropy given in (18) should not be used as an estimate of the entropy of an unknown distribution in our case.
The results obtained using the other two methods, M1 and M3, show that the distributions of means of random samples taken from symmetric distributions with zero skewness and kurtosis of three give KLI values very close to zero (see Table 2). This indicates that the normal distribution approximates the true distribution of sample means well for such symmetric distributions. This is not surprising, because the mean of a random sample taken from a (symmetric) normal distribution has a normal distribution. Looking at the small-sample results, when M3 is used and the kurtosis of the underlying population distribution increases, the KLI estimates become significantly different from zero at the 5% level of significance. However, M1 does not produce results with a similar pattern: only the estimates for kurtosis values of 8, 9 and 10 when skewness equals 0.5 are significantly different from zero at the 5% level (see Table 3). For small sample sizes such as 3 and 4, we observed that as the kurtosis of less-skewed underlying population distributions increases, the KLI estimates increase.
When sampling is done from distributions with lower skewness values such as 0, 0.5 and 1.0, some of the estimated values were small negative numbers near zero (see Tables 2 and 3). One reason for this could be sampling error, because all these negative values are insignificantly different from zero. Thus, these values can be treated as negligible positive values, because the KLI cannot be negative by definition.
Overall, the standard errors show that M3 gives estimates with much less variation than those of M1 for all the different parameter combinations, with a few exceptions. Thus M3 seems better than M1 for estimating the KLI for our purpose. Consequently, we shall now interpret the results based on M3.
According to the Monte Carlo results, the estimated KLI values range between 0 and 0.124, treating the insignificant negative values (only 3 numbers) as zero. Among the KLI estimates that are significantly different from zero, the lowest value is 0.0042 and the highest is 0.124. The lowest value occurs for a sample size of 5 when skewness is zero. If we categorise this range into sub-ranges, with higher or moderate values from 0.0042 to 0.124 and small values below 0.0042, we can discuss how well the true distribution converges to the normal distribution based on these limits. The asymptotic normal approximation seems very reasonable for distributions with very small KLI values close to zero (values less than 0.0042); the reverse holds when the KLI values are very high. In order to illustrate how well the true distributions are approximated by the normal distribution using our estimates of the KLI, we can choose an appropriate value within the range of significant KLI estimates as a threshold. The next question is, what should the threshold value be?
We observed that as the sample size increases, the values of the estimated KLI decline. Thus the lowest sample size, 3, gives the highest KLI for all kurtosis values. According to the results, as the kurtosis of the underlying population distribution increases, the estimated KLI increases, implying that the further the underlying population distribution is from the normal distribution, the larger the KLI. For the sample size of 3, the KLI estimates for kurtosis values of 3, 4, 5, 6, 7, 8 and 9 are -0.0001, 0.0008, 0.0047, 0.0097, 0.0145, 0.0189 and 0.0229, respectively. Among these, only the estimates for kurtosis values of 5, 6, 7, 8 and 9 are significantly different from zero. Of the significant values, 0.0145 corresponds to the middle kurtosis value among those giving significant estimates, and the average of the significant estimates is also nearly 0.0145. Therefore 0.0145 seems a good choice of threshold value for the KLI estimates. Thus we can use the following rule concerning the distributions of sample means:
- KLI estimate < 0.0042: well approximated by the normal distribution.
- 0.0042 ≤ KLI estimate ≤ 0.0145: reasonably approximated by the normal distribution.
- KLI estimate > 0.0145: poorly approximated by the normal distribution.
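For convenience, this three-way rule can be wrapped in a small helper that maps an estimated KLI value to the corresponding verdict, using the thresholds above:

```python
def classify_approximation(kli):
    """Map an estimated KLI value to the paper's three-way verdict on the
    quality of the normal approximation."""
    if kli < 0.0042:
        return "well approximated"
    if kli <= 0.0145:
        return "reasonably approximated"
    return "poorly approximated"
```

For example, the sample-size-3, kurtosis-6 estimate of 0.0097 reported above falls in the middle band, so `classify_approximation(0.0097)` returns "reasonably approximated".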
We find small KLI estimates, less than 0.0145, for sample sizes of 30 and above for kurtosis values of 9 and 10 when data are generated from highly skewed distributions (see Table 4). When data are generated from distributions with kurtosis of 11 and 12, the minimum sample sizes required to obtain low KLI estimates are 26 and 24, respectively. When skewness is 2, as the kurtosis of the underlying population distribution increases, the minimum sample size required for a reasonable normal approximation seems to decrease. For example, for kurtosis in the range 9-10, the minimum sample size required seems to be 30, whereas for kurtosis in the range 12-15 it falls to 24 (see Tables 4 and 5).
Based on our results, sample sizes greater than 30 can be recommended for use of the asymptotic normal approximation in the CLT when sampling from skewed and leptokurtic or medium-tailed distributions (see Table 4). However, smaller sample sizes also give a relatively good normal approximation when the population distribution's skewness is less than or equal to one. But as the skewness increases, the possibility of obtaining a good normal approximation for a small sample diminishes.7

7 For brevity these and the following results are not reported. They are available from the corresponding author.
Looking at the behaviour of the estimates as kurtosis and skewness change, for leptokurtic distributions with small skewness values, even sample sizes of 3-10 can be used for a reasonably good normal approximation. However, the mean of random samples taken from highly positively skewed distributions (for example, skewness of 2) does not have as good a normal approximation as the others. Thus, for sample sizes of around 3-20, the normal approximation cannot be recommended when sampling from such distributions, because the divergence between the true distribution and the approximating normal distribution is comparatively high. When samples are taken from skewed distributions (for example, skewness of 1.5), sample sizes below 10 may give poor normal approximations to the distributions of sample means. When sampling is done from asymmetric distributions,8 we clearly see that the KLI of the true distribution of the sample mean with respect to the normal distribution decreases and converges to zero as the sample size increases. These results are as expected under the CLT. With a threshold value of 0.0145, sample sizes of 14 or more give KLI estimates below 0.0145, and at least 18 observations should be used for the true distribution to be better approximated by the normal distribution when sampling from an underlying population distribution with skewness of 1.5. When skewness is 2, a similar pattern in the KLI estimates can be observed, but the minimum sample size required for a better normal approximation is higher. For sample sizes greater than 6-8, almost all the KLI estimates are less than 0.0145 when data are generated from distributions with skewness of 1. Therefore, these distributions appear to be reasonably approximated by the normal distribution for sample sizes greater than 8 for all the kurtosis values used in the experiments.
Based on the estimated KLI values, Table 6 summarises the minimum sample size needed for the true distribution of the sample mean to be reasonably approximated by the normal distribution, for particular choices of skewness and kurtosis values. It should be noted that these recommendations are made on the basis of the distributions used in this study; one should not assume that they extend to all distributions with these particular values of skewness and kurtosis. Obviously, the shape of the underlying population distribution influences the rate at which a sampling distribution of means converges to a normal distribution.

CONCLUSION
This paper considers three nonparametric estimators (the kernel estimator, the maximum entropy principle and Vasicek's entropy estimator) of the KLI measure to investigate how well the true distribution of means of independent random samples is approximated by the normal distribution in the context of the CLT. For this study, a range of sample sizes was used, and the samples were generated from Tukey's lambda distribution with different skewness and kurtosis values. Overall, Vasicek's entropy estimator performs better than the other methods for estimating the KLI to assess asymptotic approximations. Based on this best method, we investigate how distributions affect convergence in the CLT and identify which types of distributions give poor asymptotic approximations. As expected, the results suggest that the distribution of the sample mean is better approximated by the normal distribution as the sample size increases. We have also made some recommendations on the minimum sample sizes required for an accurate normal approximation to the true distribution of the sample mean.

8 The skewness values of 1.5-2 used in this paper can be considered as such asymmetric cases.
Our results indicate that when the sample is taken from a highly skewed distribution, the true distribution of the sample mean is better approximated by the normal distribution as the thickness of the tails of the population distribution increases. In the range of kurtosis values considered, means of small samples generated from symmetric distributions are well approximated by the normal distribution.

Table 6 : Minimum sample size needed for the true distribution of the sample mean to be reasonably approximated by the normal distribution
Note: As noted in Section 3, not all combinations of skewness and kurtosis values are estimated, which explains the missing values.