The Advantages of Using Group Means in Estimating the Lorenz Curve and Gini Index From Grouped Data

A recent article proposed a histogram-based method for estimating the Lorenz curve and Gini index from grouped data that did not use the group means reported by government agencies. When comparing their method to one based on group means, the authors assume a uniform density in each grouping interval, which leads to an overestimate of the overall average income. After reviewing the additional information in the group means, it will be shown that as the number of groups increases, the bounds on the Gini index obtained from the group means become narrower. This is not necessarily true for the histogram method. Two simple interpolation methods using the group means are described, and the Gini estimates they yield, along with the histogram-based estimate, are compared with the published Gini index for the 1967–2013 period. The average absolute errors of the estimated Gini index obtained from the two methods using group means are noticeably smaller than that of the histogram-based method. Supplementary materials for this article are available online. [Received August 2014. Revised September 2015.]


Merritt Lyon is PhD Student (E-mail: mlyon37@gwu.edu), Li C. Cheung is PhD Student (E-mail: licheung@gwmail.gwu.edu), and Joseph L. Gastwirth is Professor (E-mail: jlgast@gwu.edu), Department of Statistics, 739 Phillips Hall, George Washington University, Washington, DC, 20052. It is a pleasure to thank Mr. Brian Dumbacher for many helpful discussions. In particular, the research in Section 6 could not have been carried out without his invaluable knowledge of the income data. The authors thank the referees and editors for many useful suggestions that improved the article and Ms. Jie Cong for several helpful discussions. Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/r/tas.

INTRODUCTION
The Lorenz curve and Gini index are used in a wide variety of areas, for example, economics (Sen 1973; Nygard and Sandstrom 1981; Atkinson 1983; Kleiber and Kotz 2003; Cowell 2011), genetics (Gianola, Perez-Enciso, and Toro 2003), summarizing insurance scores (Frees, Meyers, and Cummings 2011), equity and efficiency of the allocation process for donor kidneys (Massie et al. 2009), maximizing the benefit of public health (Gail 2009), and statistical analyses (Yitzhaki and Schechtman 2013). They were originally proposed to measure the inequality of income and wealth, and government agencies throughout the world use the Gini index as their main summary measure. Gini indexes of less than 0.3, 0.3-0.399, 0.4-0.499, and 0.5 or greater correspond to low, medium, high, and very high income inequality, respectively (Conference Board of Canada 2011). To preserve the confidentiality of income and other sensitive data, government agencies often publish it in grouped form. The household income data from the Current Population Survey (U.S. Census Bureau 2011a) is a well-known example. A method for deriving upper and lower bounds on the Lorenz curve and Gini index from grouped data with provided group means was proposed by Gastwirth (1972) and refined by Mehran (1975), Gastwirth and Krieger (1975), Giorgi and Pallini (1987), and Silber (1990).
Recently, Tillé and Langel (2012) proposed the use of a histogram-based density to estimate the Lorenz curve and Gini index from grouped data. Their method does not require the group means, and they provided an example showing that their estimate of the Gini index of the 2010 U.S. household income distribution using data from the Current Population Survey organized into four selected groups is 0.4874. This estimate is greater than the lower bound 0.4414 yielded by the method in Gastwirth (1972). They reported that the formula for the upper bound in Gastwirth (1972) gives 0.5565 and noted that the difference in the estimates is far from negligible. However, in their calculation of the Gini bounds, Tillé and Langel (2012) did not use the reported group means, which enter into the formulas given in Section 2. Incorporating the improved bounds in Gastwirth (1972) for densities having a decreasing hazard rate in the tail yields an upper bound of 0.4964, which is noticeably less than 0.5565. Since the U.S. Census Bureau reports the income data in 42 groups along with the group means (U.S. Census Bureau 2011b), it is useful to see how accurately the various estimates of the Gini index approach the reported estimate, 0.469 (DeNavas-Walt, Proctor, and Smith 2011), as the number of groups increases. It will be seen that as the number of groups increases, the distance between the upper and lower bounds from the Gastwirth (1972) method decreases and the bounds always contain the value 0.469. On the other hand, the Gini index obtained from the histogram-based approach does not necessarily become more accurate as the number of groups increases; indeed, it is less than the lower bound obtained using the group means when the full dataset with 42 groups is used.
The concepts and notation are introduced in Section 2. The extra information in the group means is defined, and Tillé and Langel's histogram-based method is described. In Section 3, a linear interpolation method and the split histogram method for estimating the Lorenz curve and Gini index using the group means are described. They avoid making parametric assumptions that may be unreliable (Schader and Schmid 1994). Formulas for the estimated Lorenz curve and Gini index using the linear interpolation method are derived in Section 4. In Section 5, these methods are compared to the histogram approach of Tillé and Langel (2012). The full publicly available dataset for U.S. household income in 2010 is analyzed and regrouped into fewer intervals to illustrate how the bounds become more accurate as the number of groups increases. Mathematically, it is demonstrated that when a grouping interval is divided into two subgroups, the distance between the bounds derived using the group means can only decrease. Another issue that arises in comparing the methods is that Tillé and Langel (2012) arbitrarily set the largest income to $500,000. Assuming different values for the upper bound, which are consistent with the largest actual observation (over $1,000,000), severely affects the Gini index obtained from the histogram approach. In Section 6, it is seen how the Gini bounds can be used to choose more informative groupings.

Concepts
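Given a population of values with distribution function F and mean μ, the Lorenz curve is defined by (Gastwirth 1971)
\[ L(p) = \frac{1}{\mu}\int_0^p F^{-1}(t)\,dt, \qquad 0 \le p \le 1, \tag{1} \]
and represents the share of the total held by the lowest 100p% of the distribution. The Gini index G equals twice the area between the line of equality h(p) = p and the Lorenz curve L(p):
\[ G = 2\int_0^1 \left[p - L(p)\right] dp. \]
G also is the ratio of the mean difference to twice the mean:
\[ G = \frac{\Delta}{2\mu}. \tag{2} \]
Here, \(\Delta\) equals \(E|X_1 - X_2|\), where \(X_1\) and \(X_2\) are iid with distribution F. An alternative expression for \(\Delta\) is (Stuart and Ord 1987)
\[ \Delta = 4\int x\left[F(x) - \tfrac{1}{2}\right] dF(x). \tag{3} \]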
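As a numerical illustration of the two characterizations of G above, the following Python sketch computes the Gini index of a small sample both from the mean difference and from the area under the empirical Lorenz curve. The function names and the five income values are hypothetical and are not taken from the article; the sample is treated as the whole population with equal weights, so the mean difference averages over all ordered pairs, including a value paired with itself.

```python
from itertools import accumulate


def gini_mean_difference(x):
    """G = Delta / (2 * mu), with Delta the average of |x_i - x_k| over all ordered pairs."""
    n = len(x)
    mu = sum(x) / n
    delta = sum(abs(xi - xk) for xi in x for xk in x) / (n * n)
    return delta / (2 * mu)


def gini_lorenz_area(x):
    """G = 1 - 2 * (area under the empirical Lorenz curve), using the trapezoid rule."""
    xs = sorted(x)
    n = len(xs)
    total = sum(xs)
    # Lorenz ordinates at p = 0, 1/n, ..., 1.
    lorenz = [0.0] + [s / total for s in accumulate(xs)]
    area = sum((lorenz[i] + lorenz[i + 1]) / (2 * n) for i in range(n))
    return 1 - 2 * area


incomes = [10, 20, 30, 40, 100]  # hypothetical values
g_delta = gini_mean_difference(incomes)
g_area = gini_lorenz_area(incomes)
```

For an equally weighted sample the two routes agree, because the empirical Lorenz curve is piecewise linear and the trapezoid rule integrates it exactly.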

Sample Survey Context
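Consider a sample of n observations, \(x_1, \ldots, x_n\), from a large finite population, whose values are grouped into J intervals \([a_{j-1}, a_j)\), \(j = 1, \ldots, J\). For application to income data, the first group j = 1 is bounded below with \(a_0 > -\infty\), and the final group j = J is unbounded with \(a_J = \infty\). Let \(\hat f_j\) be the estimated proportion contained in group j. Then \(\hat F_j = \sum_{k=1}^{j} \hat f_k\) is the estimated proportion less than \(a_j\), where \(\hat F_0 = 0\) and \(\hat F_J = 1\). Let \(\bar x_j\) be the mean of group j. Then the sample mean is \(\bar x = \sum_{j=1}^{J} \hat f_j \bar x_j\). Finally, let \(x^C_j = (a_{j-1} + a_j)/2\) denote the mid-point of interval j. Table 1 contains an example of grouped income data from the Historical Income Tables of the Current Population Survey (U.S. Census Bureau 2014). From this information, one can estimate the underlying density of the data and then apply the formulas in Section 2.1 to obtain estimates of the Lorenz curve and Gini index.

Estimating the Gini Index when Group Means are Reported
Yntema (1933) showed that the mean difference can be expressed as
\[ \Delta = \sum_{j=1}^{J}\sum_{k=1}^{J} f_j f_k\,|\mu_j - \mu_k| + \sum_{j=1}^{J} f_j^2\,\Delta^*_j, \tag{4} \]
where \(f_j = F(a_j) - F(a_{j-1})\) is the proportion in group j, \(\mu_j = \int_{a_{j-1}}^{a_j} x\,dF(x)/f_j\) is the mean of group j, and \(\Delta^*_j\) is the mean difference in group j. Ignoring the second term in (4), which assumes all the values in each group are equal, the Gini index is underestimated by the "grouping correction" (Goldsmith et al. 1954)
\[ D = \frac{1}{2\mu}\sum_{j=1}^{J} f_j^2\,\Delta^*_j. \]
Gastwirth (1972) showed that D is bounded above by
\[ \bar D = \frac{1}{\mu}\sum_{j=1}^{J} f_j^2\,\frac{(\mu_j - a_{j-1})(a_j - \mu_j)}{a_j - a_{j-1}}, \tag{5} \]
where, for the unbounded group j = J, the ratio is read as its limit \(\mu_J - a_{J-1}\). This yields lower and upper bounds for the Gini index:
\[ GL = \frac{1}{2\mu}\sum_{j=1}^{J}\sum_{k=1}^{J} f_j f_k\,|\mu_j - \mu_k| \;\le\; G \;\le\; GL + \bar D = GU. \tag{6} \]
If one is willing to assume that the density function in the last interval is decreasing or has a decreasing hazard rate, Gastwirth (1972) provided tighter bounds. When group means are reported, \(\mu\), \(\mu_j\), and \(f_j\) are replaced by their sample counterparts \(\bar x\), \(\bar x_j\), and \(\hat f_j\), so that the natural estimate of \(\bar D\) is
\[ \hat{\bar D} = \frac{1}{\bar x}\sum_{j=1}^{J} \hat f_j^2\,\frac{(\bar x_j - a_{j-1})(a_j - \bar x_j)}{a_j - a_{j-1}}. \]
One can then estimate the lower and upper Gini bounds by
\[ \widehat{GL} = \frac{1}{2\bar x}\sum_{j=1}^{J}\sum_{k=1}^{J} \hat f_j \hat f_k\,|\bar x_j - \bar x_k| \quad\text{and}\quad \widehat{GU} = \widehat{GL} + \hat{\bar D}. \tag{9} \]
Cowell and Mehta (1982) and Needleman (1978) showed that the Gini index is accurately estimated by the linear combination \(\tfrac{1}{3}GL + \tfrac{2}{3}GU\).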
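The bounds described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' code: the function name `gini_bounds` and its argument layout are hypothetical, the bound on each within-group mean difference is taken to be 2(μ_j − a_{j−1})(a_j − μ_j)/(a_j − a_{j−1}), and for the open-ended last group that ratio is replaced by its limit μ_J − a_{J−1}. The example data are the exact group proportions and means of a unit exponential distribution (true Gini index 0.5) cut at 1, chosen only because they can be written in closed form.

```python
import math


def gini_bounds(cuts, f, means):
    """Lower and upper Gini bounds from grouped data with group means.

    cuts  : cut points a_0 < a_1 < ... < a_J (a_J may be math.inf)
    f     : group proportions, summing to 1
    means : group means
    """
    n_groups = len(f)
    mu = sum(fj * mj for fj, mj in zip(f, means))
    # Between-group part of the mean difference (within-group spread ignored).
    between = sum(f[j] * f[k] * abs(means[j] - means[k])
                  for j in range(n_groups) for k in range(n_groups))
    lower = between / (2 * mu)
    # Upper bound on the grouping correction.
    d_bar = 0.0
    for j in range(n_groups):
        lo, hi = cuts[j], cuts[j + 1]
        if math.isinf(hi):
            spread = means[j] - lo  # limit of the finite-interval ratio
        else:
            spread = (means[j] - lo) * (hi - means[j]) / (hi - lo)
        d_bar += f[j] ** 2 * spread
    d_bar /= mu
    return lower, lower + d_bar


# Unit exponential grouped into [0, 1) and [1, inf): exact proportions and means.
e1 = math.exp(-1.0)
cuts = [0.0, 1.0, math.inf]
f = [1.0 - e1, e1]
means = [(1.0 - 2.0 * e1) / (1.0 - e1), 2.0]
gl, gu = gini_bounds(cuts, f, means)
combined = gl / 3 + 2 * gu / 3  # the 1/3, 2/3 combination of the two bounds
```

With only two groups the interval [gl, gu] is wide, but it contains the true value 0.5.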

The Additional Information in the Group Means
Denote the survival function by \(\bar F(x) = 1 - F(x)\). Gastwirth and Krieger (1975) showed that the area under \(\bar F(x)\) over the interval \([a_{j-1}, a_j]\) is given by
\[ \int_{a_{j-1}}^{a_j} \bar F(x)\,dx = f_j(\mu_j - a_{j-1}) + (a_j - a_{j-1})\,\bar F(a_j). \tag{10} \]
Suppose one is trying to determine F based on the values \(F(a_j)\) and \(\mu_j\). When only the information contained in the \(F(a_j)\) is used, any cdf H satisfying \(H(a_j) = F(a_j)\), \(j = 1, \ldots, J\), fits the data. Using the additional information contained in the group means \(\mu_j\), the possible cdfs H consistent with the data must also satisfy the area constraint (10) in each of the J intervals, that is,
\[ \int_{a_{j-1}}^{a_j} \left[1 - H(x)\right] dx = \int_{a_{j-1}}^{a_j} \left[1 - F(x)\right] dx, \qquad j = 1, \ldots, J. \]
Krieger (1983) provided several illustrative examples of the gain in information provided by the group means.
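The area result can be checked by integration by parts. The following is a sketch of that calculation for a finite interval, writing \bar F(x) = 1 - F(x) for the survival function (a shorthand assumed here) and using the definitions f_j = F(a_j) - F(a_{j-1}) and f_j μ_j = ∫ x dF(x) over the interval.

```latex
\begin{align*}
\int_{a_{j-1}}^{a_j} \bar F(x)\,dx
  &= \Big[\,x\,\bar F(x)\,\Big]_{a_{j-1}}^{a_j} + \int_{a_{j-1}}^{a_j} x\,dF(x) \\
  &= a_j\,\bar F(a_j) - a_{j-1}\,\bar F(a_{j-1}) + f_j\,\mu_j \\
  &= (a_j - a_{j-1})\,\bar F(a_j) + f_j\,(\mu_j - a_{j-1}),
\end{align*}
% The last step uses \bar F(a_{j-1}) = \bar F(a_j) + f_j.
```

The final line shows why the group mean adds information: once the values of F at the cut points are fixed, the area under the survival function on the interval is determined by μ_j alone.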

The Histogram-Based Interpolation Method
Tillé and Langel (2012), hereafter T&L, proposed a histogram-based method for estimating the Lorenz curve and Gini index that does not require the group means. Instead, they chose an arbitrary upper bound of $500,000 for the income distribution and assumed that the values in each interval follow a uniform distribution. Under this assumption, the mid-point \(x^C_j\) is the mean of group j, the estimated overall mean income is
\[ \hat\mu_{TL} = \sum_{j=1}^{J} \hat f_j\, x^C_j, \]
and the estimated density function on \([a_{j-1}, a_j)\) is \(\hat f_j/(a_j - a_{j-1})\). The estimated Lorenz curve obtained from formula (1) is piecewise quadratic. The corresponding estimate of the Gini index is
\[ \hat G_{TL} = \frac{1}{2\hat\mu_{TL}}\left[\sum_{j=1}^{J}\sum_{k=1}^{J} \hat f_j \hat f_k\,|x^C_j - x^C_k| + \frac{1}{3}\sum_{j=1}^{J} \hat f_j^2\,(a_j - a_{j-1})\right]. \]
T&L estimated lower and upper bounds for the Gini index by replacing the mean income and group means in (6) with the estimated overall mean income \(\hat\mu_{TL}\) and group mid-points \(x^C_j\), respectively.
The T&L method can be refined by fitting an exponential tail to the last income group using the frequencies in the final two intervals \([a_{J-2}, a_{J-1})\) and \([a_{J-1}, \infty)\). The exponential density is anchored at the second-to-last cut point \(a_{J-2}\), and its parameters η and λ are estimated from \(\hat f_{J-1}\) and \(\hat f_J\) by equating the fitted probability of each of these two intervals to the corresponding observed proportion.
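To make the dependence on the assumed maximum income concrete, the histogram-based estimate can be sketched as follows. This is an illustration under stated assumptions, not T&L's implementation: the function name `histogram_gini` is hypothetical, the estimate is computed from the mean-difference decomposition with a uniform density in every interval (whose within-group mean difference is one third of the interval length), and the group proportions are invented for the example. Amounts are in thousands of dollars.

```python
def histogram_gini(cuts, f):
    """Gini index of a histogram density.

    cuts : finite cut points a_0 < ... < a_J; a_J is an assumed maximum income
    f    : group proportions, summing to 1
    """
    n_groups = len(f)
    mids = [(cuts[j] + cuts[j + 1]) / 2 for j in range(n_groups)]
    mu = sum(fj * m for fj, m in zip(f, mids))
    # Between-group term, with mid-points standing in for the group means.
    between = sum(f[j] * f[k] * abs(mids[j] - mids[k])
                  for j in range(n_groups) for k in range(n_groups))
    # Within-group term: a uniform density on an interval of length L has mean difference L / 3.
    within = sum(f[j] ** 2 * (cuts[j + 1] - cuts[j]) / 3 for j in range(n_groups))
    return (between + within) / (2 * mu)


# Hypothetical proportions for four groups cut at 50, 100, and 200 (thousand dollars).
props = [0.50, 0.30, 0.15, 0.05]
g_500 = histogram_gini([0, 50, 100, 200, 500], props)
g_1000 = histogram_gini([0, 50, 100, 200, 1000], props)
```

Only the assumed top cut point differs between the two calls, yet the estimate moves noticeably, which is the sensitivity discussed in the text.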

A LINEARLY INTERPOLATED DENSITY USING THE GROUP MEANS
An interpolation method should account for the finite intervals and unbounded tail. A linear density was assumed for the finite intervals. Because most of the distributions, for example, Pareto (Arnold 2015) and lognormal, used to model the upper tail of the income distribution have a decreasing hazard rate and the exponential distribution provides a bound on the tail probabilities for such distributions (Barlow and Proschan 1965), the exponential distribution will be fit to the unbounded right tail.

Finite Intervals
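For group j < J, a linear density is assumed:
\[ \hat h_j(x) = \alpha_j + \beta_j x, \qquad a_{j-1} \le x < a_j. \tag{13} \]
The estimated density \(\hat h_j(x)\) must satisfy three constraints:
\[ \hat h_j(x) \ge 0, \qquad \int_{a_{j-1}}^{a_j} \hat h_j(x)\,dx = \hat f_j, \qquad \int_{a_{j-1}}^{a_j} x\,\hat h_j(x)\,dx = \hat f_j\,\bar x_j. \tag{14} \]

Estimation
Solving for \(\alpha_j\) and \(\beta_j\) in (13) and (14) yields
\[ \hat\beta_j = \frac{12\,\hat f_j\,(\bar x_j - x^C_j)}{(a_j - a_{j-1})^3}, \qquad \hat\alpha_j = \frac{\hat f_j}{a_j - a_{j-1}} - \hat\beta_j\,x^C_j. \]
The sign of \(\hat\beta_j\) indicates whether the group mean is greater than the group mid-point, and the intercept \(\hat\alpha_j\) is the usual histogram density \(\hat f_j/(a_j - a_{j-1})\) adjusted by a term that accounts for the slope. Only if \(\bar x_j = x^C_j\) does (13) become the uniform density used by T&L.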
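The fit of the linear density on a finite interval can be sketched as below. The sketch assumes the density is parameterized as an intercept plus a slope times x and is determined by matching the group proportion and the group mean; the function name `linear_density` and the example numbers are hypothetical.

```python
def linear_density(a_lo, a_hi, f, xbar):
    """Intercept and slope of the line alpha + beta * x on [a_lo, a_hi)
    whose integral is f and whose first moment is f * xbar."""
    length = a_hi - a_lo
    mid = (a_lo + a_hi) / 2
    beta = 12 * f * (xbar - mid) / length ** 3
    alpha = f / length - beta * mid
    return alpha, beta


# Hypothetical group: interval [0, 1), proportion 0.6, mean 0.45 (below the mid-point).
alpha, beta = linear_density(0.0, 1.0, 0.6, 0.45)

# Closed-form checks of the two matching conditions on [0, 1).
mass = alpha * 1.0 + beta * (1.0 ** 2) / 2
first_moment = alpha * (1.0 ** 2) / 2 + beta * (1.0 ** 3) / 3
```

A group mean below the mid-point gives a negative slope, and a group mean equal to the mid-point gives a zero slope, that is, the uniform density.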

Nonnegativity
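Cowell (2011) warned readers that the above estimated density is not guaranteed to be nonnegative. Lemma 1 states that the density will be nonnegative as long as the mean is contained in the middle third of the interval:
Lemma 1. If \((2a_{j-1} + a_j)/3 \le \bar x_j \le (a_{j-1} + 2a_j)/3\), then \(\hat h_j(x) \ge 0\) for all \(x \in [a_{j-1}, a_j)\).
Proof. See the Appendix.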
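The middle-third condition can be motivated by a short calculation. The sketch below assumes the linear density is an intercept plus a slope times x, fitted by matching the group proportion and group mean; the shorthand L_j for the interval length is introduced here only for illustration, and the published proof may proceed differently.

```latex
% Fitted line written around the mid-point x_j^C, with L_j = a_j - a_{j-1}:
\[
\hat h_j(x) = \frac{\hat f_j}{L_j} + \hat\beta_j\,(x - x_j^C),
\qquad
\hat\beta_j = \frac{12\,\hat f_j\,(\bar x_j - x_j^C)}{L_j^{3}} .
\]
% A line on an interval attains its minimum at an end point, so nonnegativity requires
\[
\frac{\hat f_j}{L_j} - |\hat\beta_j|\,\frac{L_j}{2} \ge 0
\iff |\bar x_j - x_j^C| \le \frac{L_j}{6}
\iff a_{j-1} + \frac{L_j}{3} \le \bar x_j \le a_j - \frac{L_j}{3}.
\]
```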

Split Histogram
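An alternative method that ensures a nonnegative density is the "split histogram density" method given by Cowell and Mehta (1982) and Cowell (2011). Given \(\bar x_j\) and \(\hat f_j\), the split histogram density over \([a_{j-1}, a_j)\) is
\[ \hat h_j(x) = \begin{cases} \dfrac{\hat f_j\,(a_j - \bar x_j)}{(a_j - a_{j-1})(\bar x_j - a_{j-1})}, & a_{j-1} \le x < \bar x_j, \\[2ex] \dfrac{\hat f_j\,(\bar x_j - a_{j-1})}{(a_j - a_{j-1})(a_j - \bar x_j)}, & \bar x_j \le x < a_j. \end{cases} \]
When the conditions of Lemma 1 are not satisfied, Hoffman (1984) modified the linear interpolator by setting it equal to zero in an appropriate sub-region of an interval.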
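A minimal sketch of the split histogram on one interval follows. It assumes the usual two-step form, with one constant height below the group mean and another above it, chosen so that the group proportion and group mean are both reproduced; the function name `split_histogram` and the example numbers are hypothetical, and the group mean is assumed to lie strictly inside the interval.

```python
def split_histogram(a_lo, a_hi, f, xbar):
    """Heights of the two-step density on [a_lo, xbar) and [xbar, a_hi).

    Requires a_lo < xbar < a_hi. Both heights are positive by construction.
    """
    length = a_hi - a_lo
    lower_height = f * (a_hi - xbar) / (length * (xbar - a_lo))
    upper_height = f * (xbar - a_lo) / (length * (a_hi - xbar))
    return lower_height, upper_height


# Hypothetical group: interval [0, 1), proportion 0.6, mean 0.4.
lo_h, hi_h = split_histogram(0.0, 1.0, 0.6, 0.4)

# Mass and first moment of the fitted two-step density.
mass = lo_h * (0.4 - 0.0) + hi_h * (1.0 - 0.4)
first_moment = lo_h * (0.4 ** 2 - 0.0 ** 2) / 2 + hi_h * (1.0 ** 2 - 0.4 ** 2) / 2
```

Unlike the linear fit, the heights stay positive wherever the mean falls inside the interval, which is the property that motivates the method.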

Unbounded Interval
The final interval \([a_{J-1}, \infty)\) has unbounded support. Cowell and Mehta (1982) chose an upper bound B for their split histogram density method and explored the sensitivity of their results by replacing B by 3B/4 and 2B. Here an exponential density \(\hat h_J(x)\) with parameters η and λ is assumed on the final interval, where η and λ are estimated from \(\hat f_J\) and \(\bar x_J\). The estimating equations are
\[ \int_{a_{J-1}}^{\infty} \hat h_J(x)\,dx = \hat f_J, \qquad \int_{a_{J-1}}^{\infty} x\,\hat h_J(x)\,dx = \hat f_J\,\bar x_J. \]

Figure 1 plots the linear interpolation, T&L histogram, and split histogram densities for the grouping selected by T&L. For all three methods, an exponential density is fit to the final unbounded interval. The linear interpolation and split histogram densities overlap in this interval. In a typical interval where the density is decreasing, the Lorenz curve of the split histogram method is slightly higher than that of the linear interpolator at the beginning of an interval, then drops below and catches up at the end of the interval.
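The exponential fit to the open-ended group can be sketched as follows. The parameterization is an assumption made for illustration: the tail density is taken to be η·exp(−λ(x − a_{J−1})) on [a_{J−1}, ∞), so that η is the density at the last cut point and λ is the rate, and the two parameters are chosen to reproduce the group proportion and group mean. The function name `fit_exponential_tail` and the numbers are hypothetical.

```python
def fit_exponential_tail(a_last, f_last, xbar_last):
    """Return (eta, lam) for the density eta * exp(-lam * (x - a_last)), x >= a_last.

    Matching conditions:
      mass         eta / lam                      = f_last
      first moment (eta / lam) * (a_last + 1/lam) = f_last * xbar_last
    """
    lam = 1.0 / (xbar_last - a_last)  # mean excess over the cut point is 1 / lam
    eta = f_last * lam
    return eta, lam


# Hypothetical top group: incomes of 200 (thousand dollars) or more,
# containing 5% of households with a group mean of 350.
eta, lam = fit_exponential_tail(200.0, 0.05, 350.0)
tail_mass = eta / lam
tail_mean = 200.0 + 1.0 / lam
```

Because the fit uses the reported mean of the top group, no maximum income has to be assumed.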

ESTIMATES OF THE LORENZ CURVE AND GINI INDEX FOR THE LINEAR INTERPOLATION METHOD
All of the estimated densities in Section 3 yield estimates of the quantile function and cdf. In turn, an estimate of the Lorenz curve is obtained via (1), and an estimate of the Gini index is obtained via (2) and (3).

Estimate of the Quantile Function
Any value \(p \in [0, 1)\) will belong to one of the J intervals \([\hat F_{j-1}, \hat F_j)\), where \(\hat F_0 = 0\) and \(\hat F_J = 1\). The case of a finite interval with a linear density is considered first, followed by the case of the unbounded interval with an exponential tail.

Unbounded Interval
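If \(p \in [\hat F_{J-1}, 1)\), then the corresponding quantile \(x^*\) of the fitted exponential, which places mass \(\hat f_J\) on \([a_{J-1}, \infty)\) and has mean \(\bar x_J\) there, satisfies
\[ 1 - p = \hat f_J \exp\left\{-\frac{x^* - a_{J-1}}{\bar x_J - a_{J-1}}\right\}. \]
Solving for \(x^*\) gives
\[ x^* = a_{J-1} + (\bar x_J - a_{J-1})\,\ln\frac{\hat f_J}{1 - p}. \]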
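The quantile in the unbounded interval can be sketched as below. The sketch assumes the fitted exponential tail places mass equal to the group proportion on the last interval and has mean excess equal to the group mean minus the last cut point; the function name `tail_quantile` and the numbers are hypothetical.

```python
import math


def tail_quantile(p, a_last, f_last, xbar_last):
    """Quantile x* with estimated cdf equal to p, for 1 - f_last <= p < 1.

    Solves 1 - p = f_last * exp(-(x* - a_last) / theta), theta = xbar_last - a_last.
    """
    theta = xbar_last - a_last
    return a_last + theta * math.log(f_last / (1.0 - p))


# Hypothetical top group: cut point 200, proportion 0.05, group mean 350.
q_start = tail_quantile(0.95, 200.0, 0.05, 350.0)  # p at the start of the tail
q_99 = tail_quantile(0.99, 200.0, 0.05, 350.0)
```

At p equal to one minus the group proportion the quantile is the cut point itself, and it increases without bound as p approaches 1.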

Estimate of the Lorenz Curve
The estimate of the Lorenz curve is derived from (1) and the results in Section 4.1. See the Web Appendix for the formulas.

Estimate of the Gini Index
It is convenient to estimate the Gini index using Equations (2) and (3), which requires finding the estimated cdf. Given the linear density (13), the cdf is piecewise quadratic. For \(a_{j-1} \le x < a_j\) with j < J, the cdf equals
\[ \hat F(x) = \hat F_{j-1} + \hat\alpha_j\,(x - a_{j-1}) + \frac{\hat\beta_j}{2}\,(x^2 - a_{j-1}^2), \]
and for the final unbounded interval,
\[ \hat F(x) = 1 - \hat f_J \exp\left\{-\frac{x - a_{J-1}}{\bar x_J - a_{J-1}}\right\}, \qquad x \ge a_{J-1}. \]
Using this cdf, one calculates the integrals \(\hat I_j = \int_{a_{j-1}}^{a_j} x\,\hat F(x)\,d\hat F(x)\). The estimated mean difference is obtained from (3). For details see the Web Appendix. Using (2), the resulting estimate of the Gini index is given by
\[ \hat G = \frac{2}{\bar x}\sum_{j=1}^{J} \hat I_j - 1. \]

The difference between the estimated Gini bounds (9) on G always decreases when a group is split into two. Formally, as the number r of groups increases, one has

Lemma 2. Define \(GL_r\) and \(GU_r\) to be the lower and upper bounds (9) on the Gini index, respectively, when the number of groups equals r and \(GL_{r+1}\) and \(GU_{r+1}\) to be the bounds when a group is split. Then
\[ GL_r \le GL_{r+1} \quad\text{and}\quad GU_{r+1} \le GU_r. \]
The inequalities are strict if the distribution has positive density in both of the groups created by the split.
Proof. See Web Appendix.
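Returning to the point estimate of this section, the calculation of the integrals of x times the cdf and the resulting Gini estimate can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the finite intervals use a line fitted to the group proportion and mean, the last interval uses an exponential tail fitted to its proportion and mean, the mean difference is taken to be four times the sum of the integrals minus twice the mean, and all function names are hypothetical. The test data are the exact group proportions and means of a unit exponential distribution, whose Gini index is 0.5.

```python
import math


def _poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for k, qk in enumerate(q):
            out[i + k] += pi * qk
    return out


def _poly_integral(p, upper):
    """Integral over [0, upper] of sum_i p[i] * u**i."""
    return sum(c * upper ** (i + 1) / (i + 1) for i, c in enumerate(p))


def linear_interp_gini(cuts, f, means):
    """cuts: a_0 < ... < a_{J-1} followed by math.inf; f, means: group proportions and means."""
    n_groups = len(f)
    xbar = sum(fj * mj for fj, mj in zip(f, means))
    total_i = 0.0
    cum = 0.0  # estimated cdf at the left end of the current interval
    for j in range(n_groups - 1):
        lo, hi = cuts[j], cuts[j + 1]
        length, mid = hi - lo, (lo + hi) / 2
        beta = 12 * f[j] * (means[j] - mid) / length ** 3
        h0 = f[j] / length - beta * (mid - lo)  # fitted density at the left end point
        # With u = x - lo: x = lo + u, cdf = cum + h0*u + beta*u^2/2, density = h0 + beta*u.
        integrand = _poly_mul(_poly_mul([lo, 1.0], [cum, h0, beta / 2]), [h0, beta])
        total_i += _poly_integral(integrand, length)
        cum += f[j]
    # Exponential tail on [a_{J-1}, inf): closed form of the integral of x * cdf * density.
    lo, f_last = cuts[n_groups - 1], f[-1]
    theta = means[-1] - lo
    total_i += f_last * (lo + theta) - f_last ** 2 * (lo / 2 + theta / 4)
    return 2 * total_i / xbar - 1


e1 = math.exp(-1.0)
g_two_groups = linear_interp_gini([0.0, 1.0, math.inf],
                                  [1.0 - e1, e1],
                                  [(1.0 - 2.0 * e1) / (1.0 - e1), 2.0])
```

In this example the tail is reproduced exactly, so the small remaining error comes from approximating the exponential density on [0, 1) by a line.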

DATA ANALYSIS
Tables 2 and 3 contain estimates of the Gini index (12) derived from the following methods:
1. T&L's histogram-based method using two choices of upper bound on income: $500,000 and $1,000,000.
2. The modified T&L method using an exponential tail.
3. The linear interpolation method that uses an exponential tail.
4. The split histogram method using two choices of upper bound on income: $500,000 and $1,000,000.
5. The modified split histogram using an exponential tail.
6. 1/3 the Gini lower bound + 2/3 the Gini upper bound.

Comparison of the Estimates to the Reported Gini Index of 0.469
In Table 2, four representative income groupings of the 2010 U.S. Census Bureau data were examined for each method:
1. The four groups selected by T&L with cut points at $50,000, $100,000, and $200,000.
2. The historical six groups given in Table 1.
3. Twelve groups with cut points at $20,000 increments from $20,000 to $200,000 and at $250,000.
4. The full 42 groups used by the Census Bureau with cut points at $5000 increments from $5000 to $200,000 and at $250,000 (U.S. Census Bureau 2011b).

Comparison of the Estimates to the Reported Gini Index for the Historical Data
The accuracy of the different estimates was calculated on the inflation-adjusted historical data for the 1967-2013 time period. An extract of the findings, reported in the Web Appendix, is given in Table 3, where the average absolute difference (AAD) for all years is also given.

Discussion
For 2010, the Census Bureau estimates the Gini index as 0.469. Tillé and Langel reported an estimate of 0.4874 using four income groups and an assumed upper income bound of $500,000. This estimate is fairly close to that of the Census Bureau. However, the accuracy of the T&L method depends heavily on the assumed maximum income and the choice of groupings. Using an upper bound of $1,000,000, which is reasonable because the largest reported income exceeded that amount, the T&L estimate of the Gini index is 0.5448. Even using the full 42 groups available for the 2010 data, the T&L Gini estimate can vary considerably with the choice of upper bound. With the largest income assumed to be $500,000, the T&L estimate, 0.4639, of the Gini index is below the Gini lower bound, 0.4683, obtained using the group means. If the largest income is assumed to be $1,000,000, then the T&L estimate, 0.5015, of the Gini index is noticeably larger than the Gini upper bound, 0.4700, obtained using the group means. Although the incorporation of an exponential distribution in place of a uniform distribution in the final group clearly improves the Gini estimate from the T&L approach, in general the exponential tail does not ensure that the resulting estimates of the Gini index are greater than the lower bound based on group means. For the linear interpolation method, the conditions for Lemma 1 held for all intervals. The linear interpolation method yields accurate estimates even with only four groups (0.4705 using T&L's grouping). Table 2 shows that both the linear interpolation and split histogram methods converge to the Census Bureau's reported estimate of 0.469 very quickly as the number of groups increases. Indeed, they both are quite close to 0.469 when only the six groups given in Table 1 are used. The other methods using the group means yield similar results.
For the historical data, the AAD between each estimate and the published Gini index for the years 1967-2013 is about 0.001 for the methods that use the group means, while that of the original histogram-based method is 0.04. Incorporating an exponential tail into the histogram-based method reduces that to 0.008.

Current Census Grouping
Several authors (Gastwirth 1972; Mehran 1975; Aghevli and Mehran 1981; Davies and Shorrocks 1989) have used bounds on the Gini index to determine a grouping that achieves a desired degree of accuracy when estimating the Gini index from data grouped accordingly. For illustration, consider the current Census grouping, which uses 42 groups with cut points at $5000 increments from $5000 to $200,000 and at $250,000. This is a large number of groups, but the distribution of household income has a long right tail. The Census Bureau reports that the two highest income groups each contain more than 2% of households, while the groups just below them each contain less than 1% (U.S. Census Bureau 2011b). Moreover, the within-group mean differences \(\Delta^*_j\) of the highest income groups are large because the group intervals are much wider ($50,000 and open-ended) than the interval length ($5000) used to group incomes in most of the rest of the distribution. Consequently, the contribution of the uncertainty in the bounds of (5) on the overall mean difference and Gini index from the two highest groups will be much greater than that from any of the six groups in the $170,000 to $200,000 range.

Alternative Grouping
This insight into which groups contribute the most to the difference \(\bar D\) between the Gini bounds can be used to reduce \(\bar D\) while keeping the number of groups at 42. It is logical to split the highest income groups with large reported proportions \(\hat f_j\) and combine the groups just below them with smaller proportions \(\hat f_j\). The following changes to the current Census grouping were considered:
1. Combine the $180,000-$185,000 and $185,000-$190,000 groups
2. Combine the $190,000-$195,000 and $195,000-$200,000 groups
3. Split the $200,000-$250,000 group into $200,000-$225,000 and $225,000-$250,000
4. Split the $250,000+ group into $250,000-$400,000 and $400,000+
Public-use 2012 household income micro-data from the 2013 Current Population Survey are used to compare the groupings. The largest household income in the micro-data is more than $2 million, four times larger than the upper bound of $500,000 considered by T&L.
For the current and alternative groupings, \(\widehat{GL}\), \(\hat{\bar D}\), and \(\widehat{GU}\) are calculated. The following estimates for the group proportions and means are used:
\[ \hat f_j = \frac{\sum_{k \in s_j} w_k}{\sum_{i=1}^{J}\sum_{k \in s_i} w_k}, \qquad \bar x_j = \frac{\sum_{k \in s_j} w_k\,y_k}{\sum_{k \in s_j} w_k}, \]
where \(s_j\) is the sample of households belonging to group j and \(w_k\) and \(y_k\) are, respectively, the weight HSUP_WGT and household income HTOTVAL reported in the micro-data for household k. For the current Census grouping, some of these estimates \(\hat f_j\) and \(\bar x_j\) differ slightly from the reported estimates because of possible top-coding and other adjustments to the micro-data. Table 4 reports the Gini bounds for both groupings. The bounds for both groupings contain the Census Bureau's 2012 estimate of the Gini index, 0.477 (DeNavas-Walt, Proctor, and Smith 2013). The value \(\hat{\bar D}\) for the alternative grouping is one half the value for the current Census grouping, which shows the potential of using the Gini bounds to aid in the choice of intervals for summarizing the income data.
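The weighted group proportions and means used above can be sketched in Python as follows. The sketch assumes each micro-data record supplies a household income and a survey weight and that groups are closed on the left and open on the right; the function name `grouped_estimates` and the four toy records are hypothetical and do not come from the Current Population Survey files.

```python
import bisect


def grouped_estimates(incomes, weights, cuts):
    """Weighted group proportions and group means.

    incomes, weights : one entry per household (for example HTOTVAL and HSUP_WGT)
    cuts             : interior cut points a_1 < ... < a_{J-1}; group j is [a_{j-1}, a_j)
    """
    n_groups = len(cuts) + 1
    w_sum = [0.0] * n_groups
    wy_sum = [0.0] * n_groups
    for y, w in zip(incomes, weights):
        j = bisect.bisect_right(cuts, y)  # an income equal to a cut point goes to the upper group
        w_sum[j] += w
        wy_sum[j] += w * y
    total = sum(w_sum)
    f = [ws / total for ws in w_sum]
    # An empty group has no defined mean.
    means = [wy / ws if ws > 0 else float("nan") for wy, ws in zip(wy_sum, w_sum)]
    return f, means


# Toy records (income, weight) grouped at cut points 50 and 100.
f_hat, xbar_hat = grouped_estimates([10, 20, 60, 120], [1, 1, 2, 1], [50, 100])
```

The resulting proportions and means can then be passed to any routine that computes the Gini bounds for a candidate grouping, so that alternative sets of cut points can be compared on the same micro-data.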

SUPPLEMENTARY MATERIALS
The supplementary appendix contains details on estimating the Lorenz curve and Gini index using the linear interpolation method, the proof of Lemma 2 showing that the difference between the Gini bounds decreases as the number of groups increases, and a table comparing the estimates of the Gini index obtained by the methods examined in the paper with the reported Gini index for the historical data.