Sufficiency Revisited: Rethinking Statistical Algorithms in the Big Data Era

Lee, Jarod Y. L.; Brown, James J.; Ryan, Louise M.

doi:10.6084/m9.figshare.4445795.v2

utas_a_1255659_sm1566.pdf (221.45 kB)

Version 2 2017-03-09, 18:20

Version 1 2016-12-15, 18:25

journal contribution

posted on 2017-03-09, 18:20 authored by Jarod Y. L. Lee, James J. Brown, Louise M. Ryan

The big data era demands new statistical analysis paradigms, since traditional methods often break down when datasets are too large to fit on a single desktop computer. Divide and Recombine (D&R) is becoming a popular approach for big data analysis, where results are combined over subanalyses performed in separate data subsets. In this article, we consider situations where unit record data cannot be made available by data custodians due to privacy concerns, and explore the concept of statistical sufficiency and summary statistics for model fitting. The resulting approach represents a type of D&R strategy, which we refer to as summary statistics D&R, as opposed to the standard approach, which we refer to as horizontal D&R. We demonstrate the concept via an extended Gamma–Poisson model, where summary statistics are extracted from different databases and incorporated directly into the fitting algorithm without having to combine unit record data. By exploiting the natural hierarchy of data, our approach has major benefits in terms of privacy protection. Incorporating the proposed modelling framework into data extraction tools such as TableBuilder by the Australian Bureau of Statistics allows for potential analysis at a finer geographical level, which we illustrate with a multilevel analysis of Australian unemployment data. Supplementary materials for this article are available online.
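
The idea of fitting a model from summary statistics alone can be illustrated with a minimal sketch. It is not the extended Gamma–Poisson model of the article; it assumes the simplest conjugate case, with counts y_i ~ Poisson(lambda * e_i) and a Gamma(shape, rate) prior on lambda. The function names, prior values, and simulated records are hypothetical. Under these assumptions each data custodian only needs to release the total count and total exposure for its subset, and the combined fit matches the one obtained from pooled unit records.

```python
import random


def summarize(counts, exposures):
    """Sufficient statistics for a Poisson rate: (total count, total exposure)."""
    return sum(counts), sum(exposures)


def combine(summaries):
    """Add up per-subset summaries; no unit records are needed at this step."""
    total_y = sum(s[0] for s in summaries)
    total_e = sum(s[1] for s in summaries)
    return total_y, total_e


def gamma_posterior(summary, prior_shape=1.0, prior_rate=1.0):
    """Conjugate update: Gamma(shape + sum y, rate + sum e). Prior values are assumed."""
    total_y, total_e = summary
    return prior_shape + total_y, prior_rate + total_e


# Hypothetical unit records (integer exposures keep the arithmetic exact).
random.seed(0)
counts = [random.randint(0, 20) for _ in range(1000)]
exposures = [random.randint(50, 150) for _ in range(1000)]

# Split the records across four hypothetical custodians.
blocks = [(counts[i::4], exposures[i::4]) for i in range(4)]
summaries = [summarize(y, e) for y, e in blocks]

post_split = gamma_posterior(combine(summaries))
post_pooled = gamma_posterior(summarize(counts, exposures))

# Posterior mean of the rate is shape / rate.
mean_split = post_split[0] / post_split[1]
mean_pooled = post_pooled[0] / post_pooled[1]
print(post_split == post_pooled, mean_split, mean_pooled)
```

Because the total count and total exposure are sufficient for lambda in this simple model, nothing is lost by sharing only those two numbers per subset. Models with more structure generally need richer summaries.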