A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data

journal contribution
posted on 2018-08-20, 18:19 authored by Benjamin S. Baumer

Many interesting datasets available on the Internet are of a medium size—too big to fit into a personal computer’s memory, but not so large that they would not fit comfortably on its hard disk. In the coming years, datasets of this magnitude will inform vital research in a wide array of application domains. However, due to a variety of constraints they are cumbersome to ingest, wrangle, analyze, and share in a reproducible fashion. These obstructions hamper thorough peer-review and thus disrupt the forward progress of science. We propose a predictable and pipeable framework for R (the state-of-the-art statistical computing environment) that leverages SQL (the venerable database architecture and query language) to make reproducible research on medium data a painless reality. Supplementary material for this article is available online.

Journal of Computational and Graphical Statistics
