PyData DataDepsGenerators.jl
Sebastin Santy
10.6084/m9.figshare.7227875.v1
https://figshare.com/articles/journal_contribution/PyData_DataDepsGenerators_jl/7227875
<div><p>Every data scientist dreams of having the data they need
at hand without hassle. This includes data for
a new set of experiments or data needed in order to reproduce an existing
result. Vandewalle et al. (2009) distinguish six degrees of
reproducibility for scientific code. Achieving either of the two highest
levels requires that “The results can be easily reproduced by an
independent researcher with at most 15 min of user effort”. It is our
experience that one can often expend much of that time just on setting
up the data. This involves reading the instructions, locating the
download link, transferring it to the right location, extracting an
archive, and identifying how to inform the script as to where the data
is located. These tasks can be automated and therefore should be,
both to save user time and to remove the opportunity for mistakes, as per the
key practice identified by Wilson et al. (2014): “let the computer do the
work”.</p><p><br></p>
<p>DataDeps.jl is a library for the Julia programming language that
addresses exactly this problem. It uses a registration block, a chunk
of Julia code that describes where the data can be downloaded, who
created it, what the terms and conditions for its use are, and so on. The URLs
declared in these blocks are used to download the data required to
run the experiment. Writing a registration block by hand can be
tedious, so a support package, DataDepsGenerators.jl, generates
them for the most popular data
repositories.</p><p><br></p>
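<p>As a rough illustration of what such a registration block can look
like, here is a minimal sketch using DataDeps.jl's <code>register</code>
and <code>DataDep</code>. The dataset name, author, license, and URLs are
placeholders invented for this example and do not refer to a real dataset.
The optional checksum argument is left out for brevity, so DataDeps.jl
would warn about it on first download.</p>
<pre><code class="language-julia">using DataDeps

# Hypothetical dataset: every name and URL below is a placeholder.
register(DataDep(
    # Name used to refer to the data from code.
    "ExampleDataset",
    # Message shown to the user before downloading: source, author, terms.
    """
    Dataset: Example Dataset (placeholder)
    Author: A. Researcher (placeholder)
    License: CC-BY 4.0 (placeholder)
    Website: https://example.org/data
    """,
    # Remote location the data is fetched from.
    "https://example.org/data/example.csv"
))

# Resolving the path triggers the download on first use. It is commented
# out here because the placeholder URL above does not exist:
# path = datadep"ExampleDataset"
</code></pre>
<p>Once a block like this is registered, a script only needs the name of
the dataset and not its location on disk.</p><p><br></p>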
<p>At present, DataDepsGenerators.jl supports the UCI ML, GitHub and DataOne
repositories, which together host a large number of datasets. UCI ML
provides around 436 commonly used datasets, while DataOne provides about 800k
datasets with over 46 TB of content. Future work will add
support for many other data repositories (such as CKAN, OAI-PMH and DataCite
DOIs), eventually placing almost all of the world's open research data
at your fingertips.</p></div>
2018-10-19 04:40:27
data management systems
Knowledge Representation and Machine Learning