Talk on DataDeps.jl and other foundational tools for data-driven science, presented at JuliaCon 2018
Video
https://youtu.be/kSlQpzccRaI
Abstract
The focus of this talk is DataDeps.jl – BinDeps for Data – repeatable data setup for replicable data science: how to manage data dependencies so that any script can be executed by others. The secondary topic is what comes next: data ingestion, with a focus on NLP, though the ideas generalize.
Description
This talk
will cover the fundamental process of getting from a dataset on a
web server into data in your program. Almost all empirical research
work is data-driven. This is particularly true of any field that is
using machine learning. As such, setting up your data environment in a
repeatable and clean way is essential for producing replicable research.
Similarly, many packages have some requirement on data included to
function, for example WordNet.jl requires the WordNet database.
Deploying a package that uses an already-trained machine learning
model likewise requires downloading that model. This talk will primarily
focus on DataDeps.jl which allows for the automatic installation and
management of data dependencies. For researchers and package developers
DataDeps.jl solves 3 important issues:
Storage location: Where do I put it?
Should it be on the local disk (small) or the network file-store (slow)?
If I move it, I’m going to have to reconfigure things.
Can I keep it in git?
Redistribution: I didn’t create this data
Am I allowed to redistribute it?
How will I give credit, and ensure the users know who the original creator was?
Replication: How can I be sure that someone running my code has the same data?
What if they download the wrong data, or extract it incorrectly?
What if it gets corrupted or modified?
On top of this, fully automated data dependency setup makes end-to-end automated testing possible.
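To give a flavour of how this works in practice, here is a minimal sketch of registering a data dependency with DataDeps.jl; the dataset name, URL, checksum, and metadata below are made up purely for illustration:

```julia
using DataDeps  # provides register, DataDep, unpack, and the datadep"" string macro

# Hypothetical dataset: a name, a message shown to the user before download,
# the remote URL, and a checksum of the downloaded file (all illustrative values).
register(DataDep(
    "MyCorpus",
    """
    Dataset: MyCorpus (hypothetical example)
    Author: A. Researcher
    License: CC-BY 4.0
    Please cite the original creator if you use this data.
    """,
    "https://example.com/mycorpus.zip",
    "0000000000000000000000000000000000000000000000000000000000000000";  # placeholder checksum
    post_fetch_method = unpack,  # unpack the archive after download
))

# Elsewhere in the code, `datadep"MyCorpus"` resolves to the local path,
# downloading and verifying the data on first use.
```

The first time `datadep"MyCorpus"` is evaluated, the user is shown the message (including the licence and credit information), the file is fetched, its checksum is verified, and the local path is returned; subsequent uses just return the path.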
To achieve this, DataDeps.jl needs each data dependency to be declared.
This declaration requires information such as the name of the dataset,
its URLs, a checksum, and who to credit for its original creation.
I found myself copy-pasting that information from the websites. DataDepsGenerators.jl
is a package that can generate this declaration code given a link to a supported
webpage describing the dataset. This makes it easy to grab someone
else’s published data and depend upon it. DataDeps.jl will then resolve
that dependency to get the data onto your machine. Once you’ve got
the data onto your machine, the final stage is to load it up into a
structure Julia can work with. For tabular data, Julia has you well
covered with a number of packages like JuliaDB, DataFrames.jl and many
other supporting packages. MLDatasets.jl, which uses DataDeps.jl as a backend,
provides specialised methods for accessing various commonly used
machine learning datasets. CorpusLoaders.jl provides a similar service
for natural language corpora. Corpora differ in several ways
from other types of data.
They often require tokenisation to become usable, for which we use WordTokenizers.jl.
Tokenization increases memory use: to decrease this we use InternedStrings.jl, and we load corpora lazily via iterators.
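A small sketch of this stage of the pipeline, assuming WordTokenizers.jl's exported `tokenize` and InternedStrings.jl's `intern` (the example sentence is made up):

```julia
using WordTokenizers   # exports tokenize (and split_sentences, etc.)
using InternedStrings  # exports intern

sentence = "DataDeps.jl makes data setup repeatable."

# Tokenize the raw string into word/punctuation tokens.
tokens = tokenize(sentence)

# Intern each token: identical strings then share one object in memory,
# reducing memory use and making equality checks into pointer comparisons.
interned = map(intern, tokens)

# Interned copies of equal strings are identical objects (===), not merely ==.
@assert intern("the") === intern("the")
```

Interning pays off on corpora because natural language repeats the same tokens enormously often, so most tokens after interning are references to already-allocated strings.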
To handle the hierarchical structure (Document, Paragraph, Sentence,
Word) of these iterators we introduce MultiResolutionIterators.jl.
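The underlying idea can be illustrated with plain Julia iterators; this is a conceptual sketch of level-flattening, not the MultiResolutionIterators.jl API:

```julia
# A tiny corpus as nested arrays: document → sentences → words.
doc = [["Julia", "is", "fast", "."],
       ["It", "is", "fun", "."]]

# Iterating at the sentence level:
nsentences = length(doc)

# Flattening one level gives iteration at the word level:
words = collect(Iterators.flatten(doc))

# MultiResolutionIterators.jl generalizes this, letting you choose which
# level (document, paragraph, sentence, word) to iterate at without
# manually nesting or flattening loops.
```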
Julia is excellent for data-driven science, and this talk will help you
understand how you can handle your data in a more robust way.
Packages discussed in great detail:
DataDeps.jl: manages data dependencies. (Most of the talk is on this.)
DataDepsGenerators.jl: converts URLs pointing to webpages containing metadata into code for DataDeps.jl.
CorpusLoaders.jl: a data package building on roughly every other package mentioned here.
InternedStrings.jl: for decreasing memory usage and speeding up equality checks.
WordTokenizers.jl: a natural language tokenization and string-splitting package.
Packages mentioned:
MLDatasets.jl:
a package full of datasets, similar overall to CorpusLoaders.jl but
with some significant differences in philosophy and default assumptions.
WordNet.jl: the Julia interface to the WordNet lexical resource.
Embeddings.jl: pretrained word-embeddings, loaded using DataDeps.jl