Talk on DataDeps.jl and other foundational tools for data-driven science, presented at JuliaCon 2018
Video
https://youtu.be/kSlQpzccRaI
Abstract
The focus of this talk is DataDeps.jl – BinDeps for Data – repeatable data setup for replicable data science: how to manage data dependencies so that any script can be executed by others. The secondary topic is what comes next: data ingestion, with a focus on NLP, though the ideas generalize.
Description
This talk
will cover the fundamental process of getting from a dataset on a
web server into data in your program. Almost all empirical research
work is data-driven. This is particularly true of any field that is
using machine learning. As such, setting up your data environment in a
repeatable and clean way is essential for producing replicable research.
Similarly, many packages have some requirement on data included to
function, for example WordNet.jl requires the WordNet database.
Deploying a package that uses an already-trained machine learning
model likewise requires downloading that model. This talk will primarily
focus on DataDeps.jl which allows for the automatic installation and
management of data dependencies. For researchers and package developers
DataDeps.jl solves 3 important issues:
Storage location: Where do I put it?
Should it be on the local disk (small) or the network file-store (slow)?
If I move it, I’m going to have to reconfigure things.
Can I keep it in git?
Redistribution: I didn’t create this data
Am I allowed to redistribute it?
How will I give credit, and ensure the users know who the original creator was?
Replication: How can I be sure that someone running my code has the same data?
What if they download the wrong data, or extract it incorrectly?
What if it gets corrupted or modified?
On top of this, fully automated data dependency setup makes end-to-end automated testing possible.
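To give a flavour of how this works in practice, here is a minimal sketch of registering a data dependency with DataDeps.jl; the dataset name, URL, checksum, and metadata below are made up purely for illustration:

```julia
using DataDeps  # provides register, DataDep, unpack, and the datadep"" string macro

# Hypothetical dataset: a name, a message shown to the user before download,
# the remote URL, and a checksum of the downloaded file (all illustrative values).
register(DataDep(
    "MyCorpus",
    """
    Dataset: MyCorpus (hypothetical example)
    Author: A. Researcher
    License: CC-BY 4.0
    Please cite the original creator if you use this data.
    """,
    "https://example.com/mycorpus.zip",
    "0000000000000000000000000000000000000000000000000000000000000000";  # placeholder checksum
    post_fetch_method = unpack,  # unpack the archive after download
))

# Elsewhere in the code, `datadep"MyCorpus"` resolves to the local path,
# downloading and verifying the data on first use.
```

The first time `datadep"MyCorpus"` is evaluated, the user is shown the message (including the licence and credit information), the file is fetched, its checksum is verified, and the local path is returned; subsequent uses just return the path.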
To achieve this, DataDeps.jl needs each data dependency to be declared.
This declaration requires information such as the name of the dataset,
its URLs, a checksum, and who to credit for its original creation.
I found myself copy-pasting that information from the websites. DataDepsGenerators.jl
is a package that can generate this declaration code given a link to a supported
webpage describing the dataset. This makes it easy to grab someone
else’s published data and depend upon it. DataDeps.jl will then resolve
that dependency to get the data onto your machine. Once you’ve got
the data onto your machine, the final stage is to load it up into a
structure Julia can work with. For tabular data, Julia has you well
covered with a number of packages like JuliaDB, DataFrames.jl and many
other supporting packages. MLDatasets.jl, which uses DataDeps.jl as a backend,
provides specialised methods for accessing various commonly used
machine learning datasets. CorpusLoaders.jl provides a similar service
for natural language corpora. Corpora differ in several ways
from other types of data.
They often require tokenisation to become usable, for which we use WordTokenizers.jl.
Tokenization increases memory use: to decrease this we use InternedStrings.jl, and we load corpora lazily via iterators.
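A small sketch of this stage of the pipeline, assuming WordTokenizers.jl's exported `tokenize` and InternedStrings.jl's `intern` (the example sentence is made up):

```julia
using WordTokenizers   # exports tokenize (and split_sentences, etc.)
using InternedStrings  # exports intern

sentence = "DataDeps.jl makes data setup repeatable."

# Tokenize the raw string into word/punctuation tokens.
tokens = tokenize(sentence)

# Intern each token: identical strings then share one object in memory,
# reducing memory use and making equality checks into pointer comparisons.
interned = map(intern, tokens)

# Interned copies of equal strings are identical objects (===), not merely ==.
@assert intern("the") === intern("the")
```

Interning pays off on corpora because natural language repeats the same tokens enormously often, so most tokens after interning are references to already-allocated strings.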
To handle the hierarchical structure (Document, Paragraph, Sentence,
Word) of these iterators we introduce MultiResolutionIterators.jl.
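The underlying idea can be illustrated with plain Julia iterators; this is a conceptual sketch of level-flattening, not the MultiResolutionIterators.jl API:

```julia
# A tiny corpus as nested arrays: document → sentences → words.
doc = [["Julia", "is", "fast", "."],
       ["It", "is", "fun", "."]]

# Iterating at the sentence level:
nsentences = length(doc)

# Flattening one level gives iteration at the word level:
words = collect(Iterators.flatten(doc))

# MultiResolutionIterators.jl generalizes this, letting you choose which
# level (document, paragraph, sentence, word) to iterate at without
# manually nesting or flattening loops.
```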
Julia is excellent for data-driven science, and this talk will help you
understand how you can handle your data in a more robust way.
Packages discussed in great detail:
DataDeps.jl: manages data dependencies. (Most of the talk is on this.)
DataDepsGenerators.jl: converts URLs pointing to webpages containing metadata into code for DataDeps.jl.
CorpusLoaders.jl: a data package building on roughly every other package mentioned here.
InternedStrings.jl: for decreasing memory usage and speeding up equality checks.
WordTokenizers.jl: a natural language tokenization and string-splitting package.
Packages mentioned:
MLDatasets.jl:
a package full of datasets, similar overall to CorpusLoaders.jl but
with some significant differences in philosophy and default assumptions.
WordNet.jl: the Julia interface to the WordNet lexical resource.
Embeddings.jl: pretrained word-embeddings, loaded using DataDeps.jl