Hands-on Tutorial: Deploying Kubernetes and JupyterHub on Jetstream
Dataset posted on 26.09.2018 by Andrea Zonca
Jupyter Notebooks have become a mainstream tool for interactive computing in every field of science. Being a web-based platform, they are naturally suited to serve as a companion application to Science Gateways: a scientist can use a Jupyter Notebook in their browser to pre-process inputs, launch a job on the Science Gateway via web API, and then access, analyze, plot, and post-process the job outputs, without ever worrying about setting up their software environment or keeping it up to date.
The JupyterHub project provides a multi-user platform for Jupyter Notebooks, and it is very easy to install and configure on a single server. However, to provide computational resources to a large pool of users, we need to distribute those users across a cluster of machines; the best way to achieve this scalability is the container orchestration platform Kubernetes.
In this tutorial we will work through the installation of Kubernetes on a set of Jetstream Virtual Machines, set up persistent storage, and install a bare-bones JupyterHub deployment using the zero-to-jupyterhub recipe provided by the Jupyter team.
Then we will customize the setup by configuring authentication (XSEDE, Globus, or GitHub) and choosing our preferred software environment for the users via Docker.
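As a taste of what this customization involves, here is a hedged sketch of the equivalent settings in a plain `jupyterhub_config.py` (the zero-to-jupyterhub recipe covered in the tutorial expresses the same choices in a Helm `config.yaml` instead; the hostname, credentials, and Docker image below are placeholder assumptions):

```python
# jupyterhub_config.py -- illustrative sketch, not the tutorial's actual file.
# `get_config()` is injected by JupyterHub when it loads this config file.
c = get_config()

# Authenticate users with GitHub OAuth via the oauthenticator package.
c.JupyterHub.authenticator_class = 'oauthenticator.github.GitHubOAuthenticator'
c.GitHubOAuthenticator.oauth_callback_url = 'https://hub.example.org/hub/oauth_callback'  # placeholder
c.GitHubOAuthenticator.client_id = 'YOUR_GITHUB_CLIENT_ID'          # placeholder
c.GitHubOAuthenticator.client_secret = 'YOUR_GITHUB_CLIENT_SECRET'  # placeholder

# Spawn each user's server from a Docker image so everyone gets the
# same pre-built software environment.
c.JupyterHub.spawner_class = 'dockerspawner.DockerSpawner'
c.DockerSpawner.image = 'jupyter/scipy-notebook'  # example image choice
```

On Kubernetes the spawner is handled by the zero-to-jupyterhub Helm chart rather than DockerSpawner, but the idea is the same: the authenticator decides who may log in, and the image decides what software each user sees.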
Finally, we will show how to execute computational jobs, either by interfacing with the web APIs of a "test gateway" to submit jobs, or by launching a pool of workers on Kubernetes and executing a distributed computation (using dask).
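The distributed computation follows a scatter/gather pattern: map a function over many inputs, let the workers run in parallel, then collect the results. `dask.distributed`'s `Client` exposes this through `client.map` and `client.gather` against workers launched on Kubernetes; the sketch below uses the standard library's executor instead so it runs anywhere without a cluster, with the function and inputs being toy stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(seed):
    """Toy stand-in for one unit of scientific work."""
    return seed ** 2

# Scatter the tasks across a pool of workers, then gather the
# outputs; map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(simulate, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

With dask the pool of workers lives in Kubernetes pods instead of local threads, so the same pattern scales out across the Jetstream cluster.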