Portable Scalable Dask Compute on HPC with Singularity

Mr Benjamin Leighton1, Claire Trenham, Dr Ron Hoeke, Dr Paul Branson, Dr Blake Seers, Dougie Squire

1CSIRO

The Python ecosystem provides a rich, evolving resource for data science at scale. A baseline, consistent, and readily shareable Python environment helps teams collaborate, but maintaining and distributing such an environment is challenging. In HPC (High Performance Compute) environments these issues are compounded by limited storage and permissions, and by the desire to make Python work with HPC job runners to facilitate massively parallel compute.

Conda environments are slow to create, and established environments drift apart across users, creating incompatibilities and confusion. We describe a bootstrapped approach using Singularity to establish a baseline Python environment that contains useful common Python packages for climate and ocean science, allows users to install additional custom packages, includes HPC-adapted Dask so that users can readily spin up clusters, and optionally runs parameterized, scripted Jupyter notebooks unattended.
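As a minimal sketch of how Dask workers might be launched inside the container, the following assumes a SLURM scheduler and the dask-jobqueue package; neither is named in this abstract, and the image path, bind paths, and resource sizes are hypothetical. The idea is to point the cluster's `python` setting at a `singularity exec` command so that every worker job starts from the same image.

```python
import shlex


def worker_python(image, binds=()):
    """Build a command string that runs `python` inside a Singularity image."""
    cmd = ["singularity", "exec"]
    for bind in binds:
        # Make a host directory (e.g. scratch storage) visible in the container.
        cmd += ["--bind", bind]
    cmd += [image, "python"]
    return " ".join(shlex.quote(part) for part in cmd)


def cluster_kwargs(image, binds=(), cores=4, memory="16GB", walltime="01:00:00"):
    """Keyword arguments for a dask-jobqueue cluster (hypothetical defaults)."""
    return {
        "cores": cores,
        "memory": memory,
        "walltime": walltime,
        # Workers are started with this command instead of the host Python.
        "python": worker_python(image, binds),
    }


def start_cluster(n_jobs, **kwargs):
    """Submit worker jobs and return a connected client.

    Assumption: SLURM is the scheduler. The imports are deferred so the
    helpers above can be used without dask-jobqueue installed.
    """
    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    cluster = SLURMCluster(**kwargs)
    cluster.scale(jobs=n_jobs)
    return Client(cluster)
```

On a system with a different scheduler, the same keyword arguments would likely carry over to the corresponding dask-jobqueue cluster class, though queue and account settings are site specific.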

Singularity images can be deployed quickly by many users to provide a consistent starting environment. Scripts are provided that simplify starting Singularity containers. Included custom Jupyter notebooks act as templates, providing working default Dask clusters and example code. We discuss the architecture of the environment and its application to processing large ocean datasets.
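To illustrate what a launch script for an unattended, parameterized notebook run could look like, here is a hedged sketch. It assumes papermill is the notebook runner installed in the image, which this abstract does not state; the image name, notebook names, and the parameters passed are hypothetical.

```python
import shlex
import subprocess


def notebook_command(image, notebook, output, params, binds=()):
    """Build the argument list for running a notebook inside a container."""
    cmd = ["singularity", "exec"]
    for bind in binds:
        # Expose host directories holding data or notebooks to the container.
        cmd += ["--bind", bind]
    # Assumption: papermill is available inside the image.
    cmd += [image, "papermill", notebook, output]
    for name, value in params.items():
        # Each -p pair overrides a value in the notebook's parameters cell.
        cmd += ["-p", name, str(value)]
    return cmd


def run_notebook(image, notebook, output, params, binds=()):
    """Execute the notebook and raise if the run fails."""
    cmd = notebook_command(image, notebook, output, params, binds)
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)
```

A call such as `run_notebook("env.sif", "template.ipynb", "out.ipynb", {"year": 2020}, binds=["/scratch"])` could be placed in a batch job script so that the template notebook executes without an interactive session.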


Biography:

Ben is a CSIRO data engineer and data scientist working across a bunch of science domains, data, and compute infrastructures. He designs and builds solutions on both cloud and on-premises HPC systems and regularly finds himself struggling with large geospatial and temporal datasets, workflows, provenance, and scientific models.
