Machine Learning eResearch Platform (MLeRP) – How serving a subcommunity creates different design decisions

Mr. Mitchell Hargreaves1, Doctor Chris Hines1, Doctor Slava Kitaeff1, Miss Kiowa Scott-Hurley2, Doctor Oliver Cairncross3

1Monash University, Melbourne, Australia, 2Defence Science Technology Group, Melbourne, Australia, 3University of Queensland, Brisbane, Australia

Biography:

Mitchell Hargreaves is experienced with training and deploying deep learning models as well as building data engineering pipelines. He is passionate about reducing barriers of entry and making these tools more available for all. He is the primary developer and system administrator for MLeRP.

Abstract:

Machine Learning (ML) development involves code development and debugging, model training, labelling, and inference. Batch job submission systems like those on HPC clusters can serve users' needs well if the problem is large and well-defined. The development and debugging phase, however, requires interactivity, which drives users to Jupyter notebooks. Providing such interactivity on a large system tends to lead to poor GPU utilisation and limited resource availability.

The Machine Learning eResearch Platform (MLeRP) was developed to address this niche by offering a premium Jupyter notebook experience backed by the power of a GPU cluster and persistent storage, while giving users full control over their software environment. MLeRP was designed to take advantage of the Dask framework, allowing users to decouple a Jupyter notebook from a GPU in the cluster rather than only attaching a GPU to the notebook as a reservation. This means that we can create more user seats and serve more users than would otherwise be possible. This flexibility also lets users offload work to differently sized GPU slices, or even to CPU clusters, from within the same notebook for different stages of their pipeline. The MLeRP service also supports batch job submission via a web portal application for users who are ready to train their models and to transition their code to a more classical GPU HPC environment.

We will discuss how users have used MLeRP as a stepping stone: the platform teaches them cluster concepts when they are ready and prepares them for when they eventually outgrow it and scale to HPC.

 
